ISSN 1054-6618, Pattern Recognition and Image Analysis, 2012, Vol. 22, No. 2, pp. 373–379. © Pleiades Publishing, Ltd., 2012.

REPRESENTATION, PROCESSING, ANALYSIS, AND UNDERSTANDING OF IMAGES

Hierarchical Cluster Algorithm for Remote Sensing Data of Earth

V. S. Sidorova

Institute of Computational Mathematics and Mathematical Geophysics, Siberian Branch of the Russian Academy of Sciences, pr. Lavrent’eva 6, Novosibirsk, 630090 Russia
e-mail: [email protected]

Abstract—A new automatic, hierarchical, multidimensional, histogram-based clusterization algorithm is considered. A method for choosing the clusterization detailedness in different regions of the vector space of spectral features depending on the average separability of clusters is proposed. The algorithm is applied for the automatic classification of multispectral satellite data in recognizing the land cover.

Keywords: remote sensing, unsupervised classification, multidimensional histogram, separability of clusters, hierarchic algorithm.

DOI: 10.1134/S1054661812020149

Received April 4, 2011

1. INTRODUCTION

The large volume of multispectral data requires high-speed algorithms and economical use of RAM. The most widespread cluster algorithms used in remote sensing of the Earth's surface are K-means and histogram-based algorithms [1]. The clusterization of multidimensional data can serve as a basis for preliminary analysis before applying supervised classification methods that recognize objects from specified samples. However, if the clusters are well separated in the vector space of features, one can suggest another recognition approach, i.e., unsupervised classification [1–3]. This approach is of special interest for analyzing multispectral satellite data when it is difficult or impossible to obtain training samples. From this point of view, the Narendra histogram algorithm [4] is attractive. This algorithm is quick and nonparametric, and it separates the vector space of features into unimodal clusters, whose modal vectors and boundaries correspond to local maximums and valleys of the histogram, respectively. The multidimensional histogram is considered to be an approximation of the probability density of the feature vectors.

To construct a multidimensional histogram, an algorithm based on the interchange of two tables using hashing and sorting of the data was proposed [5]. This algorithm provides quick access to the data without requiring much memory.
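For illustration only, the idea of accumulating a multidimensional histogram over the distinct attribute vectors can be sketched in Python as follows. This is a naive hash-and-sort sketch, not the two-table algorithm of [5]; the array names and shapes are assumptions.

```python
import numpy as np

def multidimensional_histogram(pixels):
    """Count occurrences of every distinct attribute vector.

    pixels : integer array of shape (N, L), one L-dimensional
             attribute vector per image pixel (layout is assumed).
    Returns the distinct vectors in lexicographic (dictionary) order
    and the corresponding histogram counts.
    """
    counts = {}                                   # hash table: vector tuple -> count
    for vec in map(tuple, pixels):
        counts[vec] = counts.get(vec, 0) + 1
    # linearly ordered list: ascending by the first component,
    # then by the second component, and so on
    vectors = np.array(sorted(counts), dtype=int)
    hist = np.array([counts[tuple(v)] for v in vectors], dtype=int)
    return vectors, hist

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(1000, 2))      # toy two-channel image
vecs, h = multidimensional_histogram(image)
print(vecs.shape, int(h.sum()))                   # h.sum() equals the number of pixels
```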

To reduce the volume of data and the number of obtained clusters in the Narendra algorithm, the discrete vector space is preliminarily quantized. If the structure of the data is complex, the clusterization results depend significantly on the choice of the number of quantization levels. The reliability of the quantization depends on the good separability of the obtained clusters. However, cluster methods by themselves do not ensure the separability of clusters; the results of clusterization must be followed by validation [6].

In [7], it was proposed to choose the number of quantization levels of the vector space automatically based on an estimate of the separability of the obtained clusters. However, the data can be nonuniform, and the structure and separability of the clusters differ in different domains and subdomains of the vector space. In this work, we develop this approach using the estimate of cluster separability during the clusterization of spectral features [7]. The new hierarchical divisive algorithm chooses the numbers of quantization levels automatically for different regions of the data depending on their cluster separability. The algorithm finds the subdomains using a hierarchical search for the sequence of distributions that are best in separability. The parameter of clusterization detail is the number of quantization levels of the vector space, which we denote by n.

2. SEPARABILITY MEASURES

In [7], the measure of separability m_j(n) for a unimodal cluster and the measure of quality m(n) of the whole distribution into K(n) clusters were defined as follows:

$$m_j(n) = \frac{1}{B_j(n)\,H_j(n)} \sum_{i=1}^{B_j(n)} h_i^j(n), \qquad (1)$$

$$m(n) = \frac{1}{K(n)} \sum_{j=1}^{K(n)} m_j(n), \qquad (2)$$

where h_i^j(n) is the value of the histogram at the ith point of the boundary of cluster j, B_j(n) is the number of points of the cluster boundary, and H_j(n) is the maximal value of the histogram of the cluster.

Measure (1) is defined as the ratio of the average value of the histogram at the boundary to the maximal value of the cluster histogram. Measure (2) is the average separability of the clusters. The minimums of (2) correspond to the best classifications upon variation of n; it always holds that m_j(n) ≤ 1 and m(n) ≤ 1. Note that the boundaries are not common for neighboring clusters because the Narendra algorithm performs hard clustering; i.e., the clusters have no common points. The boundary in formula (1) is defined for each cluster as the set of points each of which has, in the list of neighbors of vectors, at least one neighboring point that does not belong to the same cluster.
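As a minimal illustration, measures (1) and (2) can be computed directly once the histogram values on each cluster boundary and each cluster's peak value are known; the list-based representation used below is an assumption made only for the sketch.

```python
def cluster_separability(boundary_values, peak_value):
    """Measure (1): mean histogram value on the cluster boundary divided
    by the maximal histogram value of the cluster (0 = perfectly separated)."""
    b = len(boundary_values)                  # B_j(n), number of boundary points
    if b == 0:                                # completely isolated cluster (assumed convention)
        return 0.0
    return sum(boundary_values) / (b * peak_value)

def average_separability(clusters):
    """Measure (2): average of measure (1) over all K(n) clusters.

    clusters : list of (boundary_values, peak_value) pairs.
    """
    return sum(cluster_separability(bv, pk) for bv, pk in clusters) / len(clusters)

# toy example: a well separated cluster and a poorly separated one
clusters = [([2, 2], 40), ([6, 6, 6], 12)]
print(average_separability(clusters))         # mean of the two per-cluster values
```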

For closely positioned unimodal clusters, measure (2) satisfies the requirements imposed on classification quality measures in problems of cluster reliability [7]. The nature of the classified objects, i.e., types of the Earth's surfaces, is such that the overwhelming majority of spectral attributes forms closely spaced clusters. According to our studies, with an increase in the number of quantization levels, the average separability of clusters tends to decrease and the value of measure (2) tends to increase. Clusters that have no neighbors at the boundary are rare. The considered measures are described in more detail in [7, 8].

3. CHOICE OF NUMBER OF QUANTIZATION LEVELS INSIDE A CLUSTER

At every stage of the hierarchy, the best distribution into unimodal clusters is sought by the following algorithm [7], which uses the detail parameter, i.e., the number n of quantization levels of the vector space. Let the initial number of levels be n0 = 256 and n < n0. Then the size of a cell of the vector space is k_f = (n0 - 1)/(n - 1). If L is the number of attributes and f = [f(1), f(2), …, f(L)] is an attribute vector, then we specify a new vector g = [g(1), g(2), …, g(L)] to which f is transformed as follows:

$$g(k) = \left[ \frac{f(k)}{k_f} \right], \qquad k = 1, \ldots, L,$$

where [·] denotes the integer part of a number.

The main algorithm finds the n (and the corresponding g) at which the distribution of the vectors (obtained by the Narendra method) into unimodal clusters yields the minimum of the cluster separability measure in the variation range 256 > n > n1.
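A minimal sketch of this quantization step, assuming the attribute vectors are stored as a NumPy integer array with components in [0, n0 - 1]; the integer form used below is mathematically equal to [f(k)/k_f] and avoids floating-point rounding at the upper boundary.

```python
import numpy as np

def quantize(f, n, n0=256):
    """Map attribute vectors f onto a coarser grid of n levels per axis.

    f : integer array of shape (N, L) with components in [0, n0 - 1].
    g(k) = [f(k) / k_f] with k_f = (n0 - 1) / (n - 1), computed exactly
    with integer arithmetic as [f(k) * (n - 1) / (n0 - 1)].
    """
    return (f * (n - 1)) // (n0 - 1)

f = np.array([[0, 64, 128, 255]])
print(quantize(f, 8))    # -> [[0 1 3 7]]
```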

Using graphs, the Narendra algorithm divides the vector space of features into unimodal clusters, the modal vectors of which correspond to local maximums of the histogram. At every point of the discrete vector space, the direction of the elementary graph edge coincides with the maximal gradient of the histogram and is determined by the vicinity of the nearest neighbors of the point. The neighbors are the vectors each component of which differs from the corresponding component of this point by no more than unity. The list of neighbors of all vectors can be calculated beforehand. When constructing the histogram, the search for neighbors is a fast procedure because the multidimensional attribute vectors that form the data are stored in a linearly ordered list (as in a dictionary: first in ascending order of the first component of the vector, then of the second component, and so on). Tracing the elementary graphs to seek the maximums of the histogram and to classify the vectors depends linearly on the number of distinct vectors; therefore, this is also a fast procedure. In addition, the graphs are constructed using only scalar values of the histogram.
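The following sketch illustrates the graph construction just described: each distinct vector points to the neighbor with the largest histogram value, and chains of such links are followed up to the local maximums, which label the clusters. It illustrates the idea rather than reproducing the implementation of [4]; the (vectors, hist) layout is the one assumed in the histogram sketch above, and tie handling is simplified.

```python
import numpy as np
from itertools import product

def narendra_clusters(vectors, hist):
    """Assign each distinct vector to the local maximum reached by
    steepest-ascent links over the discrete histogram.

    vectors : (M, L) integer array of distinct quantized vectors.
    hist    : (M,) histogram counts for those vectors.
    Returns an array of labels (index of the modal vector of each cluster).
    """
    index = {tuple(v): i for i, v in enumerate(vectors)}   # hash for fast neighbor lookup
    L = vectors.shape[1]
    offsets = [d for d in product((-1, 0, 1), repeat=L) if any(d)]

    # each vector points to the neighbor (within +/-1 per component) with the
    # largest histogram value; a vector with no higher neighbor is a mode
    parent = np.arange(len(vectors))
    for i, v in enumerate(vectors):
        best, best_h = i, hist[i]
        for d in offsets:
            j = index.get(tuple(v + np.array(d)))
            if j is not None and hist[j] > best_h:
                best, best_h = j, hist[j]
        parent[i] = best

    # follow the elementary graphs up to the modes
    labels = parent.copy()
    for i in range(len(vectors)):
        while labels[i] != labels[labels[i]]:
            labels[i] = labels[labels[i]]
    return labels

vecs = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [3, 3], [3, 4]])
h = np.array([5, 2, 3, 1, 6, 2])
print(narendra_clusters(vecs, h))   # -> [0 0 0 0 4 4], two modes at vectors 0 and 4
```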

4. HIERARCHICAL ALGORITHM

The hierarchical algorithm first finds the number of quantization levels at which a new system of vectors, united by quantization, is obtained such that the unimodal clusters of the system are separated in the best way by varying n from 0 to n1. Then, inside each obtained cluster, the initial value of n is increased by unity compared to the previous stage of the hierarchy. In the new range, the main algorithm again finds its best cluster distribution and the new number of levels, and so on. The parent and daughter clusters are numbered continuously, and the algorithm keeps the value of measure (1) and the number of the parent cluster for each new cluster.
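The control flow of this divisive stage can be summarized by the following Python sketch. Here `best_unimodal_partition` stands for the main algorithm of Section 3 (quantize with n levels, cluster by the Narendra method, and minimize measure (2)); its name and signature, the node dictionary layout, and the `max_depth` guard are assumptions made only for illustration.

```python
def hierarchical_divide(data, n_start, n_max, best_unimodal_partition, max_depth=10):
    """Divisive stage (sketch): find the best partition of this portion of the
    data over n = n_start .. n_max, then subdivide every resulting cluster
    with the starting number of levels increased by one, and so on.

    best_unimodal_partition(data, n_lo, n_hi) is assumed to return
    (n_best, [(cluster_data, m_j), ...]) with m_j given by measure (1).
    Returns a tree of nodes carrying n_j, measure (1), and the data.
    """
    n_best, clusters = best_unimodal_partition(data, n_start, n_max)
    nodes = []
    for cluster_data, m_j in clusters:
        node = {"n": n_best, "m": m_j, "data": cluster_data, "children": []}
        if max_depth > 1 and n_best + 1 <= n_max:
            node["children"] = hierarchical_divide(
                cluster_data, n_best + 1, n_max,
                best_unimodal_partition, max_depth - 1)
        nodes.append(node)
    return nodes
```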

The separability measure (1) for the isolation of an individual cluster does not depend on the other clusters; therefore, for the hierarchical algorithm we still take the average separability M_K as the measure of separability of the overall distribution into K clusters:

$$M_K = \frac{1}{K} \sum_{j=1}^{K} m_j(n_j). \qquad (3)$$

However, now n_j is the number of quantization levels obtained when constructing cluster j.

The final aim of the proposed hierarchical algorithm can be the choice of the general distribution at which the separability of the collection of all obtained clusters is best, i.e., M_K reaches its minimum. To this end, it is proposed to agglomerate clusters by returning a definite part of the data to the previous stage of the hierarchy. When n reaches or exceeds n1 in the partition of all clusters, the choice of the hierarchy stage begins for each cluster with the goal of reducing (3) for their total collection. Beginning from the last stage of the hierarchy and proceeding in reverse order, the average separability measure (3) of the group of clusters obtained by partitioning a parent cluster of the previous hierarchy stage is sequentially compared with the separability measure (1) of the parent cluster itself. The data are grouped into the parent or the daughter clusters, whichever give the lesser value of the separability measure.
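In terms of a cluster tree in which every node stores its measure (1), this backward pass reduces to one comparison per parent node. The sketch below only illustrates that rule under assumed data structures (dictionaries with "m" and "children" fields), not the implementation used in the paper.

```python
def agglomerate(node):
    """Backward pass (sketch): keep the subdivision of a node only if the
    average separability (3) of its retained descendants is smaller than
    the node's own measure (1); otherwise return to the parent cluster."""
    if not node["children"]:
        return [node]
    kept = []
    for child in node["children"]:
        kept.extend(agglomerate(child))
    mean_children = sum(c["m"] for c in kept) / len(kept)   # measure (3) of the group
    if mean_children < node["m"]:
        return kept                     # daughters separate better: keep them
    node["children"] = []
    return [node]                       # parent separates better: undo the split

# toy node: cluster 2 of Section 5.1 (m = 0.11) and its four subclusters
parent = {"m": 0.11, "children": [
    {"m": 0.08, "children": []}, {"m": 0.06, "children": []},
    {"m": 0.01, "children": []}, {"m": 0.00, "children": []}]}
print(len(agglomerate(parent)))         # -> 4, since (0.08+0.06+0.01+0.00)/4 < 0.11
```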

Small clusters can be filtered. Clusters whose areas are less than an introduced threshold (in pixels) are excluded from sums (2) and (3). This step is justified, since the small number of elements in such clusters does not ensure statistical reliability in the estimate of the isolation of a cluster (1); the histogram of these elements cannot serve as an approximation of the probability density of the attribute vectors, and application of the Narendra method is not justified. These clusters can be annexed according to the principle of worse separability or, in the case of complete isolation, by the nearest distance; here, only sisters from the common parent cluster are analyzed. The investigations show that there can be many small clusters, but their total data volume is not large. However, their contribution to sum (3) can be significant because they are often poorly separated.
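A possible form of this filtering step is sketched below. The three-pixel threshold follows the experiments in Section 5, while the field names and the crude "annex to the worst-separated sister" rule are assumptions, since the paper does not fix these details.

```python
def filter_small_clusters(sisters, min_area=3):
    """Exclude clusters smaller than the area threshold (in pixels) from the
    separability sums and annex each of them to one of its sister clusters.

    sisters : list of dicts with assumed fields "area" and "m" (measure (1)),
              all daughters of one common parent cluster.
    Returns (kept_clusters, merged) where merged maps the index of a small
    cluster to the index of the sister it was annexed to.
    """
    large = [i for i, c in enumerate(sisters) if c["area"] >= min_area]
    small = [i for i, c in enumerate(sisters) if c["area"] < min_area]
    merged = {}
    if large:
        # crude proxy for "worse separability": the sister with the largest
        # measure (1); a real implementation could also use nearest distance
        target = max(large, key=lambda i: sisters[i]["m"])
        for i in small:
            sisters[target]["area"] += sisters[i]["area"]
            merged[i] = target
    return [sisters[i] for i in large], merged

sisters = [{"area": 500, "m": 0.08}, {"area": 700, "m": 0.06}, {"area": 2, "m": 0.33}]
kept, merged = filter_small_clusters(sisters)
print(len(kept), merged)                # -> 2 {2: 0}
```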

As a result of the final classification, the distribution of the data into clusters is obtained. The data are represented with varying detail in keeping with their degree of isolation. For a more detailed study of the data, one can consider each cluster to be new data and seek the better separability of its hierarchical subclusters.

5. EXPERIMENTS

Let us take a fragment of an image of the Earth's surface covering parts of Western Siberia and Kazakhstan, which was obtained from the NOAA satellite on April 24, 2003. The image has five spectral channels: it was scanned in two ranges of the visible part of the spectrum and in the infrared range at wavelengths of 3.7, 11, and 12 μm.

5.1. Classification of a Two-Spectral Image

As a demonstration, we first consider the classification by two channels, namely, the near infrared channel at a wavelength of 3.7 μm (R) and the blue channel (B) of the visible spectrum. The photograph in the color registration format with a marked coordinate grid and the names of some cities, localities, and lakes is shown in Fig. 1. The top and bottom of the image present melting snow and a thawed surface almost free of snow, respectively. The Irtysh river runs diagonally below Omsk.

At the first stage of the hierarchy, the number of quantization levels n = 8 was found, for which measure (2) reached its minimum m(8) = 0.18 in the variation range 0 < n < 256. The number of clusters is 3. These three clusters correspond to melting snow, thawed areas without snow, and some small lakes (water). At the first stage, the large lakes turned out to be assigned to cluster 1 (melting snow).

The obtained minimum corresponds to the best classification for the main algorithm (without hierarchy). Figure 2a shows a vector diagram of the distribution of the two features; the clusters of the first stage of the hierarchy are shown.

At the same time, the new hierarchical algorithm demonstrated a cluster distribution with better separability of clusters with respect to measure (3). The best general classification of the hierarchical algorithm corresponds to the minimum of measure (3), M_K = 0.06, and the total number of clusters is K = 14. The minimum was obtained at the fourth stage of the hierarchy. The number of quantization levels reached n = 23 for the snow-covered area and n = 48 for the thawed area. Clusters with an area of less than three pixels were not included in (3). Figure 2b shows the clusters of the best distribution. The profile of the histogram along the line R = 233 for n = 256 is shown in Fig. 2c. For the obtained numbers of quantization levels, the two large modes correspond to the two clusters of the thawed area that fell along the chosen line; for these clusters, the numbers of levels turned out to be similar, i.e., n = 48.

Table 1 shows the obtained hierarchical tree. Every cluster is characterized by its separability with respect to measure (1). Solid lines denote the branches that remained after the minimization of the average measure (3). If the separability (1) of a cluster was less than the separability (3) of the whole collection of its subclusters found at the last stage of the hierarchy, this cluster remained at the stage of the hierarchy where it was obtained. For example, the average measure for the group of subclusters of cluster 2, which corresponds to a thawed surface, is M4(2) = (0.08 + 0.06 + 0.01 + 0.00)/4 = 0.04. This is significantly less than the separability of cluster 2 itself, which is equal to 0.11.

Cluster 3 corresponds to several small lakes; consequently, the partition process was stopped once zero separability was achieved at the second stage.

Figure 3 presents a cluster map constructed with the proposed hierarchical algorithm. Four clusters are assigned to melting snow.

Fig. 1. Fragment of the satellite photograph.

Fig. 2. (a) Cluster distribution for the first stage of the hierarchy, (b) best cluster distribution obtained at the fourth stage, and (c) profile of the two-dimensional histogram along the line R = 233.

Table 1. Hierarchical representation of the data obtained by the divisive cluster algorithm (root: all data). Roman numerals indicate the stages of the hierarchy, the numbers give the separability of clusters with respect to measure (1), and the quantization levels reached are n = 23 and n = 48:
I: 0.34, 0.11, 0.07;
II: 0.27, 0.10, 0.13, 0.22, 0.04, 0.16, 0.00;
III: 0.20, 0.35, 0.23, 0.00, 0.11, 0.15, 0.00, 0.00, 0.33, 0.33, 0.03;
IV: 0.25, 0.20, 0.03, 0.24, 0.25, 0.67, 0.06, 0.32, 0.00, 0.00, 0.22, 0.21, 0.08, 0.05, 0.27, 0.06, 0.01.

Fig. 3. Map of automatic clusterization with the hierarchical algorithm: (1, 2) melting snow, (3, 4) coniferous forest under melting snow, (5) noncontinuous snow cover, (6, 7, 8, 9) lakes, (10) forests in places, (11, 12) no snow, and (13) coniferous forest.

Cluster 2 distinguishes Chany lake under snow, and cluster 4 includes diagonal zones of the Suzun ribbon forests. Cluster 5 corresponds to a noncontinuous snow cover. Four clusters correspond to thawed surfaces. Cluster 13 corresponds to coniferous forests. Cluster 11 seems to be associated with the height of the relief; its borders on the physical map are close to those of the Kazakh Uplands.

For comparison, let us note that the details of the thawed area (n = 48) in the main algorithm (without the hierarchy) are obtained with 55 clusters and a value of the average separability measure (2) of m(n) = 0.43. Clusters with areas of less than three pixels were also not considered when calculating the average separability (2).

5.2. Classification of the Five-Spectral Image

Figure 4 presents the images of the fragment in the five spectral channels. Let us consider the classification by the data of all the channels. Now it is impossible to construct an illustrative vector diagram, and the separability of clusters can be judged only from the value of the measure.

At the first stage of the hierarchy, the minimum of measure (3) is M_K = 0.09 for the number of quantization levels n = 5. The best classification, which corresponds to the minimum of measure (3) M_K = 0.02, was obtained at the third stage of the hierarchy. The number of clusters was K = 15. The map of clusters is presented in Fig. 5a. The details of the thawed area (n = 16) were obtained by the main algorithm (without the hierarchy) with 58 clusters instead of 15 and with a value of the separability measure (2) of m(n) = 0.18.

For the snows (in the highlighted fragment), good agreement was obtained with the classification scheme in Fig. 5b compiled by the Hydrometeorological Center of Russia (three clusters as well). Let us choose some clusters for a more detailed study. The next minimum, M_K = 0.08, was obtained for an increased number of quantization levels for the subclusters of the snow-covered area (Fig. 6).

To analyze the data structure in more detail, one can define the target of the hierarchical algorithm in another way: separate the clusters until the separability of the obtained subclusters exceeds a specified accuracy (with an increase in the number of quantization levels), or continue the partitioning until minimal cluster volumes are reached. Let us present a more detailed clusterization of the largest cluster of the thawed surface. At each subsequent stage of the hierarchy, the largest of the obtained clusters was chosen; it turned out that, at n = 22, its separability is zero, i.e., it was possible to separate it from small clusters with poor separability.

Fig. 4. (a, b) Ranges of the visible part of the spectrum, (c) infrared 3.7 μm range, (d) infrared 11 μm range, and (e) infrared 12 μm range.


The new map is shown in Fig. 7a. The details (n = 22) were obtained by the main algorithm over the whole volume of data with 260 clusters. The highlighted cluster (shown in white) reflects the distribution of the data in spectral channels 4 and 5 of the infrared range. Figures 4d and 4e clearly show dark objects that are absent in the images of the visible (Figs. 4a and 4b) and near infrared (Fig. 4c) ranges. It is interesting to note that the segments of the highlighted cluster on the cluster map correspond to mining areas. A schematic map of Pavlodar region is shown in Fig. 7b (taken from the Internet). Ekibastuz and Bayanaul are known as centers of opencast coal mining. Other deposits of nonferrous metals and coal developed in Kazakhstan to the left of the Irtysh are also known, including Maikain, Shoptykol, and Koktobe. The majority of the highlighted cluster corresponds to the Ekibastuz–Maikain–Bayanaul region of the photograph. Along the right bank of the Irtysh, natural industrial minerals are produced and some ore and coal treatment is performed; highways and railroads span its banks. To the west of Ekibastuz, the Bozshakol settlement is also known as an opencast mining site.

These localities are marked on the satellite image in Fig. 1 and fall into the region of the obtained cluster. Salt is mined in Kalmakan and Tavolzhan lakes, and nonferrous metals are mined to the south of Kurchatov and Semipalatinsk. It may be that soil disturbances cause the changes in the spectral characteristics of the infrared range in mining areas.

Fig. 5. (a) Map of automatic clusterization of the five-spectral image by the hierarchical algorithm for a system of quantization grids ensuring the best average separability of clusters and (b) classification of melting snow in the square shown in Fig. 5a, performed by the Hydrometeorological Center of Russia from ground-based measurements (legend: 80–100% melting snow, 30–80% noncontinuous snow, 5–30% snow in places).

Fig. 6. Map of a more detailed clusterization of the snow-covered area in the five-spectral image.


6. CONCLUSIONS

The proposed hierarchical algorithm is based on the Narendra histogram algorithm. The novelty of this algorithm is that it permits one to automatically choose a different number of quantization levels (detail) for different domains of the vector space in accordance with the estimate of the separability of the clusters in them. In addition, an algorithm was proposed for choosing the configuration of the detail that yields the minimum of the measure of average separability of the collection of all obtained clusters. The algorithm provides a differentiated approach to different regions of multispectral data. Comparison with the direct algorithm in the analysis of the Earth's surface shows that the new algorithm significantly reduces the number of clusters and improves their correspondence to information classes of the land cover. The hierarchical algorithm permits one to choose the boundaries of the studied objects and to investigate the data structure in automatic and interactive modes, comprehensively and selectively.

ACKNOWLEDGMENTS

This work was supported in part by the Russian Foundation for Basic Research, project no. 10-07-00131.

REFERENCES

1. P. Gong and P. J. Howarth, "An Assessment of Some Factors Influencing Multispectral Land-Cover Classification," Photogramm. Eng. Remote Sensing 56 (5), 597–621 (1990).
2. P. H. Swain and S. M. Davis, Remote Sensing: The Quantitative Approach (McGraw-Hill, 1978).
3. K. Fukunaga, Introduction to Statistical Pattern Recognition (Academic Press, New York, London, 1972).
4. P. M. Narendra and M. Goldberg, "A Non-Parametric Clustering Scheme for LANDSAT," Pattern Recogn. 9, 207–215 (1977).
5. V. S. Sidorova, "Multi-Dimensional Histogram and the Way to Separate Vector Space of Features According to Unimodal Clusters," in Proc. GraphiCon'2005 Conf. (Novosibirsk, 2005), pp. 267–274.
6. M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "On Clustering Validation Techniques," J. Intell. Inf. Syst. 17 (2–3), 107–132 (2001).
7. V. S. Sidorova, "Histogram Clustering Validation for a Multispectral Image," Optoelectron., Instrum. Data Process. 43 (1), 28–32 (2007) [Original Russian Text: Avtometriya 43 (1), 37–43 (2007)].
8. V. S. Sidorova, "Unsupervised Classification of Image Texture," Pattern Recognition and Image Analysis: Adv. Math. Theory Appl. 18 (4), 694–700 (2008).

Fig. 7. (a) Map of detailed clusterization of the thawed area of the five-spectral image and (b) schematic map of Pavlodar region.

Valeriya Sergeevna Sidorova. Born 1947. Graduated from the Department of Physics of Novosibirsk State University in 1972. Scientific interests: image processing, unsupervised classification, and analysis of textures. Author of more than 50 papers. Research scientist at the Institute of Computational Mathematics and Mathematical Geophysics, Siberian Branch of the Russian Academy of Sciences (Novosibirsk).