6
Implementation of Clustering-Based Image Retrieval System YUK YING CHUNG School of Information Technologies University of Sydney NSW 2006, Sydney AUSTRALIA Abstract: - Digital images are useful media for storing spatial, spectral and temporal components of information. Large image databases often store the images in compressed format, JPEG for example. This paper examines the algorithms of direct extraction of low level features from compressed images, working with three different clustering techniques. Results indicate that a K-Harmonic means clustering algorithm not only has the advantage of shorter in both clustering and searching time but also can have a higher accuracy rate. Key-Words: - Image compression, DCT domain, clustering, Image retrieval 1 Introduction Content-based image and video retrieval has become an important research area recently. The growing popularity of the internet, the introduction of new consumer products for image and video creation have resulted in great demand of multimedia storage and retrieval systems. Content-based retrieval systems for images based on various image features such as colour, shape and texture have been reported [1-3]. However, most images and video are stored in compressed format in order to solve the storage problem. To search and locate the database in compressed domain is still a difficult problem. This work is based on the discrete cosine transform (DCT) which is the heart of current image and video compression standards such as JPEG[4], MPEG and H.261. The features will be directly extracted from compressed data and these features are the characteristics of image and video contents for content-based retrieval. Most image retrieval systems extract specific features from a query image and compare these features with the corresponding pre-computed features of all the images in the database. The search time will increase linearly with the size of database. The cumulative time needed to compare the query image with all database images will become very long even the time required to compare two images is very short. In order to solve this problem, we can create an indexing scheme by grouping the images based on their content values so that only the relevant set of cluster images will be compared. The clustering algorithms must retain the original retrieval accuracy. Indexing techniques such as R* and K-D-B trees [5,6] have been proposed for fast range search but they cannot solve the problem of large dimensionality of vectors greater than ten. Therefore, we choose clustering techniques in our retrieval system to overcome the feature dimension problem. In this paper, section 2 gives the introduction of Image Retrieval System using DCT coefficients. Section 3 presents an introduction to three clustering algorithms: K-Means, Isodata and K-Harmonic Means. Section 5 and 6 present the testing results and conclusions. 2 Image Retrieval System using DCT coefficients Most images in database are stored in JPEG[4] format. It can reduce much computational complexity if we process the image retrieval in compressed domain and avoid the full data decompression process. In this work, we use various sets of DCT coefficients extracted from JPEG compressed images. The DCT coefficient values can be regarded as the relative amount of the 2D spatial frequencies contained in the 8x8 block input data. The coefficient with zero frequency is called “DC coefficient” and the remaining 63 coefficients are called “AC coefficients”. Figure 1 shows the zig zag sequence of 8x8 DCT block with DC and 63 AC coefficients. The forward DCT process can concentrate most of the signal in the lower spatial 4th WSEAS International Conference on ELECTRONICS, CONTROL and SIGNAL PROCESSING, Miami, Florida, USA, 17-19 November, 2005 (pp.75-80)

Implementation of Clustering-Based Image Retrieval System...1. Initialize the cluster centroids to the seed points. 2. Find the cluster centroid for each sample. 3. Compute the new

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Implementation of Clustering-Based Image Retrieval System...1. Initialize the cluster centroids to the seed points. 2. Find the cluster centroid for each sample. 3. Compute the new

Implementation of Clustering-Based Image Retrieval System

YUK YING CHUNG School of Information Technologies

University of Sydney NSW 2006, Sydney

AUSTRALIA

Abstract: - Digital images are useful media for storing spatial, spectral and temporal components of information. Large image databases often store the images in compressed format, JPEG for example. This paper examines the algorithms of direct extraction of low level features from compressed images, working with three different clustering techniques. Results indicate that a K-Harmonic means clustering algorithm not only has the advantage of shorter in both clustering and searching time but also can have a higher accuracy rate.

Key-Words: - Image compression, DCT domain, clustering, Image retrieval 1 Introduction Content-based image and video retrieval has become an important research area recently. The growing popularity of the internet, the introduction of new consumer products for image and video creation have resulted in great demand of multimedia storage and retrieval systems. Content-based retrieval systems for images based on various image features such as colour, shape and texture have been reported [1-3]. However, most images and video are stored in compressed format in order to solve the storage problem. To search and locate the database in compressed domain is still a difficult problem. This work is based on the discrete cosine transform (DCT) which is the heart of current image and video compression standards such as JPEG[4], MPEG and H.261. The features will be directly extracted from compressed data and these features are the characteristics of image and video contents for content-based retrieval.

Most image retrieval systems extract specific features from a query image and compare these features with the corresponding pre-computed features of all the images in the database. The search time will increase linearly with the size of database. The cumulative time needed to compare the query image with all database images will become very long even the time required to compare two images is very short. In order to solve this problem, we can create an indexing scheme by grouping the images based on their content values so that only the relevant set of cluster images will be compared. The clustering algorithms must retain the original retrieval accuracy. Indexing techniques

such as R* and K-D-B trees [5,6] have been proposed for fast range search but they cannot solve the problem of large dimensionality of vectors greater than ten. Therefore, we choose clustering techniques in our retrieval system to overcome the feature dimension problem. In this paper, section 2 gives the introduction of Image Retrieval System using DCT coefficients. Section 3 presents an introduction to three clustering algorithms: K-Means, Isodata and K-Harmonic Means. Section 5 and 6 present the testing results and conclusions. 2 Image Retrieval System using DCT

coefficients

Most images in database are stored in JPEG[4] format. It can reduce much computational complexity if we process the image retrieval in compressed domain and avoid the full data decompression process.

In this work, we use various sets of DCT coefficients extracted from JPEG compressed images. The DCT coefficient values can be regarded as the relative amount of the 2D spatial frequencies contained in the 8x8 block input data. The coefficient with zero frequency is called “DC coefficient” and the remaining 63 coefficients are called “AC coefficients”. Figure 1 shows the zig zag sequence of 8x8 DCT block with DC and 63 AC coefficients. The forward DCT process can concentrate most of the signal in the lower spatial

4th WSEAS International Conference on ELECTRONICS, CONTROL and SIGNAL PROCESSING, Miami, Florida, USA, 17-19 November, 2005 (pp.75-80)

Page 2: Implementation of Clustering-Based Image Retrieval System...1. Initialize the cluster centroids to the seed points. 2. Find the cluster centroid for each sample. 3. Compute the new

frequency. For 8x8 block from compressed image, we can only consider the low spatial frequency coefficients to construct the energy histogram. The following S1, S2, S3 and S4 are four sets of the DCT coefficients that we have used for testing.

S1 = [DC]; S2 = [c(0,1), c(1,0), c(1,1)] S3 = [DC, c(0,1), c(1,0), c(1,1)] S4 = [DC, c(0,1), c(0,2), c(1,0), c(1,1),

c(1,2), c(2,0), c(2,1), c(2,2)]

Figure 1. The zig zag sequence of DCT coefficients The DCT energy histogram is constructed by counting the number of times of an energy level that occurs in 8 x 8 DCT blocks. After constructing the DCT energy histogram, the histogram intersection method will be used to perform matching. In matching process, the reference image DCT energy histogram will compare with different groups in the databases. The matched images will have the higher matching values. Histogram intersection is an efficient way to perform matching. The computation complexity is low and can be implemented by most of the computers. The matching algorithm is as follow:

Given a pair of histograms which consist of a reference image (I) and a database image (J), each contains N DCT energy levels.

In order to normalize the matching result between 0 and 1, the measure is finally divided by the total number of coefficients used in the reference image. The histogram intersection technique proposed by [7] is defined as follows:

3 Clustering Algorithms Many retrieval systems calculate the features similarity between the query image and all images in the database and rank the images by sorting their similarities. The problem of this full search approach is very time consuming for large scale database. The retrieval time of this full search is the sum of the time to calculate similarity time Tsim and the time to sort the images in the database according to their similarity Tsort. Time for the full search Tfull: Tfull = Tsim + Tsort Tfull = nTsim(1) + O(nlog n) Where n : number of images in the database Tsim: total time to calculate the similarity Tsort: total time to rank n images Tsim(1) : time to calculate similarity between two images O(nlog n): time to sort n elements If the images in the database are clustered, the retrieval time is the sum of:

1) The time to calculate the similarity between

the query and the cluster centers. 2) The time to calculate the similarity between

the query and the images in the nearest clusters.

3) The time to rank the images in step 2. Time for cluster search Tcluster : Tcluster = k Tsim(1) + l Tsim(1) + O(l log l) Where k: number of clusters l: number of images in the clusters

nearest to the query

DC c(0,1) c(0,2) c(0,3) c(0,4) c(0,5)

c(1,1) c(1,2) c(1,3) c(1,4)

c(2,1) c(2,2) c(2,3)

c(3,1) c(3,2)

c(1,0)

c(2,0)

c(3,0)

c(4,0) c(4,1)

c(5,0)

H I JI J

I

n nn

N

nn

N( , )min( , )

= =

=

∑1

1

match value I Jn nn

N

_ min( ),==∑

1

4th WSEAS International Conference on ELECTRONICS, CONTROL and SIGNAL PROCESSING, Miami, Florida, USA, 17-19 November, 2005 (pp.75-80)

Page 3: Implementation of Clustering-Based Image Retrieval System...1. Initialize the cluster centroids to the seed points. 2. Find the cluster centroid for each sample. 3. Compute the new

Since k << n and l << n, therefore the clustering search time Tcluster should be much smaller than the full search time Tfull (i.e. Tcluster << Tfull ). In this work, three different clustering algorithms: K-Means, Isodata and K-Harmonic Means have been implemented and tested. Section 3.1, 3.2 and 3.3 will give the introduction to these three algorithms. 3.1 K-Means The K-Means clustering algorithm [8] is as follow: 1. Begin with k clusters, each consisting of one of

the first k samples. For each of the remaining (n-k) samples, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. After each sample is assigned, recompute the centroid of the altered cluster.

2. Go through the data a second time. For each sample, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. Do not recompute any centroid in this step.

3. If no samples change cluster in step 2, stop. 4. Go to Step 2.

3.2 Isodata Compared to other clustering algorithms, the Isodata [8] tries to minimize the squared error by assigning samples to the nearest centroid. It deals with k clusters where k is allowed to range over an interval that includes the number of clusters requested by the user. The following steps need to be followed:

1. Initialize the cluster centroids to the seed points. 2. Find the cluster centroid for each sample. 3. Compute the new centroids. 4. If more than one sample has changed clusters and

number of iteration is less than start_i, go to step 2.

5. Discard clusters and samples if containing fewer than e-min samples.

6. If the number of clusters is greater than or equal to 2 x c_no or number of this iteration is even, go to step 7; otherwise go to step 8.

7. If the distance between two centroids is less than d_min, merge these clusters and update the centroid; otherwise repeat this step and go to step 8.

8. If the number of clusters is less than or equal to c_no / 2 or number of this iteration is odd, go to step 9, otherwise, go to step 10.

9. Find the standard deviation for the complete original set of samples. If none, compute the mean within the cluster.

Compute the centroids of these two clusters. If the distance between these centroids is greater than or equal to 1.1 x d_min, replace the original cluster by these two clusters; otherwise do not split the cluster.

10. If there is no change in cluster when number of iterations greater than i_max, stop. Otherwise, take the centroids of the clusters as new seed points and go to step 2.

where: c_no – the desired number of clusters e_min – the minimum number of samples per cluster d_min – the minimum distance between centroids

without merging start_i – maximum number of iteration in first part. i_max -- maximum number of iteration in main part. 3.3 K-Harmonic Means Algorithm The K-Harmonic Means Algorithm [9] is a recent algorithm designed by Bin Zhang from Hewlett Packard Laboratories. The goal of this algorithm is to tackle the major problem of the K-Means algorithm. The major problem of the K-Means algorithm is the performance dependency on the initialisation of the centres. For example, if the seeds chosen are close to another, the clusters formed will be close to each other as well. The K-Harmonic Means algorithm uses the Harmonic Averages of the distances from each sample point to the centres as components of its performance function. The Harmonic Average is defined as follows:

HA({ a i | i = 1, …, K}) =

∑=

K

i ia

K

1

1

In the Harmonic Average function, a represents the sample, a i is the distance to i cluster and K is the total number of clusters. The developers of K-Harmonic Means algorithm devised the performance function of this algorithm as a derivative of the Harmonic average function. The performance function is described as follows:

4th WSEAS International Conference on ELECTRONICS, CONTROL and SIGNAL PROCESSING, Miami, Florida, USA, 17-19 November, 2005 (pp.75-80)

Page 4: Implementation of Clustering-Based Image Retrieval System...1. Initialize the cluster centroids to the seed points. 2. Find the cluster centroid for each sample. 3. Compute the new

2

min

2

,

min,3,

4min,

,

1⎥⎥

⎢⎢

⎟⎟⎠

⎞⎜⎜⎝

⎛+

=

∑≠l li

iki

iki

dd

d

dq ,

∑=

=K

kkii qq

1, ,

i

kiki q

qp ,

, = ,

∑=

=N

iikik xpm

1, .

di,min is defined as the distance between each sample and its nearest cluster, and di,l is defined as the distance between each sample and each cluster. Similar to the Isodata algorithm, it does not have a fixed number of clusters. The K-Harmonic Means algorithm assigns the samples only when n centroids are computed for n clusters where n is specified by the user. This means some clusters may not contain any sample; therefore, the number of clusters is not fixed. The K-Harmonic Means algorithm [9] can be performed in six steps as shown below: 1. n clusters are created and initialised with n seed

points; 2. The distances from all samples to all centroids

are computed; 3. The nearest cluster to each sample and the

distance are searched; 4. The results from Step 2 and Step 3 are used in the

performance function; and 5. If the centroids do not change anymore or the

convergence value is reached, the algorithm ends here; otherwise go to Step 2.

4 Testing results Tests are designed to determine the performance and accuracy of the results retrieved by the proposed CBIRS system. Tests are conducted on a total of 3000 images in different sizes. All tests are conducted in the same environment as shown below: 1. Operating System: Microsoft Windows

Professional XP SP1 Edition 2. Software: Microsoft Visual C++ 6.0 3. CPU Speed: AMD Althon XP 2500+ 4. Memory: 512MB @ 333 MHz DDR Ram

The Percentage of Retrieval Accuracy Method (PRAM) is used to determine the percentage of accuracy. The accuracy result of each query image, ai is measured by the number of images returned with clustering out of the number of images returned without clustering. PRAM is determined as below:

K

aK

ii∑

=1 * 100%

where K is the total number of query images

The function of PRAM is computed by adding the accuracy results of all query images performed, and the percentage is then obtained by using the mean accuracy result multiplied with 100%. The testing query images include various sizes and various angles. Figure 2 shows the example of query images in various angles. 4.1 Average Clustering Time Analysis

No. of Clusters

Kmeans (seconds)

Isodata (seconds)

K-Harmonic

Means (seconds)

Average Time

Taken in terms of

no. of clusters

used 5 11.6 13.2 4.6 9.52

10 20.4 22.8 8.8 18.64 20 43 46.4 13 33.28 50 161 125.6 25.8 100.44 80 158.8 212 77 136.84 100 214.2 265 109.2 187.12

Average Time 101.5 114.1667 39.73

Table 1 Average time taken for clustering algorithms in 5, 10, 20, 50, 80 and 100 clusters

From Table 1 it is evident that K-Harmonic Means has the best performance compared to other clustering algorithms since it requires the least clustering time. The reason behind this is that the K-Harmonic Means algorithm calculates the centroids without assigning images to the clusters until the convergence value is reached. Furthermore the duration for clustering in n clusters increases when n increases. Isodata clustering algorithm has the worst performance due to splitting and merging operations despite the fact that the number of iterations is only 10 in any situation.

4th WSEAS International Conference on ELECTRONICS, CONTROL and SIGNAL PROCESSING, Miami, Florida, USA, 17-19 November, 2005 (pp.75-80)

Page 5: Implementation of Clustering-Based Image Retrieval System...1. Initialize the cluster centroids to the seed points. 2. Find the cluster centroid for each sample. 3. Compute the new

4.2 Retrieval Accuracy based on Clustering

Clustering AlgorithmType RGB S1 S2 S3 S4

K-Means 87% 85% 81% 85% 86%

Isodata 80% 90% 91% 92% 91%

K-Harmonic Means 83% 94% 90% 95% 92%

Table 2 Accuracy results produced by RGB and DCT coefficients based on clustering algorithms in five clusters

Table 2 shows the results in five clusters. It is apparent that the K-Harmonic Means clustering algorithm has obtained the most favourable result.

The overall results for K-Means, Isodata and K-Harmonic Means clustering algorithms are shown in Table 3.

Test K-Means Algorithm

Isodata Algorithm

K-Harmonic Means Algorithm

Clustering Retrieval Accuracy Test 84.8% 88.8% 90.8%

Searching Existing Images Test 100% 98% 100%

Existing Images in Various Angles Test 98% 98% 98%

Existing Images in Various Sizes Test 94% 96% 94%

Distorted Existing Image Test 100% 96% 96%

Average Accuracy in percentage 94.2% 95.2% 95.7%

Table 3 Overall results in terms of clustering algorithms obtained by five different tests As shown in Table 3, the K-Harmonic Means algorithm performed better than the other two clustering algorithms. Since the K-Harmonic Means algorithm can perform clustering faster and facilitate faster searching it is a good choice to use for clustering images. 5 Conclusion In this paper, a new content based image retrieval system based on DCT coefficient energy histogram and clustering algorithm was proposed. In this paper, we have compared three different clustering algorithms: Isodata, K-Means, and K-Harmonic Means for image retrieval. For a database containing 3000 images, a retrieval accuracy higher than 90% can be achieved within a second by using a PC machine. From our results, it is evident that the K-Harmonic Means clustering algorithm is the most suitable clustering algorithm based on performance

in the algorithms test using JPEG compressed images in CBIRS. It produces the best accuracy and shortest clustering time when compared with other alternative methods. References: [1] W.Niblack, et.al.,”The QBIC Project: querying

images by content using color, texture and shape,” In Storage and Retrieval for Image and Video Databases, Vol. 1908, SPIE Proceedings, Feb, 1993.

[2] J.Smith and S.F. Chang, ”A Fully Automated Content-based Image Query System”, Proc. of ACM Intl. Multimedia Conference, Nov. 1996.

[3] H.Zhang, Y.Gong, C.Y.Low and S.W. Smoliar, “Image retrieval based on color features: an evaluation study”, Proc. of SPIE vol 2606, pp.212-220, 1995.

[4] Gregory K. Wallace, “The JPEG Still Picture Compression Standard.

[5] N.Beckman, et. Al., “The R*tree: An efficient and robust access method for points and rectangles”, In Proc. Of ACM SIGMOD, Atlantic City, May 1990.

[6] J.T. Robinson, “The K-D-B tree: a search structure for large multidimensional dynamic indexes”, In Proc of ACM SIGMOD, Ann Arbor, April 1981.

[7] M.J. Swain and D.H. Ballard, “Color Indexing”, International Journal of Computer Vision, Vol 7, No. 1, 1991, pp 11-32.

[8] E.Gose, R.Johnsonbaugh, S.Jost, ”Pattern Recognition and Image Analysis” Prentice Hall, 1996.

[9] B.Zhang, et. A., “K-Harminic Means - A data Clustering Algorithm. http://www.hpl.hp.com/techreports/1999/HPL-1999-124.html

4th WSEAS International Conference on ELECTRONICS, CONTROL and SIGNAL PROCESSING, Miami, Florida, USA, 17-19 November, 2005 (pp.75-80)

Page 6: Implementation of Clustering-Based Image Retrieval System...1. Initialize the cluster centroids to the seed points. 2. Find the cluster centroid for each sample. 3. Compute the new

Existing Image in database

Horizontally Flipped Vertically Flipped 90 Degrees Clockwise

270 Degrees Clockwise

Wtrfall.raw Wtrfall_HF.raw Wtrfall_VF.raw Wtrfall_C90.raw Wtrfall_C270.raw

Watch.raw Watch_HF.raw Watch_VF.raw Watch_C90.raw Watch_C270.raw

Bridge.raw Bridge_HF.raw Bridge_VF.raw Bridge_C90.raw Bridge_C270.raw

Jody02.raw Jody02_HF.raw Jody02_VF.raw Jody02_C90.raw Jody02_C270.raw

Figure 2. Example of Query Images in Various Angles

4th WSEAS International Conference on ELECTRONICS, CONTROL and SIGNAL PROCESSING, Miami, Florida, USA, 17-19 November, 2005 (pp.75-80)