
Automatic Image Annotation Using Synthesis of

Complementary Features

Sreekumar K

Department of Computer Science

College of Engineering Poonjar

Kottayam, Kerala, India

[email protected]

Anjusha B

Department of Computer Science

College of Engineering Poonjar

Kottayam, Kerala, India

[email protected]

Rahul R Nair

Department of Computer Science

College of Engineering Poonjar

Kottayam, Kerala, India

[email protected]

Abstract—Image annotation is the process of assigning meaningful keywords to an image. In automatic image annotation, this process is executed automatically by analysing the semantics of the image, which are interpreted by extracting high-level and low-level image features. This work implements a system that automatically annotates colour images using a special feature extraction mechanism, which can be used very effectively for image sequence recognition or classification. This special feature is derived by combining three features under research, namely, Histogram of Oriented Gradients (HOG), Speeded Up Robust Features (SURF), and a colour feature based on the HSV colour structure. We thus form a synthesized feature descriptor which describes three essential aspects of visual perception: colour, shape, and points of interest. The proposed system follows a hybrid approach in which a specific set of data is first trained, and annotation is then performed using fuzzy K-NN classification. In our experiments, the system has shown good accuracy and high potential for the textual description of digital photographic images.

Keywords— HOG; SURF; HSV colour feature; fuzzy K-NN; image annotation; feature extraction.

I. INTRODUCTION

Nowadays, digital image capturing devices are abundant and increasingly available, which has driven a great increase in the usage of digital images in recent years. Despite widespread research in this field, accurate retrieval of images from large photo collections remains a challenging task.

This problem is basically due to the difficulty of mapping the high-level meaningful content of an image as perceived by Homo sapiens. Automatic Image Annotation (AIA) is the process of assigning a meaningful textual description to an image based on the content it contains. The main task in AIA is to reduce the semantic gap between high-level visual concepts and low-level visual features. Social networking websites and image databases are in great need of systems such as AIA and its applications.

Earlier, the technique was to manually assign suitable keywords to images in the form of metadata and retrieve them according to need. As digital images increase day by day, this has become a tedious task demanding a huge amount of human effort, time, and patience.

Around the 1990s, Content Based Image Retrieval (CBIR) systems were proposed to organize and search images by matching their low-level features, eliminating the difficulties of traditional image annotation methods [1]. In traditional CBIR systems, user queries were used to retrieve images.

In this paper, a novel method for automatic image annotation is proposed, which assigns to digital image collections the keywords closest to their semantic concepts, based on the information contained in them.

In this method, the image representation scheme is realized using three descriptors, namely, an HSV colour histogram (10-dimensional vector), Histogram of Oriented Gradients (81-dimensional vector), and Speeded-Up Robust Features (64-dimensional vector).

The rest of the paper is structured as follows. In Section 2, a literature review is presented. The proposed system and a description of the features used are included in Section 3. Experimental results and performance analysis are part of Section 4. Finally, the paper is concluded in Section 5.

II. RELATED WORK

Many techniques and approaches have been proposed for automatically annotating images. A co-occurrence model proposed by Mori et al. [6] assigned co-occurring tags to each image. However, it requires a large number of images in the training set to obtain high probabilities, and it also assigns the same frequent words to all probable images.

The annotation process can be classified into the segmental approach and the holistic approach, depending on the portion of the image used for extraction. The segmental approach segments the whole image; feature extraction is done for each such segment [7], and the segments are considered for annotation.


In the holistic approach, the whole image is used for feature extraction. No segmentation is done, and a relation is thus discovered between the image and the textual descriptors [8].

Annotation methods can also be classified by the features extracted. The general categories in this classification are colour features, texture features, and scale- and rotation-invariant features.

The colour feature is one of the most extensively used features in the annotation procedure; a method of using the colour feature for image annotation is explained in [8]. The texture feature is the next important feature used in image annotation. Texture is the presence of a pattern that exhibits some characteristic degree of self-similarity.

In recent years, the most widely used features in the image annotation process have been scale- and rotation-invariant features. The most common among these are:

Scale Invariant Feature Transform (SIFT) [2]

Speeded Up Robust Features (SURF) [4]

Object detection is an issue closely tied to feature sets. In our study, we found that normalized Histogram of Oriented Gradients (HOG) [3] descriptors provide comparatively high performance for such problems relative to other existing feature sets such as wavelets. Their power combined with simplicity makes HOG descriptors highly usable.

HOG descriptors are extensively used for object recognition in image processing and computer vision. HOG is similar to Edge Orientation Histograms (EOH), the Scale-Invariant Feature Transform (SIFT), and shape contexts; the difference is that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.

HOG thus captures edge and gradient structure, which contributes mainly to local shape, and it has a controllable degree of invariance to rotation and translation. For images containing humans, HOG performs best because it allows limbs and body segments to change appearance, provided the figures remain roughly upright. It also performs well with procedures such as rough 3-D sampling, fine orientation sampling, and strong local photometric normalization.

SURF, an acronym for Speeded-Up Robust Features, is another well-known feature detection algorithm with a novel scale- and rotation-invariant detector and descriptor [4]. SURF is much faster to compute and compare, yet it approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness.

For colour feature extraction, many methods of differing complexity exist, such as colour moments [9], the Scalable Color Descriptor (SCD) [10], the Color Structure Descriptor (CSD) [10], the Color Layout Descriptor (CLD) [10], the Dominant Color Descriptor (DCD) [10], Color Coherence Vectors (CCV) [11], and colour correlograms [11]. Here, however, we used a relatively easy-to-compute feature based on the HSV colour histogram, which is described in the proposed system.


We studied each of these descriptors and evaluated the results experimentally. Combining HOG and SURF features gave better results; we then added the HSV colour descriptor to this combination and finally arrived at the present approach.

III. PROPOSED SYSTEM

The proposed system works in two phases, training and annotation. The training phase proceeds in two steps: first, SURF [11], HOG [18], and colour features [13] are extracted from the training image set; second, the average of these descriptors is calculated and used as a representative model for each class. The annotation phase proceeds in three steps: first, the features are extracted and synthesized, as in the training phase, for the entire test image set; next, the images are classified using the supervised fuzzy K-NN classification algorithm; and finally, the classes are annotated based on the model created in the training phase. The system architecture of the proposed system is shown in Fig. 1.

A. Feature Extraction

The accuracy of any annotation system depends mainly on the feature extraction stage. The proposed system is based on the extraction of three features: the HSV colour feature, HOG, and SURF.

i) SURF

From a study of previous works in the area, it is seen that some features work well for some classes and some for others. SURF is invariant to transformations such as scaling and rotation. It is also fast to compute compared with similar methods such as SIFT (Scale Invariant Feature Transform). The extraction of the SURF feature is based on the concept of the integral image [16], which is calculated using equation 1.

$$I_{\Sigma}(X) = \sum_{i=0}^{i \leq x} \sum_{j=0}^{j \leq y} I(i,j) \qquad (1)$$

The Hessian matrix [15] is computed using equation 2. The high-intensity points are then identified using the principle


that the locations where the determinant of the Hessian matrix attains a maximum indicate the high-intensity points.

$$\mathcal{H}(X,\sigma) = \begin{bmatrix} L_{xx}(X,\sigma) & L_{xy}(X,\sigma) \\ L_{xy}(X,\sigma) & L_{yy}(X,\sigma) \end{bmatrix} \qquad (2)$$

Here $L_{xx}(X,\sigma)$ denotes the convolution of the Gaussian second-order derivative $\frac{\partial^{2}}{\partial x^{2}}g(\sigma)$ with the image $I$ at point $X$, and similarly for $L_{xy}(X,\sigma)$ and $L_{yy}(X,\sigma)$. To make the feature scale invariant, the image is first scaled with all possible scaling factors, and the high-intensity points are calculated separately in each scaled image. This yields a number of interest points for the image. A strongest-point selection procedure is then applied to these interest points, discarding those below a threshold. The next step is to describe these points: a 64-dimensional feature vector is extracted for each detected key point. Finally, the feature matrix must be converted to a single feature vector; since only a few salient interest points are needed to identify an image, Rayleigh estimation [15] is used for this.
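As an illustration, a minimal Python sketch of SURF key-point description follows, assuming an opencv-contrib build that still ships the (patented) SURF implementation. The paper's Rayleigh-based reduction of the N x 64 descriptor matrix is replaced here by a simple mean, and the function name is ours:

```python
import cv2
import numpy as np

def surf_vector(gray, hessian_threshold=400):
    """Detect SURF key points and reduce the N x 64 matrix to one 64-D vector."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold)
    keypoints, desc = surf.detectAndCompute(gray, None)  # desc: N x 64
    if desc is None or len(desc) == 0:
        return np.zeros(64)
    # Stand-in for the paper's Rayleigh estimation: average the key points.
    return desc.mean(axis=0)
```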

ii) HOG

SURF shows considerably poorer results for humans and other such classes, whereas HOG shows good performance, so we fused the SURF descriptor with HOG for better results across all classes. The Histogram of Oriented Gradients (HOG) descriptor gives the frequencies of gradient orientations in localized portions of an image. The method is similar to edge orientation histograms and SIFT descriptors, but the difference is that HOG is computed over a dense grid of uniformly spaced cells and, to improve accuracy, uses an overlapping local contrast normalization procedure. In the first step, gradient values are computed in one or both of the horizontal and vertical directions by simply applying the 1-D centred point discrete derivative mask along the required directions.

Specifically, this method requires filtering the colour or intensity data with the following kernels:

$$D_x = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}, \qquad D_y = \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}^{T}$$
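A small illustrative helper (not from the paper) applying these centred masks with SciPy, yielding the gradient magnitude and unsigned orientation used by the histogram step below:

```python
import numpy as np
from scipy.ndimage import correlate1d

def gradients(gray):
    """Centred differences D_x, D_y, then per-pixel magnitude and orientation."""
    gray = gray.astype(float)
    gx = correlate1d(gray, [-1, 0, 1], axis=1)  # horizontal derivative
    gy = correlate1d(gray, [-1, 0, 1], axis=0)  # vertical derivative
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned bins
    return magnitude, orientation
```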

In the second step, cell histograms are created: each pixel within a cell casts a weighted vote for an orientation-based histogram channel, based on the values found in the gradient computation. The cells are then grouped into larger, spatially connected blocks. This grouping is required to locally normalize the gradient strengths, in order to account for changes in illumination and contrast. The HOG feature vector is then obtained from the components of the normalized cell histograms of all the block regions. Blocks may overlap, so that each cell contributes more than once to the final descriptor. The structure of the descriptor is 3x3 cell blocks of 6x6-pixel cells with 9 histogram channels, giving a descriptor of size 81.

Fig. 1. System architecture of the proposed system
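A hedged sketch of this HOG configuration using scikit-image's `hog`; resizing to an 18x18 patch (3x3 cells of 6x6 pixels, hence a single block and 3x3x9 = 81 components) is our assumption, since the paper does not state how a fixed-length vector is obtained:

```python
from skimage.feature import hog
from skimage.transform import resize

def hog_vector(gray):
    """81-D HOG: 9 orientations, 6x6-pixel cells, 3x3-cell blocks."""
    patch = resize(gray, (18, 18))  # 3x3 cells of 6x6 px -> exactly one block
    return hog(patch, orientations=9, pixels_per_cell=(6, 6),
               cells_per_block=(3, 3), block_norm='L2-Hys')  # 3*3*9 = 81 dims
```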

iii) HSV Colour Feature

Colour is an essential feature in many cases, so in this system we also fuse a colour descriptor with SURF and HOG. We used the HSV [10] colour space, which represents an image by its hue, saturation, and value parameters. To reduce the computational difficulty, we quantize the HSV colour space into non-equal intervals. According to the differing human perception of colours, the quantized hue (H), saturation (S), and value (V) parameters are calculated using equation 3.

$$H = \begin{cases} 0, & h \in (315^{\circ}, 20^{\circ}] \\ 1, & h \in (20^{\circ}, 40^{\circ}] \\ 2, & h \in (40^{\circ}, 75^{\circ}] \\ 3, & h \in (75^{\circ}, 155^{\circ}] \\ 4, & h \in (155^{\circ}, 190^{\circ}] \\ 5, & h \in (190^{\circ}, 270^{\circ}] \\ 6, & h \in (270^{\circ}, 295^{\circ}] \\ 7, & h \in (295^{\circ}, 315^{\circ}] \end{cases} \qquad S = \begin{cases} 0, & s \in [0, 0.2] \\ 1, & s \in (0.2, 0.7] \\ 2, & s \in (0.7, 1] \end{cases} \qquad V = \begin{cases} 0, & v \in [0, 0.2] \\ 1, & v \in (0.2, 0.7] \\ 2, & v \in (0.7, 1] \end{cases} \qquad (3)$$

By the quantization level above, a one-dimensional feature vector G is formed from different values of H, S, and V with different weight.

$$G = H \, Q_S \, Q_V + S \, Q_V + V \qquad (4)$$

where $Q_S$ is the quantization level of S and $Q_V$ is the quantization level of V; here both of these values are equal to 3. So, in this case, $G = 9H + 3S + V$.

Thus, a ten-dimensional colour vector is extracted for the image. The resultant vector was normalized to bring its components into the same range and avoid deviations in matching. Combining the three descriptors, a 155-dimensional feature vector is obtained for each image in the training set.
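A minimal sketch of the quantization in equations (3) and (4). The hue bin boundaries follow the common non-uniform scheme this formula is usually paired with, and the 10-bin histogram is our reading of the paper's "ten-dimensional colour vector"; both are assumptions:

```python
import numpy as np

H_EDGES = [20, 40, 75, 155, 190, 270, 295, 315]  # assumed non-uniform hue bins

def quantize_hsv(h, s, v):
    """Map one HSV pixel (h in degrees, s and v in [0, 1]) to G = 9H + 3S + V."""
    H = next((i for i, edge in enumerate(H_EDGES) if h <= edge), 0)  # h > 315 wraps to red
    S = 0 if s <= 0.2 else (1 if s <= 0.7 else 2)
    V = 0 if v <= 0.2 else (1 if v <= 0.7 else 2)
    return 9 * H + 3 * S + V

def hsv_vector(hsv_pixels, bins=10):
    """Normalized 10-bin histogram of G over all pixels of the image."""
    g = np.array([quantize_hsv(h, s, v) for h, s, v in hsv_pixels])
    hist, _ = np.histogram(g, bins=bins, range=(0, 72))
    return hist / max(hist.sum(), 1)  # normalization as described in the text
```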

B. Training Phase

k images from each class constitute the training data set. The 155-dimensional fused feature vector is calculated for all k images in each class. The average descriptor value for each class is then calculated and stored as the training model. This model is used in the annotation phase for classification.
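A minimal sketch of the fusion and class-model computation under the description above; the helper names are ours, with the per-feature extractors sketched earlier:

```python
import numpy as np

def fused_vector(surf64, hog81, hsv10):
    """Concatenate the three descriptors into one 155-D vector (64 + 81 + 10)."""
    return np.concatenate([surf64, hog81, hsv10])

def build_class_models(vectors_by_class):
    """Average the fused vectors of each class to form its representative model."""
    return {name: np.mean(np.stack(vecs), axis=0)
            for name, vecs in vectors_by_class.items()}
```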

C. Annotation Phase

Feature extraction in the annotation phase is done in the same way as in the training phase. A 155-dimensional feature vector is extracted for each image, and a descriptor matrix is created by putting together the descriptors of each image in the target dataset; each column holds the 155-dimensional descriptor of one image. The training matrix modelled in the training phase contains 155-dimensional column vectors representing each class. The classical K-NN classification algorithm was modified using fuzzy concepts to make it a fuzzy K-NN classification, which assigns a fuzzy class membership value to every input pattern. Euclidean distance, given by equation 6, was used to measure the degree of similarity between an unknown vector and all the vectors in the training set.

$$d(x, y) = \sqrt{\sum_{i=1}^{155} (x_i - y_i)^2} \qquad (6)$$

Fuzzy K-NN predicts the class membership value of a test image using equation 7,

$$u_i(x) = \frac{\sum_{j=1}^{K} u_{ij} \left( 1 / \left\| x - x_j \right\|^{2/(m-1)} \right)}{\sum_{j=1}^{K} \left( 1 / \left\| x - x_j \right\|^{2/(m-1)} \right)} \qquad (7)$$

where $u_i(x)$ is the membership of the test vector $x$ in class $i$, $u_{ij}$ is the membership of the $j$-th nearest neighbour in class $i$, and $m$ is the fuzzifier.

We thus obtain a test image data set whose class numbers are predicted using fuzzy K-NN classification.

The classification result of fuzzy K-NN is used to annotate the test images. Class numbers and the corresponding class names are stored in the annotation database; depending on the class number, the class name is retrieved and assigned to each test image.
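A hedged sketch of the membership computation of equation 7, assuming the standard Keller-style formulation with fuzzifier m = 2; all names are illustrative:

```python
import numpy as np

def fuzzy_knn_memberships(x, train_vecs, train_memberships, k=3, m=2):
    """x: 155-D test vector; train_vecs: N x 155; train_memberships: N x C."""
    dists = np.linalg.norm(train_vecs - x, axis=1)             # Euclidean, eq. (6)
    nn = np.argsort(dists)[:k]                                 # k nearest neighbours
    w = 1.0 / np.maximum(dists[nn], 1e-12) ** (2.0 / (m - 1))  # eq. (7) weights
    u = (train_memberships[nn] * w[:, None]).sum(axis=0) / w.sum()
    return u  # the class with the largest membership is assigned as annotation
```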

IV. EXPERIMENTAL RESULTS

The proposed system was implemented, and experiments were conducted with the standard Corel 1000 dataset [19], which contains 1000 images spread across 10 classes, comprising photos of sceneries and objects. 300 images, 30 from each class, were used in the training phase, and annotation was performed on the entire dataset of 1000 images. Euclidean distance was used for distance measurement in the classification phase. The system's performance was evaluated statistically by computing parameters such as precision, recall, F-score, and accuracy. The performance matrix is given in Table I, and Table II shows the overall effectiveness of the system using a confusion matrix.
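As a quick check, the per-class figures in Table I follow from the standard definitions of these metrics; for example, the Beach row:

```python
def metrics(tp, fp, fn, tn):
    """Standard per-class metrics from the counts in Table I."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_score, accuracy

# Beach row of Table I: precision ~0.60, recall 0.69, F-score ~0.64, accuracy ~0.92
print(metrics(69, 45, 31, 855))
```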

TABLE I: PERFORMANCE MATRIX

Class name   TP   FP   FN    TN   Precision   Recall   F-score   Accuracy
African      97    0    3   900     1.00       0.97     0.98      0.99
Beach        69   45   31   855     0.60       0.69     0.64      0.92
Building     64   19   36   881     0.77       0.64     0.70      0.95
Bus          94   11    6   889     0.90       0.94     0.92      0.98
Dinosaur     88    7   12   893     0.93       0.88     0.90      0.98
Elephant     66   42   34   858     0.61       0.66     0.63      0.92
Flower       87    8   13   892     0.92       0.87     0.89      0.98
Horse        90   20   10   880     0.82       0.90     0.86      0.97
Meal         78   21   22   879     0.79       0.78     0.78      0.96
Mountain     69   20   31   880     0.78       0.69     0.73      0.95


V. CONCLUSION

Here, we investigated the automatic image annotation problem using a synthesized descriptor of SURF, HOG, and colour features. These well-known algorithms each give good results for specific classes, and when fused into a single descriptor they were found to work well across a wide range of classes. Thus, when the experimented features were synthesized, they behaved as complementary components of visual content description. From the performance analysis, we obtained an overall accuracy of 96%. Moreover, the precision and recall values were found to be promising for real-life applications.

There is much scope for improving the proposed system by changing the classification methods and distance measures. A Support Vector Machine (SVM) may be a good alternative to K-NN, and unsupervised methods may also be tried. More descriptors can be added to the synthesis so that the annotation becomes more robust across scene types. A weighted feature fusion and a confidence-based majority voting scheme may also be implemented in the annotation phase to improve accuracy.

REFERENCES

[1] V. N. Gudivada and J. V. Raghvan, "Special issue on content-based image retrieval systems", IEEE Computer Magazine, 1985.
[2] David G. Lowe, "Object recognition from local scale-invariant features", Int. Conf. on Computer Vision, Corfu, Greece, September 1999, pp. 1150-1157.
[3] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection", Int. Conf. on Computer Vision & Pattern Recognition, 2005, vol. 2, pp. 886-893.
[4] Herbert Bay et al., "Speeded-Up Robust Features (SURF)", Computer Vision and Image Understanding, pp. 346-359, 2008.
[5] Muhammed Anees V et al., "Automatic image annotation using SURF descriptors", India Conf. (INDICON), Annu. IEEE Conf., Kochi, India, 2012, pp. 920-924.
[6] Ch. Kavitha et al., "Image retrieval based on color and texture features of the image sub-blocks", International Journal of Computer Applications (0975-8887), vol. 15, no. 7, February 2011.
[7] E. Akbas, "Automatic image annotation by ensemble of visual descriptors", Master's thesis, Middle East Technical University, Ankara, Turkey, 2006.
[8] Hui Yu et al., "Color texture moments for content-based image retrieval", Proc. of Int. Conf. on Image Processing, June 2002, vol. 3, pp. 929-932.
[9] B. S. Manjunath et al., "Color and texture descriptors", IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, June 2001.
[10] Saman H. Cooray and Noel E. O'Connor, "Content-based image descriptors for enhanced person annotation in personal digital photo archives", IEEE International Conference on Signal and Image Processing Applications, Kuala Lumpur, Malaysia, 2009, pp. 18-19.
[11] A. Smeulders et al., "Content-based image retrieval at the end of the early years", IEEE Trans. Pattern Anal. Mach. Intell., 22(12):1349-1380, 2000.
[12] Cao LiHua et al., "Research and implementation of an image retrieval algorithm based on multiple dominant colors", Journal of Computer Research & Development, vol. 36, no. 1, pp. 96-100, 1999.
[13] David G. Lowe, "Object recognition from local scale-invariant features", Proceedings of the International Conference on Computer Vision, vol. 2, pp. 1150-1157, 1999.
[14] Paul Viola and Michael Jones, "Rapid object detection using a boosted cascade of simple features", CVPR, pp. 511-518, 2001.
[15] David L. Olson and Dursun Delen, "Advanced Data Mining Techniques", Springer, 1st edition, p. 138, 2008.
[16] M. Idrissa and M. Acheroy, "Texture classification using Gabor filters", Pattern Recognition Lett., vol. 23, pp. 1095-1102, 2002.
[17] James Z. Wang et al., "SIMPLIcity: Semantics-sensitive Integrated Matching for Picture LIbraries", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, pp. 947-963, 2001.
[18] H. Ninomiya et al., "An evaluation on robustness and brittleness of HOG features of human detection", Frontiers of Computer Vision (FCV), 17th Korea-Japan Joint Workshop, pp. 1-5, Feb. 2011.

TABLE II: CONFUSION MATRIX FOR THE EXPERIMENTAL STUDY
(rows: actual class; columns: predicted class)

            African  Beach  Building  Bus  Dinosaur  Elephant  Flower  Horse  Meal  Mountain
African          97      1         0    1         1         0       0      0     0         0
Beach             0     69         7    2         1         5       1      1     4        10
Building          0     11        64    6         0        11       1      0     5         2
Bus               0      0         4   94         0         2       0      0     0         0
Dinosaur          0      5         0    0        88         6       0      0     0         1
Elephant          0      4         4    0         4        66       0      9     7         6
Flower            0      0         1    0         0         3      87      6     3         0
Horse             0      3         1    2         0         3       0     90     0         1
Meal              0      6         0    0         0         7       4      4    78         1
Mountain          0     15         3    0         1         5       3      0     4        69
