
Expert Systems with Applications 38 (2011) 11472–11481

Contents lists available at ScienceDirect

Expert Systems with Applications

journal homepage: www.elsevier.com/locate/eswa

Visual content representation using semantically similar visual words

Kraisak Kesorn a,*, Sutasinee Chimlek b, Stefan Poslad c, Punpiti Piamsa-nga b

a Department of Computer Science and Information Technology, Naresuan University, Phitsanulok 65000, Thailand
b Department of Computer Engineering, Kasetsart University, 50 Phahon Yothin Rd., Chatuchak, Bangkok 10900, Thailand
c School of Electronic Engineering and Computer Science, Queen Mary, University of London, Mile End Rd., London E1 4NS, United Kingdom


Keywords: Bag-of-visual words; SIFT descriptor; Visual content representation; Semantic visual word

0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2011.03.021

* Corresponding author. Tel.: +66 81555 7499. E-mail addresses: [email protected], [email protected] (K. Kesorn).

Local feature analysis of visual content, namely using Scale Invariant Feature Transform (SIFT) descriptors, has been deployed in the "bag-of-visual words" model (BVW) as an effective method to represent visual content information and to enhance its classification and retrieval. The key contributions of this paper are as follows. First, a novel approach for visual word construction is presented which takes the physical spatial information, angle, and scale of keypoints into account in order to preserve the semantic information of objects in visual content and to enhance the traditional bag-of-visual words. Second, a method is given to identify and eliminate similar keypoints, to form semantic visual words of high quality and to strengthen the discrimination power for visual content classification. Third, an approach is specified to discover a set of semantically similar visual words and to form visual phrases that represent visual content more distinctively, leading to a narrowing of the semantic gap.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Research in multimedia retrieval has been actively conducted for many years. Several main schemes are used by visual content (e.g. image and video) retrieval systems to retrieve visual data from collections, such as content-based image retrieval (CBIR) (Bach et al., 1996; Rui, Huang, & Chang, 1999; Smeulders, Worring, Santini, Gupta, & Jain, 2000; Smith & Chang, 1996), automatic classification of objects and scenes (Forsyth & Fleck, 1997; Naphade & Smith, 2003; Tseng, Lin, Naphade, Natsev, & Smith, 2003), and image and region labeling (Barnard et al., 2003; Duygulu, Barnard, Freitas, & Forsyth, 2002; Hironobu, Takahashi, & Oka, 1999). Research on CBIR started by using low-level features, such as color, texture, shape, structure, and spatial relationships, to represent the visual content. Typically, research on CBIR is based on two types of visual features: global and local features. Global feature-based algorithms aim at recognizing objects in visual content as a whole. First, global features (e.g. color, texture, shape) are extracted and then statistical classification techniques (e.g. Naïve Bayes, SVM) are applied. Global feature-based algorithms are simple and fast. However, there are limitations in the reliability of object recognition under changes in image scale, viewpoint, illumination, and rotation. Thus, local features are also being used.

Several advantages of using local rather than global features for object recognition and visual content categorization have been addressed by Lee (2005). Local feature-based algorithms focus mainly on keypoints. Keypoints are salient patches that contain rich local information about visual content. Moravec (1977) defined the concept of a "point of interest" as a distinct region in an image that can be used to match other regions in consecutive image frames. The use of the Harris corner detector (Harris & Stephens, 1988) to identify interest points and to create a local image descriptor at each interest point from a rotationally invariant descriptor, in order to handle arbitrary orientation variations, was proposed in Schmid and Mohr (1997). Although this method is rotation invariant, the Harris corner detector is sensitive to changes in image scale (Alhwarin, Wang, Ristic-Durrant, & Graser, 2008) and, therefore, it does not provide a good basis for matching images of different sizes. Lowe (1999) overcame such problems by detecting key locations over the image and its scales through the use of local extrema in a Difference-of-Gaussians (DoG). Lowe's descriptor is called the Scale Invariant Feature Transform (SIFT). SIFT is an algorithm for visual feature extraction that is invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine projection. Further improvements to object recognition techniques based on SIFT descriptors have been presented by Ke and Sukthankar (2004), who improved the SIFT technique by applying Principal Components Analysis (PCA) to make local descriptors more distinctive, more robust to image deformations, and more compact than the standard SIFT representation. Consequently, this technique increases image retrieval accuracy and matching speed.


Recently, keypoints represented by SIFT descriptors have also been used in a technique called the "bag-of-visual words" (BVW). The BVW visual content representation has drawn much attention from the computer vision community, as it tends to code the local visual characteristics towards the object level (Zheng, Neo, Chua, & Tian, 2008). The main advantages of the BVW technique are its simplicity and its invariance to transformations as well as to occlusion and lighting (Csurka, Dance, Fan, Willamowski, & Bray, 2004). There are hundreds of publications about visual content representation using the BVW model, as it is a promising method for visual content classification (Tirilly, Claveau, & Gros, 2008), annotation (Wu, Hoi, & Yu, 2009), and retrieval (Zheng et al., 2008).
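The SIFT/DoG extraction step described above can be reproduced with standard tools. The following sketch (not the authors' code) assumes the opencv-python package (OpenCV >= 4.4, where SIFT is in the main module) and a placeholder image file; it extracts keypoints together with the location, scale and angle properties used later in this paper, plus their 128-dimensional descriptors.

```python
# Minimal sketch of DoG/SIFT keypoint extraction with OpenCV.
import cv2

img = cv2.imread("sample_image.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# Each keypoint carries the properties the BVW pipeline relies on:
# physical location (x, y), scale and orientation; each descriptor is a
# 128-dimensional vector.
for kp in keypoints[:5]:
    x, y = kp.pt
    print(f"location=({x:.1f}, {y:.1f}) scale={kp.size:.1f} angle={kp.angle:.1f}")
print("descriptor matrix shape:", descriptors.shape)  # (num_keypoints, 128)
```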

The BVW technique is motivated by an analogy with the "bag-of-words" representation for text categorization. This analogy leads to some critical problems, e.g. the lack of semantic information during visual word construction, the ambiguity of visual words, and high computational cost. Therefore, this paper proposes a framework to generate a new representation model which preserves semantic information throughout the BVW construction process and which addresses three difficulties: (i) loss of semantics during visual word generation; (ii) discovery of similar keypoints and non-informative visual words; and (iii) identification of semantically similar visual words. In the remaining sections of this paper, we propose and analyze our solution.

2. Key contributions

A framework for semantic content-based visual content retrieval is proposed. It reduces similar keypoints and preserves semantic relations between keypoints and objects in the visual content. This generates visual words for visual content representation by reducing the dimensionality of the keypoints in feature space and enhancing the clustering results; it also reduces the computation cost and memory needed. Visual heterogeneity is a critical challenge for content-based retrieval. This paper proposes an approach to find semantically (associatively) similar visual words using a similarity matrix based on the semantic local adaptive clustering (SLAC) algorithm to resolve this problem. In other words, semantically related visual content can be recognized even though it has different visual appearances, using a set of semantically similar visual words. Experimental results illustrate the effectiveness of the proposed technique, which can capture the semantics of visual content efficiently, resulting in higher classification and retrieval accuracy.

The rest of this paper is organized as follows. Section 3 reviews related work. Sections 4 and 5 describe the proposed technique. Section 6 discusses the experimental results. Section 7 concludes the paper by presenting an analysis of the strengths and weaknesses of our method and describes further work.

3. Related work

Hundreds of papers about visual content representation using local features have been published over the last two decades. This survey focuses on visual content representation based upon Scale Invariant Feature Transform (SIFT) descriptors and the bag-of-visual words model, because these methods are the major techniques and are the basis of the method in this paper. The survey focuses on three major limitations of visual content classification and retrieval: loss of semantics during visual word generation; similar keypoint and non-informative visual word reduction; and semantically similar visual word discovery.

3.1. Loss of semantics during visual word generation

The main disadvantage of existing methods (Jiang & Ngo, 2009; Yuan, Wu, & Yang, 2007a; Zheng et al., 2008) is the lack of spatial information, i.e., the physical location, between keypoints during visual word construction when only a simple k-means clustering algorithm is used. To tackle this issue, Wu et al. (2009) tried to preserve the semantic information of visual content during visual word generation by manually separating objects in the visual content during a training phase. All detected keypoints that are relevant are therefore put into the same visual word for each object category, so that the linkage between the visual words and the high-level semantics of the object category can be obtained. One possible way to preserve semantic information is to generate visual words based upon the physical location of keypoints in the visual content. Our hypothesis is that nearby keypoints in the visual content are more relevant and can represent the semantic information of the visual content more effectively. Therefore, rather than using a manual object separation scheme to obtain the relevant keypoints, or using a simple k-means clustering algorithm, a technique is needed that clusters relevant keypoints together based on their physical locations and thereby preserves the semantic information between keypoints and visual content, to improve the quality of visual words.

3.2. Similar keypoints and non-informative visual words

Keypoints are salient patches that contain rich local information in an image. They can be automatically detected using various detectors, e.g. the Harris corner detector (Harris & Stephens, 1988) and DoG (Lowe, 2004), and can be represented by many descriptors, e.g. the SIFT descriptor. However, some of these keypoints may not be informative for the clustering algorithm, as they are redundant and can even degrade the clustering performance. Therefore, similar keypoints should be detected in order to reduce noise and to enhance the clustering results. To the best of our knowledge, the reduction of similar keypoints has never been addressed before; researchers in this area focus instead on the elimination of unimportant visual words. Wu et al. (2009) consider noisy visual words using the range of a visual word (the maximum distance of a keypoint, i.e. feature, to the center of its cluster, i.e. visual word). If a keypoint is inside the range of any visual word, the keypoint is assigned to that visual word; otherwise the keypoint is discarded. The weakness of this technique is that it is computationally expensive, because every keypoint needs to be compared against the range of every visual word's center. For example, if there are N keypoints and M visual words, the complexity of this algorithm is O(MN). This also leads to a scalability problem for large-scale visual content systems.
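For illustration only, the following sketch (an assumption-laden reading of the range-based filtering attributed to Wu et al. (2009), not their code) makes the O(MN) cost explicit: every one of the N keypoints is compared against every one of the M visual word centres.

```python
# Sketch of range-based keypoint filtering: assign a keypoint to its nearest
# visual word only if it falls within that word's range, otherwise discard it.
import numpy as np

def assign_or_discard(keypoints, centers, ranges):
    """keypoints: (N, 128); centers: (M, 128); ranges: (M,) max radius per word."""
    assignments = []                                    # word index, or None if discarded
    for kp in keypoints:                                # N iterations ...
        dists = np.linalg.norm(centers - kp, axis=1)    # ... each costing M distances -> O(MN)
        nearest = int(np.argmin(dists))
        assignments.append(nearest if dists[nearest] <= ranges[nearest] else None)
    return assignments
```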

Besides the similar keypoint issue, some of the generated visual words may be uninformative when representing visual content and may degrade the categorization capability. Non-informative visual words are insignificant local image patterns that are useless for retrieval and classification. These visual words need to be eliminated in order to improve the accuracy of the classification results and to reduce the size of the visual word vector space model and the computation cost. Rather than identifying unimportant visual words, Yuan et al. (2007a) have attempted to discover unimportant information for larger units of visual words, meaningless phrases (word-sets created from a frequent itemset data mining algorithm), by determining the likelihood ratio (a statistical significance measure) of those visual phrases. The top-k most meaningful word-sets with the largest likelihood ratios are selected, and the rest are considered meaningless visual phrases and discarded. However, this method ignores the coherency (the ordering of visual words) of the component visual words in a visual phrase.


Tirilly et al. (2008) deployed probabilistic Latent Semantic Analysis (pLSA) to eliminate the noisiest visual words: every visual word w whose probability given a concept z, Pr(w|z), is lower than |w|/|z| is considered an irrelevant visual word, since it is not informative for any concept. The main disadvantage of this method is that it ignores the correlations between a word and the other concepts in the collection. Some words might appear rarely in one concept but frequently in other concepts, and these words could be feature words. Deleting low-probability words therefore decreases the accuracy of categorization.

3.3. Semantically similar visual words

Visual heterogeneity is one of the greatest challenges when categorization and retrieval rely solely on visual appearance. For example, different visual appearances might be semantically similar at a higher level of semantic conceptualization. One of the challenges for the BVW method is to discover a relevant group of visual words which have semantic similarity. Recently, a number of efforts have been reported, including, among others, the use of the probability distributions of visual word classes (Zheng et al., 2008), which is based upon the hypothesis that semantically similar visual content will share a similar class probability distribution. Yuan et al. (2007a) and Yuan, Wu, and Yang (2007b) overcome this problem by proposing a pattern summarization technique that clusters correlated visual phrases into phrase classes; any phrases in the same class are considered synonym phrases. A hierarchical model is exploited to tackle the semantic similarity of visual content by Jiang and Ngo (2009). A soft-weighting scheme is proposed to measure the relatedness between visual words and the hierarchical model constructed by an agglomerative clustering algorithm, and then to capture is-a type relationships between visual words. Although these methods can discover semantically related visual word sets, there are some remaining issues that need to be overcome. Identifying semantically related or synonym visual words based on probability distributions (Zheng et al., 2008) might not always be reliable, because unrelated ones can accidentally have a similar probability distribution. Finding semantically similar visual words based only on the distance between visual words in the vector space model, e.g. (Jiang & Ngo, 2009; Yuan et al., 2007a), is not effective if those visual words are generated by a simple clustering algorithm, because distances in the feature space do not represent the semantic information of visual words.

To this end, this paper proposes a framework to discover semantically similar visual words which preserves the semantic information throughout the BVW construction process and improves upon these three major limitations of visual content classification and retrieval.

Fig. 1. The semantically similar visual word discovery framework architecture.

4. A semantic-based bag-of-visual words framework

Before going into the details of our proposed system, we will briefly give an overview of the framework (Fig. 1) as follows:

(1) Feature detection: extracts several local patches which are considered candidates for the basic elements, "visual words". Interest point detectors detect the 'keypoints', or salient image patches, in an image using a well-known algorithm. For example, the Difference of Gaussian (DoG) detector (Lowe, 2004) is used in our method to automatically detect keypoints in images. The DoG detector provides a close approximation to the scale-normalized Laplacian of Gaussian and produces very stable image features compared to a range of other possible image functions, such as the gradient, Hessian, and Harris corner detectors.

(2) Feature representation: each image is abstracted in terms of several local patches. Feature representation methods deal with how to represent the patches as numerical vectors; these methods are called feature descriptors. One of the most well-known descriptors is SIFT. SIFT converts each patch to a 128-dimensional vector. After this step, each visual content item is a collection of vectors of the same dimension (128 for SIFT); the order of the different vectors is of no importance.

(3) Semantic visual words construction: this step converts the vectors representing patches into "visual words" and produces a "bag-of-visual words" that is represented as a vector (histogram). A visual word can be considered representative of several similar patches. One simple method performs clustering (e.g. the k-means or x-means clustering algorithm) over all the vectors. Each cluster is considered a visual word that represents a specific local pattern shared by the keypoints in that cluster. This representation is analogous to the bag-of-words document representation in terms of form and semantics, because the bag-of-visual-words representation can be converted into a visual-word vector similar to the term vector for a document. (A code sketch of this quantization step follows this list.)

(4) Non-informative visual words identification: some of the generated visual words may not be useful for representing visual content. Hence, this kind of visual word needs to be detected and removed in order to reduce the size of the visual word feature space and the computation cost. This can be done using the Chi-square model.

(5) Semantically similar visual words discovery: a single visual word cannot represent the content of an image because it may have multiple word senses; in other words, a visual word is ambiguous. To tackle this issue, associatively similar visual words can help to disambiguate word senses and represent visual content more distinctively. To find these associatively similar visual words, a semantic local adaptive clustering technique (AlSumait & Domeniconi, 2008) is used to complete this task.
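The sketch below illustrates steps (1)-(3) under stated assumptions: OpenCV provides the SIFT descriptors and scikit-learn's KMeans stands in for the simple clustering baseline of step (3) (the paper itself favours x-means, which does not require the number of clusters to be fixed in advance). It is not the authors' implementation.

```python
# Sketch: build a visual vocabulary and bag-of-visual-words histograms.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_vocabulary(image_paths, n_words=500):
    # Pool all descriptors and quantize them into n_words clusters (visual words).
    all_desc = np.vstack([sift_descriptors(p) for p in image_paths])
    return KMeans(n_clusters=n_words, n_init=5, random_state=0).fit(all_desc)

def bvw_histogram(path, vocabulary):
    # Map each descriptor to its nearest visual word and count occurrences.
    words = vocabulary.predict(sift_descriptors(path))
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)   # normalised bag-of-visual-words vector
```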


[Fig. 2 illustration: similar keypoints discovery; x-means clustering; finding the centroid of each new group; vector space model for visual word construction.]

Fig. 2. Similar keypoints are merged together using a new centroid of the cluster as a representative keypoint. Those representative keypoints are used to construct the semantic visual words.

Table 1
The 2 × p contingency table of {x̄}_i.

                          C1     C2     ...    Ck     Total
{x̄}_i appears             n11    n12    ...    n1k    n1+
{x̄}_i does not appear     n21    n22    ...    n2k    n2+
Total                     n+1    n+2    ...    n+k    N



In this paper, the first two steps will not be discussed in detail, since our work makes no contributions to those areas. Instead, we focus our framework on semantic visual word construction and semantically similar visual word discovery.

4.1. Semantic visual word construction

The high dimensionality of keypoints in feature space can lead to the feature space containing noise and redundancy. These kinds of keypoints directly affect the quality of visual words, leading to the following serious drawbacks. First, noisy keypoints can confuse the clustering algorithm, resulting in poor visual word quality. Second, large numbers of keypoints lead to a large visual word feature space and a subsequently high computation cost. Therefore, we propose a novel method for visual word generation which aims to eliminate noisy keypoints and reduce similar keypoints, as well as to preserve the semantic information between those keypoints, in order to generate semantic-preserving visual words, in short, semantic visual words.

4.1.1. Similar keypoints detection

Similar keypoints (j) are identified by considering their physical location in the visual content. The keypoints generated from the DoG algorithm provide four useful properties: physical coordinates (x, y), angle, scale, and the SIFT descriptor. Our hypothesis is that keypoints located at nearby positions in the visual content could potentially be similar, so we can group them together and find a representative keypoint (n) for those similar keypoints. Therefore, we discover similar keypoints using the coordinates, angle, and scale of the keypoint and its SIFT descriptor. Similar keypoints are grouped together using the x-means algorithm (Fig. 2). Let j be a set of similar keypoints (k), j_i = {k_1, k_2, . . . , k_n} where n ≥ 1. This serves to connect the low-level features to high-level semantic objects, and thus the semantic information is preserved. Having grouped the similar keypoints, we use the center value of the group to represent the whole group of similar keypoints, and this representative value is used to generate visual words. In each j, the average value of the SIFT descriptors is used as n to generate the semantic visual words ({x̄}). Consequently, the number of keypoints in feature space is reduced, and only the n are used to generate {x̄} using the x-means algorithm. The main benefit of the x-means algorithm over the k-means algorithm is its speed; it does not require the number of clusters (the k value) to be specified. At this stage, we obtain {x̄}, which is improved in contrast to traditional models because noisy and similar keypoints are reduced with respect to the original visual content using n. In addition, semantic information is preserved via the similar keypoints, which are the building blocks of n and are connected to the high-level semantics of the visual content by their physical coordinates.
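A minimal sketch of this reduction step is given below. It assumes scikit-learn's KMeans as a stand-in for x-means (x-means additionally chooses the number of groups itself), and it clusters keypoints on the four properties named above before averaging the SIFT descriptors of each group into a representative keypoint; feature scaling is omitted for brevity.

```python
# Sketch of similar-keypoint grouping and representative-keypoint computation.
import numpy as np
from sklearn.cluster import KMeans

def representative_keypoints(xy, angle, scale, desc, n_groups):
    """xy: (N, 2); angle, scale: (N,); desc: (N, 128) SIFT descriptors."""
    # Cluster on the keypoint properties used in the paper:
    # coordinates, angle, scale and SIFT descriptor.
    feats = np.hstack([xy, angle[:, None], scale[:, None], desc])
    labels = KMeans(n_clusters=n_groups, n_init=5, random_state=0).fit_predict(feats)
    reps = []
    for g in range(n_groups):
        members = desc[labels == g]
        if len(members):
            # The average descriptor of the group acts as the representative
            # keypoint later used to build the semantic visual words.
            reps.append(members.mean(axis=0))
    return np.vstack(reps)
```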

4.1.2. Non-informative visual word identification and removal

Non-informative visual words are local visual content patterns that are not useful for retrieval and classification tasks. They are relatively 'safe' to remove in the sense that their removal does not cause a significant loss of accuracy but significantly improves the classification accuracy and computational efficiency of categorization (Yang & Wilbur, 1996). By analogy with text, a text document usually contains unimportant words, so-called stop words (a, an, the, before, after, without, etc.). These stop words need to be removed before further processing, e.g. text categorization, to reduce noise and computation costs. Likewise, in visual data processing there exist unimportant visual words, the so-called non-informative visual words {w}. Non-informative visual words are insignificant local visual content patterns that do not contribute to visual retrieval and classification. These visual words need to be eliminated in order to improve the accuracy of the results and to reduce the size of the visual word feature space and the computation cost. In this paper, we utilize a statistical model to automatically discover {w} and eliminate them to strengthen the discrimination power. Yang, Jiang, Hauptmann, and Ngo (2007) evaluate several techniques usually used for feature selection in machine learning and text retrieval, e.g. document frequency, Chi-square statistics, and mutual information. In contrast to Yang et al. (2007), in our framework non-informative visual words are identified based on the document frequency method and the statistical correlation of visual words.

Definition 1. Non-informative visual words {w}

A visual word v ∈ V, V = {v1, v2, . . . , vn}, n ≥ 1, is uninformative if it:

1. Usually does not appear in much of the visual content in the collection (Yang et al., 2007; Yang & Pedersen, 1997), i.e. it has a low document frequency (DF).

2. Has a small statistical association with all the concepts in thecollection (Hao & Hao, 2008).

Hence, {w} can be extracted from the visual word feature space using the Chi-square model. Having created {x̄} in the previous step, {x̄} is quantized into a Boolean vector space model to express each visual content vector.


Assuming that the appearance of visual word i ({x̄}_i) is independent of any concept C, C ∈ Z, Z = {C1, C2, . . . , Cn} where n ≥ 1, the correlation between {x̄}_i and its concepts can be expressed in the form of a 2 × p contingency table, as shown in Table 1.

Definition 2. The Boolean vector space model.

The Boolean vector space model B = {V_i}_{i=1}^{N} contains a collection of N visual words. A binary matrix X_{N×M} represents B, where x_ij = 1 denotes that the visual content i contains the visual word j in the vector space, and x_ij = 0 otherwise, where 1 ≤ i ≤ N and 1 ≤ j ≤ M.

Definition 3. The 2 × p contingency table.

The 2 × p contingency table is T = {n_ij}_{j=1}^{k}, 1 ≤ i ≤ 2. A matrix A_{N×M} represents T, where n_1j is the number of visual content items containing visual word {x̄}_i for the concept C_j; n_2j is the number of visual content items which do not contain visual word {x̄}_i for the concept C_j; n_{+j} is the total number of visual content items for the concept C_j; n_{i+} is the number of visual content items in the collection containing the visual word {x̄}_i; and N is the total number of visual content items in the training set, where

n_{+j} = \sum_{i=1}^{2} n_{ij}, \qquad n_{i+} = \sum_{j=1}^{k} n_{ij},    (1)

N = \sum_{i=1}^{2} \sum_{j=1}^{k} n_{ij} = \sum_{i=1}^{2} n_{i+} = \sum_{j=1}^{k} n_{+j},    (2)

\chi^{2}_{2 \times p} = d = \sum_{i=1}^{2} \sum_{j=1}^{k} \frac{(N n_{ij} - n_{i+} n_{+j})^{2}}{N\, n_{i+}\, n_{+j}}.    (3)

To measure the independence of each {x̄} from all the concepts, the Chi-square statistic (d) is deployed (Eq. (3)). Having calculated this independence, the d values are sorted in descending order. The d value indicates the degree of correlation between {x̄} and the concepts; the smaller the d value, the weaker the correlation. However, a problem exists for terms which appear in a small number of documents, leading them to have a small d value even though they could sometimes be feature words. In such a case, we weight the d value using Eq. (4) (Hao & Hao, 2008), where DF_r denotes the document frequency of the word r:

\chi^{2}_{weighted} = \frac{\chi^{2}_{2 \times p}}{DF_r}.    (4)

This model balances the strength of the dependence between a word and all concepts with the document frequency of the word. As a result, those {x̄} with d values less than a threshold (chosen experimentally) are considered non-informative visual words and removed. The remaining {x̄} are informative and useful for the categorization task. Obviously, {w} identified in this manner is collection specific, which means that by changing the training collection one can obtain a different ordered list.
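A compact sketch of this filter (Eqs. (1)-(4)) follows. It assumes X is the Boolean document-by-visual-word matrix of Definition 2 and that `concepts` holds the concept label of each visual content item; the threshold is, as in the text, chosen experimentally.

```python
# Sketch: DF-weighted chi-square score per visual word; low-scoring words
# are treated as non-informative and removed.
import numpy as np

def weighted_chi_square(X, concepts):
    """X: (N_docs, M_words) binary matrix; concepts: (N_docs,) concept labels."""
    classes = np.unique(concepts)
    scores = np.zeros(X.shape[1])
    for w in range(X.shape[1]):
        present = X[:, w] > 0
        # Build the 2 x p contingency table of Table 1 for this word.
        n1 = np.array([np.sum(present & (concepts == c)) for c in classes])
        n2 = np.array([np.sum(~present & (concepts == c)) for c in classes])
        n_plus_j = n1 + n2                    # column totals
        n_i_plus = np.array([n1.sum(), n2.sum()])
        N = n_plus_j.sum()
        d = 0.0
        for i, row in enumerate((n1, n2)):    # Eq. (3)
            d += np.sum((N * row - n_i_plus[i] * n_plus_j) ** 2
                        / (N * n_i_plus[i] * n_plus_j + 1e-12))
        df = max(int(present.sum()), 1)       # document frequency of the word
        scores[w] = d / df                    # Eq. (4): chi-square weighted by DF
    return scores

# Usage: keep only words whose weighted score reaches the chosen threshold.
# informative = np.where(weighted_chi_square(X, concepts) >= threshold)[0]
```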

Nevertheless, each individual visual word can be ambiguous when it is used alone for classifying visual content. For example, a visual word might represent different semantic meanings in different visual contexts (the polysemy issue), or multiple visual words (different visual appearances) can refer to the same semantic class (the synonymy issue) (Zheng et al., 2008; Zheng, Zhao, Neo, Chua, & Tian, 2008). These problems are crucial and are addressed in the next section.

4.1.3. Associatively similar visual words discovery

One possible way to disambiguate multiple word senses is to combine visual words into a larger unit, a so-called 'visual phrase' (Yuan et al., 2007b; Zheng et al., 2008) or 'visual sentence' (Tirilly et al., 2008). However, visual phrases are usually constructed using the FIM (frequent itemset mining) algorithm, which is purely based on frequent word collocation patterns without taking term weighting and spatial information into account. The latter information is crucial in order to discover the semantic similarity between words. Typically, in text documents there are two ways to form phrases. The first is syntactical: linguistic information is used to form the phrases. The second is statistical, i.e. a vector space model, where co-occurrence information is used to group together words that co-occur more than usual. In this paper, the use of a statistical model to find the similarity of visual words and to construct visual phrases is suitable because visual content does not provide any linguistic information. The vector space model is the most commonly used geometric model for word similarity, with the degree of semantic similarity computed as the cosine of the angle formed by the word vectors. However, there are two different kinds of similarity between words: taxonomic similarity and associative similarity. Taxonomic similarity, or categorical similarity, is the semantic similarity between words at the same level of categories, i.e. between synonyms. Associative similarity is a similarity between words that are associated with each other by virtue of semantic relations other than taxonomic ones, such as collocation relations and proximity. In this paper, we mainly focus on discovering the associatively similar visual words using the semantic local adaptive clustering algorithm (SLAC) (AlSumait & Domeniconi, 2008), which takes term weighting and spatial information (the distance between {x̄}) into account.

The elimination of the noisy visual words ({w}) and the use of co-occurrence information make visual content representation more efficient to process. This can be improved further using semantic similarity techniques, but these are still challenging to use in practice. To tackle the challenge of determining semantic similarity, we exploit the SLAC clustering algorithm to cluster semantically similar words into the same visual phrase. Nevertheless, it is difficult to define the semantic meaning of a visual word, as it is only a set of quantized vectors of sampled regions of visual content (Zheng et al., 2008). Hence, rather than defining the semantics of a visual word in a conceptual manner, we define semantically similar visual words using their positions in a feature space.

Definition 4. Associatively similar visual words.

Associatively similar visual words (u) are a set of semantic visual words ({x̄}) which have different visual appearances but which are associated with each other by virtue of semantic relations other than taxonomic ones, such as collocation relations and proximity. A set of associatively similar visual words is called a visual phrase.

SLAC is a subspace clustering algorithm which extends traditional clustering by capturing local feature relevance within each cluster. To find u and to cluster it, kernel learning methods, local term weightings, and a semantic distance are deployed. A kernel represents the similarity between documents and terms. Semantically similar visual words should be mapped to nearby positions in the feature space. To represent the whole corpus of N documents, the document-term matrix D is constructed. D is an N × D matrix whose rows are indexed by the documents (visual contents) and whose columns are indexed by the terms (visual words). The numerical values in D are the frequencies of term i in document d. The key idea of the technique in this section is to use the semantic distance between pairs of visual words, by defining a local kernel for each cluster as follows:


K_j(d_1, d_2) = \phi(d_1)\, Sem_j\, Sem_j^{T}\, \phi(d_2)^{T},    (5)

Sem_j = R_j P,    (6)

where d is a document in the collection and \phi(d) is its document vector. Sem_j is a semantic matrix which provides additional refinements to the semantics of the representation, P is the proximity matrix defining the semantic similarities between the different terms, and R_j is a local term-weighting diagonal matrix (Fig. 3) corresponding to cluster j, where w_ij represents the weight of term i for cluster j, for i = 1, . . . , D. One simple way to compute the weights w_ij is to use the inverse document frequency (idf) scheme. However, the idf weighting scheme considers only document frequency without taking the distance between terms into account; in other words, it does not consider the inter-semantic relationships among terms. In this paper, therefore, a new weighting measure based on the local adaptive clustering (LAC) algorithm is utilized to construct the matrix R.

Fig. 3. The local term-weighting diagonal matrix (R_j).

LAC gives less weight to features along which the data are loosely correlated, which has the effect of elongating distances along those dimensions (Domeniconi et al., 2007). In contrast, features along which the data are strongly correlated receive a larger weight, which has the effect of constricting distances along those dimensions. Differently from traditional clustering algorithms, LAC clustering is not based only on the points' features but also involves weighted distance information. Eq. (7) shows the formula for the LAC term weight calculation:

w_{ij} = \frac{\exp\left(-\frac{1}{|S_j|} \sum_{x \in S_j} (c_{ji} - x_i)^2 / h\right)}{\sum_{i=1}^{D} \exp\left(-\frac{1}{|S_j|} \sum_{x \in S_j} (c_{ji} - x_i)^2 / h\right)},    (7)

where S_j is a set of N points x in the D-dimensional Euclidean space, c_{ji} is the i-th component of the centroid of cluster j, and the coefficient h ≥ 0 is a parameter of the procedure which controls the relative differences between feature weights. In other words, h controls how much the distribution of weight values deviates from the uniform distribution.

P has nonzero off-diagonal entries, P_ij > 0, when term i is semantically related to term j. To compute P, the Generalized Vector Space Model (GVSM) (Wong, Ziarko, & Wong, 1985) is deployed to capture the correlations of terms by investigating their co-occurrences across the corpus, based on the assumption that two terms are semantically related if they frequently co-occur in the same documents. Since P holds a similarity figure between terms in the form of co-occurrence information, it is necessary to transform it into a distance measure before utilizing it. Eq. (8) shows the transformation formula:

P^{dist}_{ij} = 1 - (P_{ij} / \max(P)),    (8)

where P^{dist}_{ij} is the distance information and max(P) is the maximum entry value in the proximity matrix. Consequently, the semantic dissimilarity matrix for cluster j is a D × D matrix given by Eq. (9):

Sem^{dissim}_j = R_j P^{dist},    (9)

which represents the semantic dissimilarities between the terms with respect to the local term weightings. The SLAC algorithm (Algorithm 1) starts with k initial centroids and equal weights. It partitions the data points, re-computes the weights and data partitions accordingly, and then re-computes the new centroids. The algorithm iterates until convergence or until a maximum number of iterations is exceeded. SLAC uses a semantic distance: a point x is assigned to the cluster j that minimizes the semantic distance of the point from its centroid. The semantic distance is derived from the kernel in Eq. (5) as follows:

L_w(c_l, x) = (x - c_l)\, Sem^{dissim}_l\, (Sem^{dissim}_l)^{T}\, (x - c_l)^{T}.    (10)

Hence, every time the algorithm computes S_j, the semantic matrix must be recomputed by means of the new weights.

Algorithm 1. The associatively similar visual words discovery algorithm

Input: N points x ∈ R^D, k, and h
Output: semantically similar visual word sets (visual phrases)

1. Initialize k centroids c_1, c_2, . . . , c_k;
2. Initialize weights: w_ij = 1/D for each centroid c_j, j = 1, . . . , k, and for each term i = 1, . . . , D;
3. Compute P; then compute P^{dist};
4. Compute Sem^{dissim}_j for each cluster j (Eq. (9));
5. For each centroid c_j and for each point x, set:
   S_j = {x | j = arg min_l L_w(c_l, x)},
   where L_w(c_l, x) = (x - c_l) Sem^{dissim}_l (Sem^{dissim}_l)^T (x - c_l)^T;
6. Compute new weights: for each centroid c_j and for each term i, compute w_ij using Eq. (7);
7. For each centroid c_j: recompute the Sem^{dissim}_j matrix using the new weights w_ij;
8. For each point x: recompute S_j = {x | j = arg min_l L_w(c_l, x)};
9. Compute new centroids: c_j = (Σ_x x · 1_{S_j}(x)) / (Σ_x 1_{S_j}(x)) for each j = 1, . . . , k, where 1_S(·) is the indicator function of set S;
10. Iterate steps 5-9 until convergence or until the maximum number of iterations is exceeded.
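The following is a compact sketch of Algorithm 1 under simplifying assumptions that are not specified in the text: the proximity matrix P is built from raw GVSM-style co-occurrence counts, initial centroids are random data points, and empty clusters are left unchanged. It is meant only to make the flow of steps 1-10 concrete, not to reproduce the authors' implementation.

```python
# Sketch of SLAC over a document-term (visual content x visual word) matrix.
import numpy as np

def slac(D, k, h=1.0, max_iter=20, seed=0):
    """D: (N, T) document-term matrix; returns cluster labels, weights, centroids."""
    rng = np.random.default_rng(seed)
    N, T = D.shape

    # GVSM proximity: terms are related if they co-occur in the same documents.
    B = (D > 0).astype(float)
    P = B.T @ B
    P_dist = 1.0 - P / P.max()                                   # Eq. (8)

    C = D[rng.choice(N, size=k, replace=False)].astype(float)    # step 1
    W = np.full((k, T), 1.0 / T)                                  # step 2

    def sem_dissim(j):                                            # Eq. (9)
        return np.diag(W[j]) @ P_dist

    def assign():                                                 # steps 5 / 8, Eq. (10)
        S = [sem_dissim(l) for l in range(k)]
        labels = np.empty(N, dtype=int)
        for idx, x in enumerate(D):
            dists = [(x - C[l]) @ S[l] @ S[l].T @ (x - C[l]) for l in range(k)]
            labels[idx] = int(np.argmin(dists))
        return labels

    labels = assign()
    for _ in range(max_iter):
        for j in range(k):                                        # step 6, Eq. (7)
            members = D[labels == j]
            if len(members) == 0:
                continue
            spread = np.mean((members - C[j]) ** 2, axis=0)
            num = np.exp(-spread / h)
            W[j] = num / num.sum()
        new_labels = assign()                                     # steps 7-8
        for j in range(k):                                        # step 9
            members = D[new_labels == j]
            if len(members):
                C[j] = members.mean(axis=0)
        if np.array_equal(new_labels, labels):                    # step 10
            break
        labels = new_labels
    return labels, W, C
```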

SLAC clusters visual words according to their degree of relevance and thus generates visual phrases which are semantically related. In addition, randomly generated visual phrases do not occur as they do in the FIM-based methods (Yuan et al., 2007a; Zheng et al., 2008). However, the disadvantage of the SLAC algorithm is its computational complexity. The running time of one iteration is O(kND^2), where k is the number of clusters, N is the number of visual content items, and D is the number of visual words. Since the use of the Chi-square model removes thousands of visual words, D is reduced. Fig. 4 illustrates the advantage of the SLAC algorithm.

5. Visual content indexing, similarity measure, and retrieval

Semantic visual word (u) representations of visual content are indexed using the inverted file scheme due to its simplicity, efficiency, and practical effectiveness (Witten, Moffat, & Bell, 1999; Zheng et al., 2008). The cosine similarity (Eq. (11)) is deployed to measure the similarity between the query and the stored visual content in a collection. Let {P_i}_{i=1}^{N} be the set of all visual content items in the collection. The similarity between the query (q) and the weighted u associated with a visual content item (p) in the collection is measured using the following inner product:

sim(p, q) = \frac{p \cdot q}{\lVert p \rVert\, \lVert q \rVert}.    (11)

The retrieved visual content is ranked in descending order according to its similarity value with the query image.
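A sketch of this indexing and ranking step is given below. The inverted file is reduced to a plain dictionary from visual word id to the content items containing it, which is an assumption made for brevity rather than a description of the authors' index structure; ranking uses the cosine similarity of Eq. (11).

```python
# Sketch: inverted index over BVW vectors and cosine-similarity ranking.
import numpy as np

def build_inverted_index(bvw_vectors):
    """bvw_vectors: dict content_id -> 1-D numpy histogram over visual words."""
    index = {}
    for cid, vec in bvw_vectors.items():
        for word_id in np.nonzero(vec)[0]:
            index.setdefault(int(word_id), set()).add(cid)
    return index

def retrieve(query_vec, bvw_vectors, index):
    # Only contents sharing at least one visual word with the query are scored.
    candidates = set().union(*(index.get(int(w), set())
                               for w in np.nonzero(query_vec)[0]))
    def cosine(p, q):
        return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))
    scores = {cid: cosine(bvw_vectors[cid], query_vec) for cid in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```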


Table 3
The computation cost comparison of two clustering algorithms to generate the 6987 visual words (including non-informative visual words) for the pole vault event.

Clustering algorithm    {c} (s)    {n} (s)    Ratio
x-means                 3624       1139       3.2
k-means                 4356       1613       2.7

Fig. 4. The SLAC algorithm classifies visual words A, B, C, D and E into two groups of u, {A, B} and {C, D, E}. In this example, visual words A and B alone cannot distinguish image (a) from image (b) because they share a visual similarity with words C and D: A is similar to C and B is similar to D. However, the combination of the visual word E with {C, D} can effectively distinguish the high jump event (a) from the pole vault event (b).

Table 2
The number of keypoints and representative keypoints for four different sport categories.

Sports        Keypoints {c}    Representative keypoints {n}    Decrease (%)
High jump     105,410          69,244                          34
Long jump     186,348          126,232                         32
Pole vault    245,328          168,467                         31
Swimming      112,389          73,592                          35
Average                                                        33


In order to evaluate the retrieval performance of u, the two classical measures used to evaluate the performance of information retrieval systems, precision and recall, are used. Let A denote all relevant documents (visual content items) in the collection. Let B denote the retrieved documents which the system returns for the user query.

• Precision is defined as the proportion of relevant documents in the retrieved document set.

Precision = \frac{|A \cap B|}{|B|}.    (12)

• Recall is defined as the proportion of all relevant documents in the collection that were returned by the system.

Recall = \frac{|A \cap B|}{|A|}.    (13)

Using precision-recall pairs, a so-called precision-recall diagram can be drawn that shows the precision values at different recall levels. In this paper, the retrieval performance is reported using the 11-point interpolated average precision graph (Manning, Raghavan, & Schütze, 2008). The interpolated precision P_interp at a certain recall level r is defined as the highest precision found for any recall level r' ≥ r:

P_{interp}(r) = \max_{r' \geq r} P(r').    (14)
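A minimal sketch of the 11-point interpolated average precision follows, directly from Eqs. (12)-(14): at each of the 11 standard recall levels, the highest precision at any recall r' ≥ r is taken and the values are averaged.

```python
# Sketch: 11-point interpolated average precision for one ranked result list.
import numpy as np

def interpolated_11pt_ap(relevance):
    """relevance: ranked list of 0/1 relevance judgements for one query."""
    relevance = np.asarray(relevance, dtype=float)
    total_relevant = relevance.sum()
    if total_relevant == 0:
        return 0.0
    precision = np.cumsum(relevance) / (np.arange(len(relevance)) + 1)  # Eq. (12)
    recall = np.cumsum(relevance) / total_relevant                      # Eq. (13)
    points = []
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        points.append(precision[mask].max() if mask.any() else 0.0)     # Eq. (14)
    return float(np.mean(points))
```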

1 Weka data mining tool; see www.cs.waikato.ac.nz/ml/weka/.

6. Experimental results and discussions

We evaluate the performance of our proposed technique, introduced in Section 4, against a sport image collection. The image collection contains 2000 images of four sport genres (high jump, long jump, pole vault, and swimming). We divide the image collection into two sets: 800 images (200 from each category) are selected for training, yielding a total of 259,790 keypoints, and the rest are used for testing, yielding 649,475 keypoints generated by the DoG algorithm; these keypoints are further processed to find the representative keypoints (n). The vector quantization technique is utilized to cluster the visual words based on n using the x-means clustering algorithm, which later produces the semantic visual words ({x̄}).

6.1. Evaluation of the noisy keypoint reduction and the representative keypoints

First, we compare the number of keypoints ({c}) with the number of representative keypoints ({n}) in order to investigate how much the technique presented in Section 4.1.1 can reduce the similar keypoints. This comparison is performed on the original number of {c} generated from the DoG algorithm and the number of {n}.

Second, we evaluate the computation cost of visual word construction for two algorithms, the k-means and x-means algorithms, using Weka.1 The two algorithms are used to cluster {c} and {n} from all categories. To illustrate this, we present the computation time for the pole vault data set, whose {c} and {n} are shown in Table 2. We first cluster the data with the x-means algorithm and then specify the k value in k-means to be equal to the number of clusters from the x-means run, so that both algorithms generate an equal number of visual words. From the results shown in Table 3, clustering {n} with the x-means algorithm is 3.2 times faster than clustering {c}, whereas with the k-means algorithm it is 2.7 times faster. This shows that the use of {n} instead of the original keypoints ({c}) significantly reduces the computation time.

However, the reduction of keypoints may degrade the categori-zation performance as well as the retrieval power. We study theseaspects in subsequent sections.

6.2. Evaluation of the non-informative visual word reduction

Before comparing the classification and retrieval performance, we illustrate the proportions of visual words which are generated from both {c} and {n} and are further processed to eliminate the non-informative visual words.


Table 4
The proportion of normal to representative semantic visual words using x-means clustering before and after the non-informative visual word {w} removal.

Keypoints                       {x̄}     {w}     {e} = {x̄} − {w}    Proportion ({w}/{x̄})
Normal keypoints {c}            6978    2622    4356               38%
Representative keypoints {n}    4893    971     3922               20%
Difference                      30%     63%     10%                –

{x̄} = visual words; {w} = non-informative visual words; {x̄} − {w} = informative visual words {e}.


The numbers of visual words shown in Table 4 are from all categories and were generated using the x-means clustering algorithm.

From Table 4, the number of {x̄} generated from {c} is much larger, by 30%, than the number of {x̄} generated from {n}. This indicates that using {c} without similar keypoint reduction in the clustering algorithm generates a greater number of noisy visual words, about 38%, and takes longer (Table 3) to construct {x̄}. In contrast, the use of {n} to construct {x̄} incurs less computational cost and generates a smaller number of {w}. This illustrates that similar keypoint detection enhances the clustering algorithm by reducing the number of clusters, leading to a smaller proportion of {w}, about 20%. The use of {n} reduces the number of {w} by 63%.

However, the difference in informative visual words ({e}) between the two techniques is only 10%. This demonstrates that representative keypoints ({n}) can produce semantic visual words ({x̄}) which contain a smaller amount of non-informative data ({w}) but still retain a number of informative visual words similar to the method using normal keypoints, while being computationally less expensive. Next, we study the effect of {w} removal on the classification performance. From this point, we call a set of visual words created from {n} a semantic visual words set {x̄}, and a set of visual words created from {c}, which includes {w}, a normal visual words set {a}. We compare the classification results between visual words with uninformative visual words {a + w}, visual words without uninformative visual words {a − w}, semantic visual words with uninformative visual words {x̄ + w}, and semantic visual words without uninformative visual words {x̄ − w}, using the SVM-RBF algorithm. The classification results are shown in Fig. 5.

Fig. 5 shows that the classification accuracy is heavily influenced by eliminating {w}. For example, in the pole vault event, the classification accuracy increases from 45% to 56% when {w} is removed from the normal visual words set ({a}), whereas the semantic visual words set ({x̄}) obtains an accuracy 2% higher still in the same category, and the improvement in classification accuracy follows a similar trend across all categories.

Fig. 5. The classification performance comparison of the normal visual words ({a}) and the semantic visual words ({x̄}) before and after removing {w}, using SVM-RBF.

This consistent improvement suggests that {w} plays an important role in strengthening the discriminative power. Although the large number of normal visual words ({a}) makes a feature more discriminative, {a} also makes the feature vector less generalizable and may contain more noise. A large number of {a} also increases the cost associated with clustering keypoints when computing visual-word features and when generating supervised classifiers.

6.3. Evaluation of visual content classification using the semantically similar visual words

Next, we study the classification performance using the associatively (semantically) similar visual words u. This classification has been tested after performing non-informative visual word removal, term weighting (LAC), and associatively similar visual word discovery (SLAC). Fig. 6 shows the performance of sport genre classification using three different classification algorithms: Naïve Bayes, SVM (linear), and SVM (RBF). For each classification method, two types of BVW model are deployed to compare classification performance: the semantic-preserving BVW model (SBVW) and the traditional BVW model (TBVW). SBVW refers to the bag-of-visual words model containing u, which preserves the semantic information among keypoints and visual words, whereas TBVW constructs visual words using the simple k-means clustering algorithm without preserving semantic information among keypoints and visual words. The evaluation criterion here is the mean average precision (MAP) collected from the Weka classification results. The experimental results (Fig. 6) show that the proposed model (SBVW), which classifies sport types based on u, is superior to the TBVW method in all cases (classifiers and sport genres). This suggests that the physical spatial information between keypoints, the term weighting, and the semantic distance of the SLAC scheme in vector space are significant in interlinking the semantic conceptualization of visual content with the visual features, allowing the system to efficiently categorize the visual data. In other words, semantically similar visual words (u), which are computed based on term weight and semantic distance in the vector space, can represent visual content more distinctively than traditional visual words. Specifically, SVM-linear outperforms the other categorization methods, producing the highest performance of up to 78% MAP in the high jump event. Among all sports, the high jump event obtains the highest classification accuracy. This could be because there are fewer objects in this sport event, e.g. athlete, bar, and foam mat. As a consequence, there is less noise among objects in the visual content for the high jump event compared to the other sports, which contain more objects. The categorization algorithm therefore seems to classify the high jump data more effectively compared to the other sport disciplines.

Fig. 6. The comparison of classification performance on four sport types between the semantic-preserving BVW (SBVW) method and the normal BVW method (NBVW).


Fig. 7. Retrieval efficiency comparison of the normal visual words ({a}) and the semantic visual words ({x̄}) before and after removing {w}.


6.4. Evaluation of content-based image retrieval based on the semantically similar visual words

In this section, we evaluate the retrieval performance based on the representation of u for visual content, using different numbers of visual words with respect to the information given in Table 4. Fig. 7 illustrates the retrieval performance for the four different sport genres. Overall, the retrieval performance using the semantic visual words without non-informative visual words ({x̄ − w}) outperforms the others in all sport categories. In other words, the semantically similar visual words (u) deliver superior results over the other methods using a more compact representation (a smaller number of visual words). Although retrieval performance does not seem to depend directly on the number of visual words, too many visual words can nevertheless bring poor results. This is because a large number of visual words might contain meaningless visual words, which can cause the retrieval mechanism to retrieve visual content that is not very semantically similar to the query. As a result, precision is reduced. In summary, we attribute the retrieval performance to two factors. First, preserving the semantic information throughout the visual word construction process leads to visual words that represent visual content more effectively. This semantic information makes the visual content distribution in the feature space more coherent and produces smaller variations within the same class. Second, the use of the associative visual words efficiently distinguishes visual content, and the Chi-square statistical model reduces the dimensionality of the visual words in the vector space, which helps to resolve the curse of dimensionality. As a result, these two factors enhance the retrieval performance.

7. Conclusions and future work

We presented a framework for visual content representation which has three main advantages. First, the use of representative keypoints reduces the number of keypoints in the feature space, consequently improving the clustering results (the quality of visual words) and reducing computation costs. Second, a method is proposed for generating semantic visual words based on the spatial information of keypoints, which provides the linkage between visual words and the high-level semantics of objects in visual content. Third, an approach to find semantically similar visual words to solve the visual heterogeneity problem is presented. Based on the experimental results, the proposed technique allows the bag-of-visual words (BVW) representation to be more distinctive and more compact than existing models. Furthermore, the proposed BVW model can capture the semantics of visual content efficiently, resulting in higher classification accuracy. However, the major disadvantage of the proposed technique is the computational complexity of the SLAC algorithm. We mitigate this problem by reducing the number of visual words (the dimensions in vector space) using the Chi-square model.

We are currently extending our technique to restructure the bag-of-visual-words model as a hierarchical, ontology-based model to



disambiguate visual word senses. The unstructured data of visual content is transformed into a hierarchical model which describes visual content more explicitly than the vector space model, using conceptual structures and relationships. Hence, this could aid information systems to interpret or understand the meaning of visual content more accurately, i.e., it helps in part to narrow the semantic gap.

Acknowledgement

This work has been supported by Science Ministry Research Funding, the Royal Thai Government, Thailand.
