Language-based Querying of Image Collections on the basis of an Extensible Ontology

Christopher Town
University of Cambridge, Computer Laboratory
15 JJ Thomson Avenue, Cambridge, UK

David Sinclair
Waimara Ltd
115 Ditton Walk, Cambridge, UK

Abstract

The design of a specialised query language for content based image retrieval (CBIR) provides a means of addressing many of the problems associated with commonly used query paradigms such as query-by-example and query-by-sketch. By basing such a language on an extensible ontology which encompasses both high-level and low-level image properties and relations, one can go a long way towards bridging the semantic gap between user models of saliency and relevance and those employed by a retrieval system.

This paper discusses these issues and illustrates the design and use of an ontological retrieval language through the example of the OQUEL query language. The retrieval process takes place entirely within the ontological domain defined by the syntax and semantics of the user query. Since the system does not rely on the pre-annotation of images with sentences in the language, the format of text queries is highly flexible. The language is also extensible to allow for the definition of higher level terms such as “cars”, “people”, etc. on the basis of existing language constructs through the use of Bayesian inference networks. The matching process utilises automatically extracted image segmentation and classification information and can incorporate any other feature extraction mechanisms or contextual knowledge available at processing time to satisfy a given user request.

Keywords

Image retrieval, query languages, ontologies, object recognition, language parsing

1 Introduction

Query mechanisms play a vital role in bridging the semantic gap [17] between users and retrieval systems. There has, however, been relatively little recent work addressing this issue in the context of content based image retrieval (CBIR). Most of the query interfaces implemented in current systems fall into a small group of approaches. In order to overcome the weaknesses of these methodologies, efforts have focussed on techniques such as relevance feedback ([52], [9], [41]) as a means of improving the composition and performance of a user query in light of an initial assessment of retrieval results. While this approach and other methods for improving the utility of user queries, such as query expansion and the combination of multiple query modalities, have shown some promise, they do so at the risk of increased user effort and lack of transparency in the retrieval process.

This paper presents the notion of an ontological query language as a powerful and flexible means of providing an integrated query and retrieval framework which addresses the problem of the semantic gap between user and system. Further background information and a general motivation for such query languages is given in section 2. Section 3 introduces the OQUEL language as a concrete example for retrieval from photographic image collections and discusses its basic design and structure. Section 4 describes the implementation of the language for the ICON content-based image retrieval system; those content extraction and representation facilities of ICON relevant to the present discussion are outlined in section 4.1. In section 5 the process of query interpretation and retrieval is described further. Section 6 gives quantitative performance results of OQUEL queries compared to other query modalities in the ICON system. The paper concludes with a summary in section 7, which also provides an outlook on further developments such as the potential role of natural language processing and possible extensions of the ontological language to video.

1.1 CBIR Query Mechanisms and the Retrieval Process

As has been noted elsewhere (e.g. [45]), research in content based image retrieval has in the past suffered from too much emphasis being placed on a system view of the retrieval process in terms of image processing, feature extraction, content representation, data storage, matching, etc. It has proven fruitful in the design of image retrieval systems to also consider the view of a user confronted with the task of expressing his or her retrieval requirements in order to get the desired results with the least amount of effort. While issues such as the visualisation of query results and facilities for relevance feedback and refinement are clearly important, this paper is primarily concerned with the mechanisms through which users express their queries.

Adopting a user perspective, one can summarise most of the query methods traditionally employed by CBIR systems (see [45] and [40] for further references) and highlight their drawbacks as follows:

• Query-by-example: ([48], [9], [24]) Finding suitable example images can be a challenge and may require the user to manually search for such images before being able to query the automated system. Even when the user can provide images which contain instances of the salient visual properties, content or configurations they would like to search for, it is very hard for the system to ascertain which aspects make a given image relevant and how similarity should be assessed. Many such systems therefore rely on extensive relevance feedback to guide the search towards desirable images, but this approach is not appropriate for most real-world retrieval scenarios. Many industrial applications of CBIR require ways of succinctly expressing abstract requirements which cannot be encapsulated by any particular sample image.

• Template, region selection, or sketch: ([7], [23], [6]) Rather than providing whole images, the user can draw (sometimes literally) the system’s attention to particular image aspects such as the spatial composition of desired content in terms of particular regions or a set of pre-defined templates. Clearly this process becomes cumbersome for complex queries and there are difficult user interface issues concerning how one might best represent abstract relations and invariants.

• Feature range or predicate: ([34], [33]) Here the user can set target ranges or thresholds for certain (typically low-level) attributes such as colour, shape, or texture features which may represent global image properties or features localised to certain image regions. While this clearly has merit for some types of queries, the approach requires a certain amount of user sophistication and patience.

• Annotation or document context: ([28], [46], [20]) Images rarely come with usable annotations for reasons such as cost, ambiguity, inconsistency, and human error. While progress has been made in applying text retrieval methods to annotations and other sources of image context such as captions, difficulties remain due to lack of availability, unreliability, and variability of such textual information.


• Query language or concept: ([8], [32], [41]) Efforts have been made to extend popular database query languages derived from SQL to cater for the intrinsic uncertainty involved in matching image features to assess relevance. However, such languages remain quite formal and rigid and are difficult to extend to higher-level concepts. Knowledge-based approaches utilising description logics or semantic networks have been proposed as a means of better representing semantic concepts but tend to entail somewhat cumbersome query interfaces.

Although these approaches have proven useful, both in isolation and when combined, in providing usable CBIR solutions for particular application domains and retrieval scenarios, much work remains to be done in providing query mechanisms that will scale and generalise to the applications envisaged for future mainstream content based access to multimedia. Indeed, one criticism one can generally level at image retrieval systems is the extent to which they require the user to model the notions of content representation and similarity employed by the system rather than vice versa. One reason for the failure of CBIR to gain widespread adoption is that mainstream users are quite unwilling to invest great effort into query composition [37], as many systems fail to perform in accordance with user expectations.

The language based query framework proposed in this paper aims to address these challenges. Query sentences are typically short (e.g. “people in centre”) yet conceptually rich. This is because they need only represent those aspects of the target image(s) which the user is trying to retrieve and which distinguish such images from others in the dataset. The user is therefore not required to translate a description of an envisaged target image into the language but merely (and crucially) to express desired properties which are to hold for the retrieved images.

1.2 Language-based Querying

This paper argues that query languages constitute an important avenue for further work in developing CBIR query mechanisms. Powerful and easy-to-use textual document retrieval systems have become pervasive and constitute one of the major driving forces behind the internet. Given that so many people are familiar with the use of simple keyword strings and regular expressions to retrieve documents from vast online collections, it seems natural to extend language based querying to multimedia data. However, it is important to recognise [47] that the natural primitives of document retrieval, words and phrases, carry with them inherently more semantic information and characterise document content in a much more redundant and high level way than the pixels and simple features found in images. This is why text retrieval has been so successful despite the relative simplicity of using statistical measures to represent content indicatively rather than substantively. Image retrieval addresses a much more complex and ambiguous challenge, which is why we argue strongly for a query method based on a language that can represent both the syntax and semantics of image content at different conceptual levels. This paper will show that by basing this language on an ontology one can capture both concrete and abstract relationships between salient image properties such as objects in a much more powerful way than with the relatively weak co-occurrence based knowledge representation facilities of classical information retrieval. Since the language is used to express queries rather than describe image content, such relationships can be represented explicitly without prior commitments to a particular interpretation or having to incur the combinatorial explosion of an exhaustive annotation of all the relations that may hold in a given image. Instead, only those image aspects which are of value in determining relevance given a particular query are evaluated, and evaluation may stop as soon as an image can be deemed irrelevant.

The comparatively small number of query languages designed for CBIR have largely failed to attain the standards necessary for general adoption. A major reason for this is the fact that most language or text based image retrieval systems rely on manual annotations, captions, document context, or pre-generated keywords, which leads to a loss of flexibility through the initial choice of annotation and indexing. Languages mainly concerned with deriving textual descriptions of image content [1] are inappropriate for general purpose retrieval since it is infeasible to generate exhaustive textual representations which contain all the information and levels of detail which might be required to process a given query in light of the user’s retrieval need. While keyword indexing of images in terms of descriptors for semantic content remains highly desirable, semi- or fully automated annotation is currently based on image document context [42] or limited to low level descriptors. More ambitious “user-in-the-loop” annotation systems still require a substantial amount of manual effort to derive meaningful annotations [51]. Formal query languages such as extensions of SQL [38] are limited in their expressive power and extensibility and require a certain level of user experience and sophistication.

In order to address the challenges mentioned above while keeping user search overheads to a minimum, we have developed the OQUEL query description language. It provides an extensible language framework based on a formally specified grammar and an extensible vocabulary which are derived from a general ontology of image content in terms of categories, objects, attributes, and relations. Words in the language represent predicates on image features and target content at different semantic levels and serve as nouns, adjectives, and prepositions. Sentences are prescriptions of desired characteristics which are to hold for relevant retrieved images. They can represent spatial, object compositional, and more abstract relationships between terms and sub-sentences. The language is portable to other image content representation systems in that the lower level words and the evaluation functions which act on them can be changed or re-implemented with little or no impact on the conceptually higher language elements. It is also extensible since new terms can be defined both on the basis of existing constructs and based on new sources of image knowledge and metadata. This enables the definition of customised ontologies of objects and abstract relations.

2 Ontological Language Framework

2.1 Role of Ontologies

By basing a retrieval language on an ontology, one can explicitly encode ontological commitments about the domain of interest in terms of categories, objects, attributes, and relations. Gruber [16] defines the term ontology in a knowledge sharing context as a “formal, explicit specification of a shared conceptualisation”. Ontologies encode the relational structure of concepts which one can use to describe and reason about aspects of the world. Ontology is the theory of objects in terms of the criteria which allow one to distinguish between different types of objects and the relations, dependencies, and properties through which they may be described. For the present purposes, ontologies provide representations of image content at different semantic levels and of queries expressing desired image characteristics.

Sentences in a language built by means of an ontology can be regarded as active representational constructs of information as knowledge, and there have been similar approaches in the past applying knowledge-based techniques such as description logics ([18], [2], [3]) to CBIR. However, in many such cases the knowledge based relational constructs are simply translated into equivalent database query statements such as SQL [19], or a potentially expensive software agent methodology is employed for the retrieval process [10]. This mapping of ontological structures onto real-world evidence can be implemented in a variety of ways. Common approaches are heavily influenced by methods such as description logics, frame-based systems, and Bayesian inference [13].

This paper argues that the role of a query language for CBIR should be primarily prescriptive, i.e. a sentence is regarded as a description of a user’s retrieval requirements which cannot easily be mapped onto the description of image content available to the system. While the language presented here is designed from a general ontology which determines its lexical and syntactic elements to represent objects, attributes, and relations, this does not in itself constitute a commitment to a particular scheme for determining the semantic interpretation of any given query sentence. The evaluation of queries will depend on the makeup of the query itself, the indexing information available for each image, and the overall retrieval context. Evaluation therefore takes place within a particular ontological domain specified by the composition of the query and the available image evidence at the time it is processed. This approach is consistent with the view expressed in e.g. [41] that the meaning of an image is an emergent property which depends on both the query context and the image set over which the query is posed. Figure 1 shows how the ontological query language and the mechanisms for its interpretation can thus be regarded as acting as an intermediary between user and retrieval system in order to reduce the semantic gap.

Figure 1: Model of the retrieval process using an ontological query language to bridge the semantic gap between user and system notions of content and similarity.

2.2 Query Specific Image Interpretation

The important distinction between query description and image description languages is founded on the principle that while a given picture may well say more than a thousand words, a short query sentence expressed in a sufficiently powerful language can adequately describe those image properties which are relevant to a particular query. Information theoretic measures can then be applied to optimise a given query by identifying those of its elements which have high discriminative power to iteratively narrow down the search to a small number of candidate images. Hence it is the query itself which is taken as evidence for the relevance assessment measures appropriate to the user’s retrieval requirement and “point of view”. The syntax and semantics of the query sentence composed by the user thereby define the particular ontological domain in which the search for relevant images takes place. This is inherently a far more powerful way of relating image semantics to user requests than static image annotation which, even when carried out by skilled human annotators, will always fall far short of encapsulating those aspects and relationships which are of particular value in characterising an image in light of a new query.

The use of ontologies also offers the advantage of bridging between high-level concepts and low-level primitives in a way which allows extensions to the language to be defined on the basis of existing constructs without having to alter the representation of image data. Queries can thus span a range of conceptual granularity from concrete image features (regions, colour, shape, texture) and concrete relations (feature distance, spatial proximity, size) to abstract content descriptors (objects, scene descriptions) and abstract relations (similarity, class membership, inferred object and scene composition). The ability to automatically infer the presence of high-level concepts (e.g. a beach scene) on the basis of evidence (colour, region classification, composition) requires techniques such as Bayesian inference, which plays an increasing role in semantic content derivation [50]. By expressing the causal relationships used to integrate multiple sources of evidence and content modalities in a dependency graph, such methods are also of great utility in quickly eliminating improbable configurations and thus narrowing down the search to a rapidly decreasing number of images which are potentially relevant to the query.


3 Language Design and Structure

This section introduces the OQUEL ontological query language with particular reference to its current implementation as a query description language for the ICON content based image retrieval system. For reasons of clarity, only a high level description of the language will be presented here. Section 4 will discuss implementation details pertaining to the content extraction and representation schemes used in the system and show how tokens in the language are mapped onto concrete image properties. Section 5 will show how query sentences are processed to assess image relevance.

3.1 Overview and Design Principles

The primary aim in designing OQUEL has been to provide both ordinary users and professional image archivists with an intuitive and highly versatile means of expressing their retrieval requirements through the use of familiar natural language words and a straightforward syntax. As mentioned above, many query languages have traditionally followed a path set out by database languages such as SQL, which are characterised by a fairly sparse and restrictive grammatical framework aimed at facilitating concise and well-defined queries. The advantages of such an approach are many, e.g. ease of machine interpretation, availability of query optimisation techniques, scalability, theoretical analysis, etc. However, their appropriateness and applicability to a domain of such intrinsic ambiguity and uncertainty as image retrieval remain doubtful. OQUEL was therefore designed to provide greater naturalness and flexibility through the use of a more complex grammar bearing a resemblance to natural language on a restricted domain.

3.2 Syntax and Semantics

In order to allow users to enter both simple keyword phrases and arbitrarily complex compound queries, the language grammar features constructs such as predicates, relations, conjunctions, and a specification syntax for image content. The latter includes adjectives for image region properties (i.e. shape, colour, and texture) and both relative and absolute object location. Desired image content can be denoted by nouns such as labels for automatically recognised visual categories of stuff (“grass”, “cloth”, “sky”, etc.) and through the use of derived higher level terms for composite objects and scene description (e.g. “animals”, “vegetation”, “winter scene”). This includes the simple morphological distinction between singular and plural forms of certain terms, hence “people” will be evaluated differently from “person”.

Tokens serving as adjectives denoting desired image properties are parameterised to enable values and ranges to be specified. The use of defaults, terms representing fuzzy value sets, and simple rules for operator precedence and associativity help to reduce the effective complexity of query sentences and limit the need for special syntax such as brackets to disambiguate grouping. Brackets can however optionally be used to define the scope of the logical operators (not, and, or, xor) and are required in rare cases to prevent the language from being context sensitive in the grammar theory sense.

While the inherent sophistication of the OQUEL language enables advanced users to specify extremely detailed queries if desired, much of this complexity is hidden by a versatile query parser. The parser was constructed with the aid of the SableCC lexer/parser generator tool from LALR(1) grammar rules and the WordNet [27] lexical database, as further described in the next section. The vocabulary of the language is based on an annotated thesaurus of several hundred natural language words, phrases, and abbreviations (e.g. “!” for “not”, “,” for “and”) which are recognised as tokens. Token recognition takes place in a lexical analysis step prior to syntax parsing to reduce the complexity of the grammar. This also makes it possible to provide more advanced word-sense disambiguation and analysis of phrasal structure while keeping the language efficiently LALR(1) parsable.
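As an illustration of this lexical analysis step, the following sketch (not part of the original system; class, method names, and thesaurus entries are hypothetical) shows how a thesaurus-driven tokeniser might normalise abbreviations and synonyms into canonical OQUEL tokens before the LALR(1) parser is invoked:

import java.util.*;

// Minimal sketch of OQUEL lexical analysis: surface words, synonyms and
// abbreviations are mapped onto canonical token names prior to parsing.
class OquelTokeniser {
    private final Map<String, String> thesaurus = new HashMap<>();

    OquelTokeniser() {
        // a few illustrative entries; the real thesaurus holds several hundred words
        thesaurus.put("!", "not");
        thesaurus.put(",", "and");
        thesaurus.put("grass", "visualcategory:grass");
        thesaurus.put("sky", "visualcategory:sky");
        thesaurus.put("vegetation", "semanticcategory:vegetation");
    }

    List<String> tokenise(String query) {
        // separate punctuation abbreviations, then map each word onto its token
        String s = query.toLowerCase().replace(",", " , ").replace("!", " ! ");
        List<String> tokens = new ArrayList<>();
        for (String word : s.trim().split("\\s+")) {
            tokens.add(thesaurus.getOrDefault(word, word));
        }
        return tokens;
    }
}

// Example: new OquelTokeniser().tokenise("grass, !sky")
// yields [visualcategory:grass, and, not, visualcategory:sky]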

The following gives a somewhat simplified high level context free EBNF-style grammar G of the OQUEL language as currently implemented in the ICON system (capitals denote lexical categories, lower case strings are tokens or token sets).

G : {
  S  → R
  R  → modifier? (scenedescriptor | SB | BR) | not? R (CB R)?
  BR → SB binaryrelation SB
  SB → (CS | PS)+ LS*
  CS → visualcategory | semanticcategory | not? CS (CB CS)?
  LS → location | not? LS (CB LS)?
  PS → shapedescriptor | colourdescriptor | sizedescriptor | not? PS (CB PS)?
  CB → and | or | xor
}

The major syntactic categories are:

• S: start symbol of the sentence (text query)

• R: requirement (a query consists of one or more requirements which are evaluated separately, the probabilities of relevance then being combined according to the logical operators)

• BR: binary relation on SBs

• SB: specification block consisting of at least one CS or PS and 0 or more LS

• CS: image content specifier

• LS: location specifier for regions meeting the CS/PS

• PS: region property specifier (visual properties of regions such as colour, shape, texture, and size)

• CB: binary (fuzzy) logical connective (conjunction, disjunction, and exclusive-OR)

Tokens (terminals) belong to the following sets:

• modifier: Quantifiers such as “a lot of”, “none”, “as much as possible”.

• scenedescriptor: Categories of image content characterising an entire image, e.g. countryside, city, indoors.

• binaryrelation: Relationships which are to hold between clusters of target content denoted by specification blocks. The current implementation includes spatial relationships such as “larger than”, “close to”, “similar size as”, “above”, etc. and some more abstract relations such as “similar content”.

• visualcategory: Categories of stuff, e.g. water, skin, cloud.

• semanticcategory: Higher semantic categories such as people, vehicles, animals.

• location: Desired location of image content matching the content or shape specification, e.g. “background”, “lower half”, “top right corner”.

• shapedescriptor: Region shape properties, for example “straight line”, “blob shaped”.

• colourdescriptor: Region colour specified either numerically or through the use of adjectives and nouns, e.g. “bright red”, “dark green”, “vivid colours”.

• sizedescriptor: Desired size of regions matching the other criteria in a requirement, e.g. “at least 10%” (of image area), “largest region”.

The precise semantics of these constructs are dependent upon the way in which the query language is implemented, the parsing algorithm, and the user query itself, as will be described in the following sections.
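To make the mapping from grammar G to a query representation concrete, the following sketch (hypothetical class and field names, not the actual ICON implementation) shows one way the syntactic categories could be represented as nodes of an abstract syntax tree:

import java.util.*;

// Sketch of an AST node for OQUEL queries; one node per grammar category.
class QueryNode {
    // syntactic categories from grammar G plus a TOKEN kind for terminals
    enum Kind { S, R, BR, SB, CS, LS, PS, CB, TOKEN }

    final Kind kind;
    final String token;                    // terminal text (e.g. "grass", "and"), null for non-terminals
    final List<QueryNode> children = new ArrayList<>();

    QueryNode(Kind kind, String token) {
        this.kind = kind;
        this.token = token;
    }

    QueryNode add(QueryNode child) {
        children.add(child);
        return this;
    }
}

Under this representation, a query such as “grass and sky” would parse to an S node whose R child holds a single SB containing a compound CS: two visualcategory leaves joined by the CB token “and”.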

3.3 Vocabulary

As shown in the previous section, OQUEL features a generic base vocabulary built on extracted image features and intermediate level content labels which can be assigned to segmented image regions on the basis of such features. Some terminal symbols of the language therefore correspond directly to previously extracted image descriptors. This base vocabulary has been extended and remains extensible by derived terms denoting higher level objects and concepts which can be inferred at query time. While the current OQUEL implementation is geared towards general purpose image retrieval from photographic image collections, task specific vocabulary extensions can also be envisaged.

In order to provide a rich thesaurus of synonyms and also capture some more complex relations and semantic hierarchies of words and word senses, we have made use of the lexical information present in the WordNet [27] electronic dictionary. This contains a large vocabulary which has been systematically annotated with word sense information and relationships such as synonyms, antonyms, hyper- and hyponyms, meronyms, etc. Currently we have used some of this information to define a thesaurus of about 400 words relating to the extracted image features and semantic descriptors mentioned above.

Work has begun on improving the flexibility of the OQUEL retrieval language by adding a pre-processing stage to the current query parser. This will use additional semantic associations and word relationships encoded in the WordNet database to provide much greater expressive power and ease syntactical constraints. Such a move may require a more flexible natural-language oriented parsing strategy to cope with the additional difficulty of word-sense and query structure disambiguation, but will also pave the way for future work on using the language as a powerful representational device for content-based knowledge extraction.

3.4 Example sentences

The following are examples of valid OQUEL queries as used in conjunction with ICON:

some sky which is close to buildings in upper corner

some water in the bottom half which is surrounded by trees and grass, size at least 10%

[indoors] & [people]

some green or vividly coloured vegetation in the centre which is of similar size as clouds or blue sky at the top

[artificial stuff, vivid colours and straight lines] and tarmac

4 Implementation of the OQUEL Language

4.1 Content Extraction and Representation

ICON (Image Content Organisation and Navigation, [21]) combines a cross-platform Java user interface with image processing and content analysis functionality to facilitate automated organisation of and retrieval from large heterogeneous image sets based on both metadata and visual content.

The backend image processing components extract various types of content descriptors and metadata from images (see [49]). The following are currently used in conjunction with OQUEL queries:

Image segmentation: Images are segmented into non-overlapping regions and sets of properties for size, colour, shape, and texture are computed for each region [43, 44]. Initially full three colour edge detection is performed using the weighted total change

dT = dI_i^2 + dI_j^2 + 3.0\,dC    (1)

where the total change in intensity dI_i is given by the colour derivatives in RGB space

dI_i = dR_i + dG_i + dB_i    (2)

and the magnitude of change in colour is represented by

dC = \sqrt{(dB_i - dG_i)^2 + (dR_i - dB_i)^2 + (dG_i - dR_i)^2 + (dB_j - dG_j)^2 + (dR_j - dB_j)^2 + (dG_j - dR_j)^2}.    (3)


Figure 2: From top to bottom: Example colour image from the Corel image library. Full three colour derivative of the image. Voronoi image computed from the edges found by the three colour edge detector (the darker a pixel, the further it is from an edge). Final region segmentation (boundaries indicated in blue).

The choice of 3 as the weighting factor in favour of colour change over brightness change is empirical but has been found to be effective across a very broad range of photographic images and artwork. Local orientation (for use in the non-maximum suppression step of the edge detection process) is defined to be in the direction of the maximum colour gradient. dT is then the edge strength fed to the non-max suppression and hysteresis edge-following steps, which follow the popular method due to Canny.
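The edge-strength computation in equations (1)-(3) is simple to implement; the sketch below (illustrative only, with hypothetical helper names and no claim to match the original code) computes dT at one pixel from finite-difference colour derivatives in the i (row) and j (column) directions:

// Sketch: weighted total change dT at pixel (x, y) per equations (1)-(3).
// rgb[y][x] holds {R, G, B}; derivatives are simple forward differences.
static double edgeStrength(double[][][] rgb, int x, int y) {
    double dRi = rgb[y + 1][x][0] - rgb[y][x][0];   // vertical (i) derivatives
    double dGi = rgb[y + 1][x][1] - rgb[y][x][1];
    double dBi = rgb[y + 1][x][2] - rgb[y][x][2];
    double dRj = rgb[y][x + 1][0] - rgb[y][x][0];   // horizontal (j) derivatives
    double dGj = rgb[y][x + 1][1] - rgb[y][x][1];
    double dBj = rgb[y][x + 1][2] - rgb[y][x][2];

    double dIi = dRi + dGi + dBi;                   // eq. (2): total intensity change
    double dIj = dRj + dGj + dBj;
    double dC = Math.sqrt(                          // eq. (3): colour change magnitude
            (dBi - dGi) * (dBi - dGi) + (dRi - dBi) * (dRi - dBi) + (dGi - dRi) * (dGi - dRi)
          + (dBj - dGj) * (dBj - dGj) + (dRj - dBj) * (dRj - dBj) + (dGj - dRj) * (dGj - dRj));

    return dIi * dIi + dIj * dIj + 3.0 * dC;        // eq. (1): weighted total change
}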

Voronoi seed points for region growing are generated from the peaks in the distance transform of the edge image, and regions are then grown agglomeratively from seed points with gates on colour difference with respect to the boundary colour and mean colour across the region. Unassigned pixels at the boundaries of a growing region are incorporated into a region if the difference in colour between it and pixels in the local neighbourhood of the region is less than one threshold and the difference in colour between the candidate and the mean colour of the region is less than a second, larger threshold.

A texture model based on discrete ridge features is also used to describe regions in terms of texture feature orientation and density. Ridge pixels are those for which the magnitude of the second derivative operator applied to a grey-scale version of the original image exceeds a threshold. The network of ridges is then broken up into compact 30 pixel feature groups and the orientation of each feature is computed from the second moment about its centre of mass. Features are clustered using Euclidean distance in RGB space and the resulting clusters are then employed to unify regions which share significant portions of the same feature cluster. The internal brightness structure of “smooth” (largely untextured) regions in terms of their isobrightness contours and intensity gradients is used to derive a parameterisation of brightness variation which allows shading phenomena such as bowls, ridges, folds, and slopes to be identified. A histogram representation of colour covariance and shape features is computed for regions above a certain size threshold.

The segmentation scheme then returns a region map together with internal region description parameters comprising colour, colour covariance, shape, texture, location, and adjacency relations. Segmentation does not rely on large banks of filters to estimate local image properties and hence is fast (typically a few seconds for high resolution digital photographs) and does not suffer from the problem of the boundary between two regions appearing as a region itself. The region growing technique effectively surmounts the problem of broken edge topology and the texture feature based region unification step ensures that textured regions are not fractured. Figure 2 shows a sample image from the Corel picture library and the results of segmentation. The number of segmented regions depends on image size and visual content, but has the desirable property that most of the image area is commonly contained within a few dozen regions which closely correspond to the salient features of the picture at the conceptual granularity of the semantic categories used here.

Figure 3: Eye detection process: a) Binarised image. b) Hausdorff clustered regions after filtering. c) Normalised feature cluster of the left eye (left) and distance transforms for 6 feature orientations (blue areas are further from feature points). d) Examples of left eyes correctly classified using nearest neighbours. e) Examples of nearest neighbour clusters for non-eyes.

Stuff classification: Region descriptors computed from the segmentation algorithm are fed into artificial neural network classifiers which have been trained to label regions with class membership probabilities for a set of 12 semantically meaningful visual categories of “stuff” (“Brick”, “Blue sky”, “Cloth”, “Cloudy sky”, “Grass”, “Internal walls”, “Skin”, “Snow”, “Tarmac”, “Trees”, “Water”, and “Wood”). The classifiers are MLP (multi layer perceptron) and RBF (radial basis function) networks whose topology was optimised to yield best generalisation performance for each particular visual category using separate training, testing, and validation sets from a large (over 40000 exemplars) corpus of manually labelled image regions. The MLP networks typically consist of three hidden layers with progressively smaller numbers of neurons (up to 250) in each layer.

Automatic labelling of segmented image regions with semantic visual categories [49] such as grass or water which mirror aspects of human perception allows the implementation of intuitive and versatile query composition methods while greatly reducing the search space. The current set of categories was chosen to facilitate robust classification of general photographic images. These categories are by no means exhaustive but represent a first step towards identifying fairly low level semantic properties of image regions which can be used to ground higher level concepts and content prescriptions. Various psychophysical studies [9, 29] have shown that semantic descriptors such as these serve as useful cues for determining image content by humans and CBIR systems. An attempt was made to include categories which allow one to distinguish between indoor and outdoor scenes. Experiments on both the Corel photo set and a large body of amateur digital photographs have given classification success rates of between 82% and 96% for the largest image regions, which jointly cover over 80% of image area.

Colour descriptors: Nearest-neighbour colour classifiers were built from the region colour representation. These use the Earth-mover distance measure applied to Euclidean distances in RGB space to compare region colour profiles with cluster templates learned from a training set. In a manner similar to related approaches such as [34, 29], colour classifiers were constructed for each of twelve “basic” colours (“black”, “blue”, “cyan”, “grey”, “green”, “magenta”, “orange”, “pink”, “red”, “white”, “yellow”, “brown”). Each region is associated with the colour labels which best describe it.
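For illustration, the sketch below assigns a basic colour label to a region by nearest-centroid matching in RGB space. This is a deliberate simplification: the actual classifiers described above compare full region colour profiles with learned cluster templates using the Earth-mover distance, and the centroid values given here are merely plausible placeholders.

import java.util.*;

// Simplified sketch of colour labelling: nearest RGB centroid, not the
// Earth-mover-distance profile matching used by the real classifiers.
class ColourLabeller {
    private final Map<String, double[]> centroids = new LinkedHashMap<>();

    ColourLabeller() {
        centroids.put("black", new double[]{0, 0, 0});        // illustrative values
        centroids.put("white", new double[]{255, 255, 255});
        centroids.put("red",   new double[]{200, 30, 30});
        centroids.put("green", new double[]{40, 160, 40});
        centroids.put("blue",  new double[]{40, 60, 200});
        // ... remaining basic colours omitted for brevity
    }

    String label(double[] meanRgb) {
        String best = null;
        double bestDist = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> e : centroids.entrySet()) {
            double d = 0;
            for (int c = 0; c < 3; c++) {
                double diff = meanRgb[c] - e.getValue()[c];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = e.getKey(); }
        }
        return best;
    }
}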

Face detection: Face detection relies on identifying elliptical regions (or clusters of regions) classified as human skin. A binarisation transform is then performed on a smoothed version of the image. Candidate regions are clustered based on a Hausdorff distance measure [39] and the resulting clusters are filtered by size and overall shape and normalised for orientation and scale. From this a spatially indexed oriented shape model is derived by means of a distance transform of 6 different orientations of edge-like components from the clusters via pairwise geometric histogram binning [12]. A nearest-neighbour shape classifier was trained to recognise eyes. See figure 3 for an illustration of the approach.

Adjacent image regions classified as human skin in which eye candidates have been identified are then labelled as containing (or being part of) one or more human faces, subject to the scale factor implied by the separation of the eyes. This detection scheme shows robustness across a large range of scales, orientations, and lighting conditions but suffers from false positives. Recently an integrated face detector based on a two level classifier of polynomial kernel SVMs (Support Vector Machines) has been implemented. For reasons of efficiency, this detector is applied only to face candidates detected by the previously described method in order to greatly reduce the false positive rate while retaining high accuracy (about 72% correct detections and 0.08 false positives per image for the diverse image collection employed in section 6).

Region mask: Canonical representation of the segmented image giving the absolute location of each region by mapping pixel locations onto region identifiers. The mask stores an index value into the array of regions in order to indicate which region each pixel belongs to. For space efficiency this is stored in a run length encoded representation.
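A minimal sketch of such a run length encoding is given below (the actual ICON storage format is not specified in the paper); each row of the mask is stored as (regionIndex, runLength) pairs:

import java.util.*;

// Sketch: run length encoding of one row of the region mask.
// Input:  row of region indices, e.g. {3, 3, 3, 7, 7, 1}
// Output: flattened (regionIndex, runLength) pairs, e.g. {3,3, 7,2, 1,1}
static int[] encodeRow(int[] row) {
    List<Integer> runs = new ArrayList<>();
    int i = 0;
    while (i < row.length) {
        int region = row[i];
        int length = 0;
        while (i < row.length && row[i] == region) { i++; length++; }
        runs.add(region);
        runs.add(length);
    }
    return runs.stream().mapToInt(Integer::intValue).toArray();
}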

Region graph: Graph of the relative spatial relationships of the regions (distance, adjacency, joint boundary, and containment). Distance is defined in terms of the Euclidean distance between centres of gravity, adjacency is a binary property denoting that regions share a common boundary segment, and the joint boundary property gives the relative proportion of region boundary shared by adjacent regions. A region A is said to be contained within region B if A shares 100% of its boundary with B. Together with the simple parameterisation of region shape computed by the segmentation method, this provides an efficient (if non-exact) representation of the geometric relationships between image regions.
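The edge attributes just listed suggest a straightforward data structure; the sketch below is one possible layout (field names are hypothetical) with the containment rule expressed directly over the joint boundary proportion:

// Sketch of a region graph edge holding the spatial relationships described above.
class RegionEdge {
    int regionA, regionB;       // indices into the region array
    double distance;            // Euclidean distance between centres of gravity
    boolean adjacent;           // true if the regions share a boundary segment
    double jointBoundaryA;      // fraction of A's boundary shared with B (0..1)
    double jointBoundaryB;      // fraction of B's boundary shared with A (0..1)

    // A is contained within B if it shares 100% of its boundary with B
    boolean aContainedInB() {
        return adjacent && jointBoundaryA >= 1.0;
    }
}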

Figure 4: Three level grid pyramid which subdivides an image into different numbers of fixed grids (1, 5, 64) at each level.

Grid pyramid: The proportion of image content which has been positively classified with each particular label (visual category, colour, and presence of faces) at different levels of an image pyramid (whole image, image fifths, 8x8 chess grid, see figure 4). For each grid element we therefore have a vector of percentages for the 12 stuff categories, the 12 colour labels, and the percentage of content deemed to be part of a human face. Grid regions are generally of the same area and rectangular shape, except in the case of the image fifths, where the central rectangular fifth occupies 25% of image area and is often given a higher weighting for scene characterisation to reflect the fact that this region is likely to constitute the most salient part of the image.
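The description above translates directly into a small data structure; the sketch below (hypothetical names, fixed to the 1/5/64 cell layout of figure 4) stores one label-percentage vector per grid cell at each pyramid level:

// Sketch of the three level grid pyramid: each cell holds the percentage of
// its area assigned to each stuff category, each colour label, and to faces.
class GridPyramid {
    static final int[] CELLS_PER_LEVEL = {1, 5, 64};   // whole image, fifths, 8x8 grid
    static final int NUM_STUFF = 12, NUM_COLOURS = 12;

    // cellStats[level][cell] = percentages: 12 stuff + 12 colours + 1 face entry
    final double[][][] cellStats = new double[3][][];

    GridPyramid() {
        for (int level = 0; level < 3; level++) {
            cellStats[level] = new double[CELLS_PER_LEVEL[level]][NUM_STUFF + NUM_COLOURS + 1];
        }
    }

    double stuffPercentage(int level, int cell, int categoryIndex) {
        return cellStats[level][cell][categoryIndex];
    }
}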

Through the relationship graph representation of regions we can make the matching of clusters of regions invariant with respect to displacement and rotation using standard matching algorithms [36]. The grid pyramid and region mask representations allow an efficient comparison of absolute position and size.

This may be regarded as an intermediate level representation which does not preclude additional stages of visual inference and composite object recognition in light of query specific saliency measures and the integration of contextual information. Such intermediate level semantic descriptors for image content have been used by several CBIR systems in recent years ([26], [5], [14], [22]).

Figure 5: Simplified Bayesian network for the scene descriptor “winter”.

4.2 Grounding the Vocabulary

An important aspect of the OQUEL language implementation concerns the way in which sentences in the language are grounded in the image domain. Here we discuss those elements of the token set which might be regarded as being statically grounded, i.e. there exists a straightforward mapping from OQUEL words to extracted image properties as described in section 4.1. Other terminals (modifiers, scene descriptors, binary relations, and semantic categories) and syntactic constructs are evaluated by the query parser as will be discussed in section 5.

visualcategory: The 12 categories of stuff which have been assigned to segmented image regions by the neural net classifiers. Assignment of category labels to image regions is based on a threshold applied to the classifier output.

location: Location specifiers which are simply mapped onto the grid pyramid representation. For example, when searching for “grass” in the “bottom left” part of an image, only content in the lower left image fifth will be considered.

shapedescriptor: The current terms are “straight line”, “vertical”, “horizontal”, “stripe”, “right angle”, “top edge”, “left edge”, “right edge”, “bottom edge”, “polygonal”, and “blobs”. They are defined as predicates over region properties and aspects of the region graph representation derived from the image segmentation. For example, a region is deemed to be a straight line if its shape is well approximated by a thin rectangle, “right edge” corresponds to a shape appearing along the right edge of the image, and “blobs” are regions with highly amorphous shape without straight line segments.

colourdescriptor: Region colour specified either numerically in the RGB or HSV colour space or through colour labels assigned by the nearest-neighbour classifiers. By assessing the overall brightness and contrast properties of a region using fixed thresholds, colours identified by each classifier can be further described by a set of three “colour modifiers” (“bright”, “dark”, “faded”).

sizedescriptor: The size of image content matching other aspects of a query is assessed by adding the areas of the corresponding regions. Size may be defined as a percentage value of image area (“at least x%”, “at most x%”, “between x% and y%”) or relative to other image parts (e.g. “largest”, “smallest”, “bigger than”).
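As an example of how such statically grounded terms might be evaluated, the sketch below (illustrative only; the method name is hypothetical) implements an “at least x%” size predicate by summing the areas of the candidate regions:

import java.util.List;

// Sketch: evaluate the size descriptor "at least minPercent%" over the
// regions that matched the other criteria of a requirement.
static boolean atLeastPercent(List<Double> matchedRegionAreas, double imageArea, double minPercent) {
    double total = 0;
    for (double area : matchedRegionAreas) {
        total += area;                       // add up areas of matching regions
    }
    return (100.0 * total / imageArea) >= minPercent;
}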

4.3 System Integration

A general query methodology for content based image and multimedia retrieval must take into account the differences in potential application domains and system environments. Great care was therefore taken in the design of the OQUEL language to make it possible to integrate it with existing database infrastructure and content analysis facilities. This portability was achieved by a component-based software development approach which allows individual matching modules to be re-implemented to cater for alternative content representation schemes without affecting the overall semantics of the query language. This facility also makes it possible to evaluate a particular query differently depending on the current retrieval context.

The implementation of OQUEL also remains extensible. New terms can be represented on the basis of existing constructs as macro definitions. Simple lexical extensions are handled by a tokeniser and do not require any modifications to the query parser. Novel concepts can also be introduced by writing an appropriate software module (a Java class extending an interface or derived by inheritance) and plugging it into the existing language model. While an extension of the language syntax requires recompilation of the grammar specification, individual components of the language are largely independent and may be re-specified without affecting other parts. Furthermore, translation modules can be defined to optimise query evaluation or transform part of the query into an alternative format (e.g. a sequence of pre-processed SQL statements).
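The paper does not specify the interface such a plug-in module implements; the sketch below shows what a minimal version could look like (interface, class, and method names are assumptions made purely for illustration):

// Hypothetical plug-in interface for adding a new OQUEL concept: a module
// evaluates its term against one image and returns a probability of relevance.
interface ConceptEvaluator {
    String termName();                          // e.g. "vehicles"
    double evaluate(ImageDescriptors image);    // probability in [0, 1]
}

// Placeholder for the per-image content descriptors (segmentation, stuff
// labels, grid pyramid, ...) described in section 4.1.
class ImageDescriptors { }

// Example module: a derived concept defined over existing descriptors.
class VehicleEvaluator implements ConceptEvaluator {
    public String termName() { return "vehicles"; }
    public double evaluate(ImageDescriptors image) {
        // combine evidence (e.g. tarmac regions, compact artificial shapes) here
        return 0.0;                             // stub
    }
}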

As will be discussed in the next section, the query text parser was designed to hide grammatical complexity (“what you don’t know can’t hurt you”) and provide a natural language like tool for query composition. There is also a forms-based interface which offers the look and feel of graphical database interfaces and explicitly exposes available language features while being slightly restricted in the type of queries it can handle. Lastly, there is a graphical tool which allows users to inspect or modify a simplified abstract syntax tree (AST) representation of a query.

5 Retrieval Process

This section discusses the OQUEL retrieval process as implemented in the ICON system. In the first stage, the syntax tree derived from the query is parsed top down and the leaf nodes are evaluated in light of their predecessors and siblings. Information then propagates back up the tree until we arrive at a single probability of relevance for the entire image. At the lowest level, tokens map directly or very simply onto the content descriptors. Higher level terms are either expanded into sentence representations or evaluated using Bayesian graphs. For example, when looking for people in an image the system will analyse the presence and spatial composition of appropriate clusters of relevant stuff (cloth, skin, hair) and relate this to the output of face and eye spotters. This evidence is then combined probabilistically to yield a likelihood estimate of whether people are present in the image. Figure 5 shows a simplified Bayesian network for the scene descriptor “winter”. Arrows denote conditional dependencies and terminal nodes correspond to sources of evidence or, in the case of the term “outdoors”, other Bayesian nets.
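The overall control flow can be sketched as a recursive traversal of the query syntax tree; the outline below reuses the hypothetical QueryNode and ImageDescriptors classes sketched earlier and is a simplification, not the actual ICON evaluator (which also handles modifiers, locations, relations, and Bayesian network expansion):

// Sketch: evaluation of an OQUEL syntax tree against one image, returning
// a probability of relevance in [0, 1].
static double evaluate(QueryNode node, ImageDescriptors image) {
    switch (node.kind) {
        case TOKEN: {
            // leaf: map the token onto extracted content descriptors
            return evaluateLeaf(node.token, image);
        }
        case CB: {
            // fuzzy logical connective applied to its two operands
            double p1 = evaluate(node.children.get(0), image);
            double p2 = evaluate(node.children.get(1), image);
            return combine(node.token, p1, p2);   // "and", "or", "xor"
        }
        default: {
            // other categories: evaluate children and propagate the weakest
            // evidence upwards (a simplification of the message passing scheme)
            double p = 1.0;
            for (QueryNode child : node.children) {
                p = Math.min(p, evaluate(child, image));
            }
            return p;
        }
    }
}

// Stand-ins for the descriptor matching and fuzzy-logic rules of sections 4.2 and 5.3.
static double evaluateLeaf(String token, ImageDescriptors image) { return 0.0; }
static double combine(String op, double p1, double p2) { return Math.min(p1, p2); }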

5.1 Query-time Object and Scene Recognition for Retrieval

In this paper we have argued that in order to come closer to capturing the semantic “essence” of an image, tasks such as feature grouping and object identification need to be approached in an adaptive, goal oriented manner. This takes into account that criteria for what constitutes non-accidental and perceptually significant visual properties necessarily depend on the objectives and prior knowledge of the observer, as recognised in [4]. Going back to the lessons learned from text retrieval stated in section 1.2, for most content retrieval tasks it is perfectly adequate to approach the problem of retrieving images containing particular objects or characterisable by particular scene descriptors in an indicative fashion rather than a full analytic one. As long as the structure of the inference methods adequately accounts for the non-accidental properties that characterise an object or scene, relevance can be assessed by a combination of individually weak sources of evidence. These can be ranked in a hierarchy and further divided into those which are necessary for the object to be deemed present and those which are merely contingent. Such a ranking makes it possible to quickly eliminate highly improbable images and narrow down the search window.

Relevant images are those where we can find sufficient support for the candidate hypotheses derived from the query. Given enough redundancy and a manageable false positive rate, this will be resilient to failure of individual detection modules. For example, a query asking for images containing people does not require us to solve the full object recognition challenge of correctly identifying the location, gender, size, etc. of all people depicted in all images in the collection. As long as we maintain a notion of uncertainty, borderline false detections will simply result in lowly ranked retrieved images. Top query results will correspond to those images where our confidence of having found evidence for the presence of people is high relative to the other images, subject to the inevitable thresholding and identification of necessary features.

5.2 Query Parsing and Representation

OQUEL queries are parsed to yield a canonical abstract syntax tree (AST) representation of their syntactic structure. Figures 6, 7, 8, and 9 show sample queries and their ASTs. The structure of the syntax trees follows that of the grammar, i.e. the root node is the start symbol whose children represent particular requirements over image features and content. The leaf nodes of the tree correspond to the terminal symbols representing particular requirements such as shapedescriptors and visual categories. Intermediate nodes are syntactic categories instantiated with the relevant token (i.e. “AND”, “which is larger than”) which represent the relationships that are to be applied when evaluating the query.

5.3 Query Evaluation and Retrieval

Images are retrieved by evaluating the AST to compute a probability of relevance for each image. Due to the inherent uncertainty and complexity of the task, evaluation is performed in a manner which limits the requirement for runtime inference by quickly ruling out irrelevant images given the query. Query sentences consist of requirements which yield matching probabilities that are further modified and combined according to the top level syntax. Relations are evaluated by considering the image evidence returned by assessing their constituent specification blocks. These attempt to find a set of candidate image content (evidence) labelled with probabilities according to the location, content, and property specifications which occur in the syntax tree. A closure consisting of a pointer to the identified content (e.g. a region identifier or grid coordinate) together with the probability of relevance is passed as a message to higher levels in the tree for evaluation and fusion.

The overall approach therefore relies on passing messages (image structures labelled with probabilities of relevance), assigning weights to these messages according to higher level structural nodes (modifiers and relations), and integrating these at the topmost levels (specification blocks) in order to compute a belief state for the relevance of the evidence extracted from the given image for the given query. There are many approaches to using probabilities to quantify and combine uncertainties and beliefs in this way [35]. The approach adopted here is related to that of [25] in that it applies notions of weighting derived from the Dempster-Shafer theory of evidence to construct an information retrieval model which captures structure, significance, uncertainty, and partiality in the evaluation process.

At the leaf nodes of the AST, derived terms such asobject labels (“people”) and scene descriptions (“in-doors”) are either expanded into equivalent OQUELsentence structures or evaluated by Bayesian networksintegrating image content descriptors with additionalsources of evidence (e.g. a face detector). Bayesiannetworks tend to be context dependent in their appli-cability and may therefore give rise to brittle perfor-mance when applied to very general content labellingtasks. In the absence of additional information in thequery sentence itself, it was therefore found useful toevaluate mutually exclusive scene descriptors for ad-


Figure 6: Search results for OQUEL query A “bright red and stripy”.

Figure 7: Search results for OQUEL query B “people in centre”.


Figure 8: Search results for OQUEL query C “some water in the bottom half which is surrounded by trees and grass, size at least 10%”.

Figure 9: Search results for OQUEL query D “winter”.


For example, the concepts “winter” and “summer” are not merely negations of one another but correspond to Bayesian nets evaluating different sources of evidence. If both were to assign high probabilities to a particular image then the labelling is considered ambiguous and consequently assigned a lower relevance weight.
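The effect of this disambiguation step can be sketched as follows; the conflict threshold and the penalty rule shown here are assumptions chosen for illustration, not the values used in ICON:

    def disambiguate(p_winter: float, p_summer: float,
                     conflict_threshold: float = 0.6) -> float:
        """Down-weight the 'winter' score when the mutually exclusive
        'summer' descriptor is also highly confident."""
        if p_winter > conflict_threshold and p_summer > conflict_threshold:
            # Ambiguous labelling: penalise rather than discard outright.
            return p_winter * (1.0 - p_summer)
        return p_winter

    print(disambiguate(0.8, 0.2))  # 0.8  (unambiguous, unchanged)
    print(disambiguate(0.8, 0.7))  # 0.24 (ambiguous, down-weighted)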

The logical connectives are evaluated using thresholding and fuzzy logic (i.e. “p1 and p2” corresponds to “if (min(p1,p2)<=threshold) 0 else min(p1,p2)”). A similar approach is taken in evaluating predicates for low level image properties by using fuzzy quantifiers [15]. Image regions which match the target content requirements can then be used to assess any other specifications (shape, size, colour) which appear in the same requirement subtree within the query. Groups of regions which are deemed salient with respect to the query can be compared for the purpose of evaluating relations as mentioned above.
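The conjunction rule quoted above translates directly into code; the disjunction shown alongside it (taking the maximum) is an assumed counterpart, since only the “and” case is spelled out in the text, and the threshold value is likewise illustrative:

    def fuzzy_and(p1: float, p2: float, threshold: float = 0.2) -> float:
        """'p1 and p2': zero below the threshold, otherwise the weaker score."""
        m = min(p1, p2)
        return 0.0 if m <= threshold else m

    def fuzzy_or(p1: float, p2: float) -> float:
        """Assumed dual for 'p1 or p2': the stronger of the two scores."""
        return max(p1, p2)

    print(fuzzy_and(0.7, 0.4))  # 0.4
    print(fuzzy_and(0.7, 0.1))  # 0.0 (ruled out by the threshold)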

Figure 10: Examples of alternate ICON query interfaces using a sketch of classified target content (left) and region properties (right).

6 Evaluation

6.1 Qualitative and Quantitative Evaluation

Progress in CBIR research remains hampered by a lack of standards for comparative performance evaluation [31, 30]. This is an intrinsic problem due to the extreme ambiguity of visual information with respect to human versus computer interpretations of content and the strong dependence of relevance assessment upon the particular feature sets and query methods implemented by a given system. There are no publicly available image sets with associated ground truth data at the different levels of granularity required to do justice to different retrieval approaches, nor are there any standard sets of queries and manually ranked results which could easily be translated to the different formats and conventions adopted by different CBIR systems. Furthermore, there are no usable techniques for assessing important yet elusive usability criteria relating to the query interface as discussed in section 1.1. Real-world users (rarely addressed in the CBIR research literature) would be primarily interested in the ease with which they could formulate effective queries in a particular system to solve their search requirements with minimal effort for their chosen data set. Even if large scale standardised test sets and sample queries were available to the CBIR community, results derived from them might not be of much use in predicting performance on real-world retrieval tasks.

However, meaningful evaluation of retrieval methods is possible if carried out for a set of well specified retrieval tasks using the same underlying content representation and image database. We have assessed the performance of the OQUEL language in terms of its utility as a query tool, considering both user effort and retrieval performance. The ICON system has been in use at our research lab and was demonstrated at conferences such as ICCV 2001 and CVPR 2001.


Figure 11: Precision versus recall plots (relative percentages) for the four retrieval experiments.

Qualitatively speaking, users find that the OQUEL language provides a more natural and efficient mechanism for content-based querying than the other query methods present in ICON.

6.2 Experimental Method

While most evaluation of CBIR systems is performed on commercial image collections such as the Corel image sets, their usefulness is limited by the fact that they consist of very high quality photographic images and that the associated ground truth (category labels such as “China”, “Mountains”, “Food”) is frequently too high level and sparse to be of use in performance analysis [30]. We therefore chose a set of images consisting of around 670 Corel images augmented with 412 amateur digital pictures of highly variable quality and content. Manual relevance assessments in terms of relevant vs. non-relevant were carried out for all 1082 images over the test queries described below. In the case of the four test queries A–D below, the number of relevant images was 77, 158, 53, and 67 respectively.

In order to quantify the performance of our implementation of the OQUEL language in light of the inherent difficulties of CBIR evaluation, we focussed on comparing its utility as a retrieval tool with that of the other query modalities present in the ICON system. They are:

• Query-by-example: A set of weighted sample images (both positive and negative examples). Comparisons are performed on the basis of metrics such as a pair-wise best region match criterion and a classification pyramid distance measure.

• User drawn sketch: Desired image content composed by means of a sketch based query composition tool which uses a visual thesaurus of target image content corresponding to the set of visual categories.

• Feature range or predicate: Constraints on visual appearance features (colour, shape, texture) derived from the region segmentation.

The user may assign different weights to the various elements that comprise a query and can choose from a set of similarity metrics to specify the emphasis that is to be placed on the absolute position of target content within images and on overall compositional aspects. All of the query components have access to the same pre-computed image representation as described in section 4.1.

6.3 Test Queries

We chose four test queries which have the following expressions in the OQUEL language:

• Query A “bright red and stripy”

• Query B “people in centre”

• Query C “some water in the bottom half which is surrounded by trees and grass, size at least 10%”

• Query D “winter”

These are not meant to constitute a representative sample over all possible image queries (no such sample exists) but to illustrate performance and user search effort for conceptually different retrieval needs expressed at different levels of description. For each OQUEL query we created a further two queries expressed using the other search facilities of the ICON system:

• Composite query: a query which may combine a sketch with feature constraints as appropriate to yield best performance in reasonable time.

• Query-by-example: the single image maximising the normalised average rank metric was chosen as the query. This type of query is commonly used to assess baseline performance.

Figures 6, 7, 8, and 9 show the four OQUEL queries and their search results over the collection. Figure 10 depicts examples of alternate queries consisting of a combination of low level attributes and a user drawn sketch.

6.4 Results

To quantify performance, precision vs. recall was computed using the ground truth images for each test query as shown in figure 11. For each OQUEL query we also show the results for a combined query (“Comb.”) and a query-by-example (“QBE”) designed and optimised to meet the same user search requirement. It can be seen that OQUEL queries yield better results, especially for the top ranked images. In the case of query A, results are essentially the same as those for a query consisting of feature predicates for the region properties “stripy” and “red”. In general OQUEL queries are more robust to errors in the segmentation and region classification due to their ontological structure. Query-by-example in particular is usually insufficient to express more advanced concepts relating to spatial composition, feature invariances, or object level constraints.
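For reference, precision and recall at each rank position can be computed from a ranked result list and the manual relevance assessments as in the following sketch (standard definitions, not code from the ICON system):

    from typing import List, Set, Tuple

    def precision_recall(ranked_ids: List[str],
                         relevant: Set[str]) -> List[Tuple[float, float]]:
        """Return (recall, precision) after each retrieved image."""
        points, hits = [], 0
        for k, image_id in enumerate(ranked_ids, start=1):
            if image_id in relevant:
                hits += 1
            points.append((hits / len(relevant), hits / k))
        return points

    # Toy example: three relevant images, five retrieved.
    print(precision_recall(["a", "x", "b", "y", "c"], {"a", "b", "c"}))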

As recommended in [31], we also computed the normalised average rank, which is a useful and stable measure of relative performance in CBIR:

    \widetilde{Rank} = \frac{1}{N N_{rel}} \left( \sum_{i=1}^{N_{rel}} R_i - \frac{N_{rel}(N_{rel}-1)}{2} \right)    (4)

where R_i is the rank at which the i-th relevant image is retrieved, N_rel the number of relevant images, and N the total number of images in the collection. The value of \widetilde{Rank} ranges from 0 to 1, where 0 indicates perfect retrieval.
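Equation (4) can be transcribed directly as below; the sketch assumes ranks R_i counted from 1 (whether the best attainable value is then exactly 0 or 1/N depends on the rank convention, which the text leaves implicit):

    def normalised_average_rank(ranks: list, n_images: int) -> float:
        """Equation (4): (sum(R_i) - N_rel*(N_rel - 1)/2) / (N * N_rel).

        ranks    -- ranks R_i at which the relevant images were retrieved
        n_images -- total number of images N in the collection
        """
        n_rel = len(ranks)
        return (sum(ranks) - n_rel * (n_rel - 1) / 2) / (n_images * n_rel)

    # Query A: 77 relevant images in the 1082-image collection; retrieving
    # them at ranks 1..77 gives a value close to 0 (perfect retrieval).
    print(normalised_average_rank(list(range(1, 78)), 1082))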


Query        Normalised average rank
A - OQUEL    0.2185
A - Comb.    0.2184
A - QBE      0.3992
B - OQUEL    0.2924
B - Comb.    0.3081
B - QBE      0.3693
C - OQUEL    0.2637
C - Comb.    0.3159
C - QBE      0.3530
D - OQUEL    0.1944
D - Comb.    0.2582
D - QBE      0.2586

Comparisons with other query composition and retrieval paradigms implemented in ICON (sketch, sample images, property thresholds) therefore show that the OQUEL query language constitutes a more efficient and flexible retrieval tool. Few prior interpretative constraints are imposed and relevance assessments are carried out solely on the basis of the syntax and semantics of the query itself. Text queries have also generally proven more efficient to evaluate, since one only needs to analyse those aspects of the image content representation which are relevant to nodes in the corresponding syntax tree, and because the order of evaluation can be optimised to quickly rule out non-relevant images. Although the current system does not use an inverted file as its index, query evaluation took no more than 100 ms for the test queries.

7 Conclusions

7.1 Discussion

As explained above, one of the primary advantages of the proposed language-based query paradigm for CBIR is the ability to leave the problem of initial domain selection to the user. The retrieval process operates on a description of desired image content expressed according to an ontological hierarchy defined by the language and relates this at retrieval time to the available image content representation. Domain knowledge therefore exists at three levels: the structure and content of the user query, the ontology underlying the query language, and the retrieval mechanism which parses the user query and assesses image relevance. User queries may be quite high-level and employ general terms, thus placing the burden of finding feature combinations which discriminate relevant from non-relevant images on the ontology and the interpreter. Richer, more specific queries narrow down the retrieval focus. One can therefore trade off user composition effort against the need for greater language and parser complexity, depending on the relative costs involved in a real world CBIR context.

The current implementation does not constitute an exhaustive means of mapping retrieval requirements and relating them to images. Nor does the OQUEL language come close to embodying the full richness of a natural language specification of concepts relating to properties of photographic images. However, the current system does show that it is possible to utilise an ontological language framework to fuse different, individually weak and ambiguous sources of image information and content representation in a way which improves retrieval performance and usability of the system. Clearly there remain scalability issues, as additional classifiers will need to be added to improve the representational capacity of the query language. However, the notion of ontology based languages provides a powerful tool for extending retrieval systems by adding task and domain specific concept hierarchies at different levels of semantic granularity.

7.2 Summary and Future Outlook

Query composition is a relatively ill-understood part of research into CBIR and clearly merits greater attention if image retrieval systems are to enter the mainstream. Most systems for content based image retrieval offer query composition facilities based on examples, sketches, feature predicates, structured database queries, or keyword annotation. Compared to document retrieval using text queries, user search effort remains significantly higher, both in terms of initial query formulation and the need for relevance feedback.


This paper argues that query languages provide a flexible way of dealing with problems commonly encountered in CBIR such as ambiguity of image content and user intention and the semantic gap which exists between user and system notions of relevance. By basing such a language on an extensible ontology, one can explicitly state ontological commitments about categories, objects, attributes, and relations without having to pre-define any particular method of query evaluation or image interpretation. A central theme of the paper is the distinction between query description and image description languages and the power of a formally specifiable language featuring syntax and semantics in order to capture meaning in images relative to a query. The combination of individually weak and ambiguous clues to determine object presence and estimate overall probability of relevance builds on recent approaches to robust object recognition and can be seen as an attempt at extending the success of indicative methods for content representation in the field of text retrieval.

We present OQUEL as an example of such a language. It is a novel query description language which works on the basis of short text queries describing the user’s retrieval needs and does not rely on prior annotation of images. Query sentences can represent abstract and arbitrarily complex retrieval requirements at multiple levels and integrate multiple sources of evidence. The query language itself can be extended to represent customised ontologies defined on the basis of existing terms. An implementation of OQUEL for the ICON system demonstrates that efficient retrieval of general photographic images is possible through the use of short OQUEL queries consisting of natural language words and a simple syntax.

The use of more sophisticated natural language processing techniques would ease the current grammatical restrictions imposed by the syntax and allow statistical interpretation of more free-form query sentences consisting of words from an extended vocabulary. Perhaps most importantly, ongoing efforts aim to acquire the weighting of the Bayesian inference nets used in scene and object recognition using a training corpus and prior probabilities for the visual categories. The goal is to reduce the need for pre-wired knowledge such as “an image containing regions of snow and ice is more likely to depict a winter scene”. An approach such as [11] paired with the structural Expectation Maximisation method might provide a means of automatically acquiring new high level terms and their inference networks. The automated discovery of domain and general purpose ontologies, together with the means of relating these to lower level evidence, is an important challenge for data mining and machine learning research.

Acknowledgements

The authors would like to acknowledge directional guidance and support from AT&T Laboratories Research and financial assistance from the Royal Commission for the Exhibition of 1851.

References

[1] A. Abella and J. Kender. From pictures to words: Generating locative descriptions of objects in an image. ARPA94, pages II:909–918, 1994.

[2] M. Aiello, C. Areces, and M. de Rijke. Spatial reasoning for image retrieval. In Proc. International Workshop on Description Logics, 1999.

[3] S. Bechhofer and C. Goble. Description logics and multimedia—applying lessons learnt from the Galen project. In Proc. Workshop on Knowledge Representation for Interactive Multimedia Systems, 1996.

[4] A. Bobick and W. Richards. Classifying objects from visual information. Technical report, MIT AI Lab, 1986.

[5] N. Campbell, W. Mackeown, B. Thomas, and T. Troscianko. Interpreting image databases by region classification. Pattern Recognition (Special Edition on Image Databases), 30(4):555–563, April 1997.

[6] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Region-based image querying. In Proc. IEEE Workshop on Content-based Access of Image and Video Libraries, 1997.

[7] S. Chang, W. Chen, and H. Sundaram. Semantic visual templates: Linking visual features to semantics. In Proc. Workshop on Content Based Video Search and Retrieval, 1998.

[8] T.-S. Chua, K.-C. Teo, B.-C. Ooi, and K.-L. Tan. Using domain knowledge in querying image databases. In International Conference on Multimedia Modeling, 1996.

[9] I. Cox, M. Miller, T. Minka, T. Papathomas, and P. Yianilos. The Bayesian image retrieval system, PicHunter. IEEE Transactions on Image Processing, 9(1):20–37, 2000.

[10] M. Dobie, T. Robert, D. Joyce, M. Weal, P. Lewis, and W. Hall. A flexible architecture for content and concept based multimedia information exploration. In Proc. Second UK Conference on Image Retrieval, 1999.

[11] P. Duygulu, K. Barnard, J.F.H. De Freitas, and D.A. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proc. European Conference on Computer Vision, 2002.

[12] A. Evans, N. Thacker, and J. Mayhew. The use of geometric histograms for model-based object recognition. In Proc. British Machine Vision Conference, 1993.

[13] D. Fensel, I. Horrocks, F. van Harmelen, S. Decker, M. Erdmann, and M. Klein. OIL in a nutshell. In Knowledge Acquisition, Modeling and Management, pages 1–16, 2000.

[14] C. Fung and K. Loe. A new approach for image classification and retrieval. In Proc. ACM SIGIR Conference on Research and Development in Information Retrieval, pages 301–302. ACM, 1999.

[15] I. Glockner and A. Knoll. Fuzzy quantifiers for processing natural-language queries in content-based multimedia. Technical Report TR97-05, Faculty of Technology, University of Bielefeld, Germany, 1997.

[16] T. Gruber. Towards principles for the design of ontologies used for knowledge sharing. Formal Ontology in Conceptual Analysis and Knowledge Representation, 1993.

[17] V. Gudivada and V. Raghavan. Content-based image retrieval systems. IEEE Computer, 28(9):18–22, 1995.

[18] M. Hacid and C. Rigotti. Representing and reasoning on conceptual queries over image databases. In Proc. Eleventh International Symposium on Methodologies for Intelligent Systems, LNCS 1609, pages 340–348. Springer, 1999.

[19] C. Hsu, W. Chu, and R. Taira. A knowledge-based approach for retrieving images by content. Knowledge and Data Engineering, 8(4):522–532, 1996.

[20] W. Hu. An overview of the world wide web search technologies. In Proc. SCI2001 – 5th World Multiconference on Systemics, Cybernetics and Informatics, 2001.

[21] ICON System, AT&T Laboratories Cambridge. www.uk.research.att.com/permm/icon.html.

[22] A. Jaimes and S. Chang. Model-based classification of visual information for content-based retrieval. In Proc. SPIE Conference on Storage and Retrieval for Image and Video Databases, 1999.

[23] T. Kato, T. Kurita, N. Otsu, and K. Hirata. A sketch retrieval method for full color image database query by visual example. In Proc. Int. Conference on Pattern Recognition, pages I:530–533, 1992.

[24] P. Kelly, T. Cannon, and D. Hush. Query by image example: The comparison algorithm for navigating digital image databases (CANDID) approach. In Storage and Retrieval for Image and Video Databases (SPIE), pages 238–248, 1995.

[25] M. Lalmas. Applications of Uncertainty Formalisms, chapter Information retrieval and Dempster-Shafer's theory of evidence, pages 157–177. Springer, 1998.

[26] J. Lim. Learnable visual keywords for image classification. In Proc. ACM International Conference on Digital Libraries, 1999.

[27] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3:235–244, 1990.

[28] T. Mills, D. Pye, D. Sinclair, and K. Wood. Shoebox: A digital photo management system. Technical report, AT&T Laboratories Cambridge, 1999.

[29] A. Mojsilovic, J. Gomes, and B. Rogowitz. Isee: Perceptual features for image library navigation. In Proc. 2002 SPIE Human Vision and Electronic Imaging, 2002.

[30] H. Mueller, S. Marchand-Maillet, and T. Pun. The truth about Corel – evaluation in image retrieval. In Proc. Conf. on Image and Video Retrieval, LNCS 2383, pages 38–50. Springer, 2002.

[31] H. Mueller, W. Mueller, D. McG. Squire, S. Marchand-Maillet, and T. Pun. Performance evaluation in content-based image retrieval: Overview and proposals. Pattern Recognition Letters, 22(5):593–601, 2001.

[32] S. Nepal, M. Ramakrishna, and J. Thom. A fuzzy object query language (FOQL) for image databases. In Sixth International Conference on Database Systems for Advanced Applications, 1999.

[33] W. Niblack. The QBIC project: querying images by color, texture and shape. Technical report, IBM Research Report RJ-9203, 1993.

[34] V. Ogle and M. Stonebraker. Chabot: Retrieval from a relational database of images. IEEE Computer, 28(9):40–48, 1995.

[35] S. Parsons and A. Hunter. Applications of Uncertainty Formalisms, chapter A review of uncertainty handling formalisms, pages 8–37. Springer, 1998.

[36] G.M. Petrakis. Design and evaluation of spatial similarity approaches for image retrieval. Image and Vision Computing, 20:59–76, 2002.

[37] K. Rodden. How do people organise their photographs? In BCS IRSG 21st Annual Colloquium on Information Retrieval Research, 1999.

[38] N. Roussopoulos, C. Faloutsos, and T. Sellis. An efficient pictorial database system for PSQL. IEEE Transactions on Software Engineering, 14(5):639–659, 1988.

[39] W. Rucklidge. Locating objects using the Hausdorff distance. In Proc. International Conference on Computer Vision, 1995.

[40] Y. Rui, T. Huang, and S. Chang. Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 10:39–62, 1999.

[41] S. Santini, A. Gupta, and R. Jain. Emergent semantics through interaction in image databases. Knowledge and Data Engineering, 13(3):337–351, 2001.

[42] H. Shen, B. Ooi, and K. Tan. Finding semantically related images in the WWW. In Proc. ACM Multimedia, pages 491–492, 2000.

[43] D. Sinclair. Voronoi seeded colour image segmentation. Technical Report TR99-04, AT&T Laboratories Cambridge, 1999.

[44] D. Sinclair. Smooth region structure: folds, domes, bowls, ridges, valleys and slopes. In Proc. Conference on Computer Vision and Pattern Recognition, pages 389–394. IEEE Comput. Soc. Press, 2000.

[45] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(12):1349–1380, 2000.

[46] J. Smith and S. Chang. Image and video search engine for the world wide web. In Storage and Retrieval for Image and Video Databases (SPIE), pages 84–95, 1997.

[47] K. Sparck Jones. Information retrieval and artificial intelligence. Artificial Intelligence, 114:257–281, 1999.

[48] K. Tieu and P. Viola. Boosting image retrieval. In Proc. International Conference on Computer Vision, 2000.

[49] C.P. Town and D.A. Sinclair. Content based image retrieval using semantic visual categories. Technical Report MV01-211, Society for Manufacturing Engineers, 2001.

[50] N. Vasconcelos and A. Lippman. A Bayesian framework for content-based indexing and retrieval. In Proc. Conference on Computer Vision and Pattern Recognition, 1998.

[51] L. Wenyin, S. Dumais, Y. Sun, H. Zhang, M. Czerwinski, and B. Field. Semi-automatic image annotation. In Proc. Interact2001 Conference on Human Computer Interaction, 2001.

[52] M. Wood, N. Campbell, and B. Thomas. Iterative refinement by relevance feedback in content-based digital image retrieval. In Proc. ACM Multimedia 98, pages 13–17, 1998.