
Word Spotting and Recognition with Embedded Attributes

Jon Almazán, Albert Gordo, Alicia Fornés, Ernest Valveny

Abstract—This article addresses the problems of word spotting and word recognition on images. In word spotting, the goal is to find all instances of a query word in a dataset of images. In recognition, the goal is to recognize the content of the word image, usually aided by a dictionary or lexicon. We describe an approach in which both word images and text strings are embedded in a common vectorial subspace. This is achieved by a combination of label embedding and attributes learning, and a common subspace regression. In this subspace, images and strings that represent the same word are close together, allowing one to cast recognition and retrieval tasks as a nearest neighbor problem. Contrary to most other existing methods, our representation has a fixed length, is low dimensional, and is very fast to compute and, especially, to compare. We test our approach on four public datasets of both handwritten documents and natural images, showing results comparable or better than the state-of-the-art on spotting and recognition tasks.

Index Terms—Word image representation, Attribute-based representation, Handwritten Text, Scene Text, Word Spotting, Word Recognition


1 INTRODUCTION

Text understanding in images is an important problem that has drawn a lot of attention from the computer vision community since its beginnings. Text understanding covers many applications and tasks, most of which originated decades ago due to the digitalization of large collections of documents. This made necessary the development of methods able to extract information from these document images: layout analysis, information flow, transcription and localization of words, etc. Recently, and motivated by the exponential increase of publicly available image databases and personal collections of pictures, this interest now also embraces text understanding on natural images. Methods able to retrieve images containing a given word or to recognize words in a picture have also become feasible and useful.

In this paper we consider two problems related to text understanding: word spotting and word recognition. In word spotting, the goal is to find all instances of a query word in a dataset of images. The query word may be a text string – in which case it is usually referred to as query by string (QBS) or query by text (QBT) – or may also be an image – in which case it is usually referred to as query by example (QBE). In word recognition, the goal is to obtain a transcription of the query word image.

• J. Almazán, A. Fornés and E. Valveny are with the Computer Vision Center, Dept. de Ciències de la Computació, Universitat Autònoma de Barcelona, Spain. E-mail: {almazan,afornes,ernest}@cvc.uab.es.

• A. Gordo is with the LEAR Team, INRIA, Grenoble, France. E-mail: [email protected].

• A preliminary version of this paper [1] appears in ICCV 2013.

In many cases, including this work, it is assumed that a text dictionary or lexicon is supplied at test time, and that only words from that lexicon can be used as candidate transcriptions in the recognition task. In this work we will also assume that the location of the words in the images is provided, i.e., we have access to images of cropped words. If those were not available, text localization and segmentation techniques [2], [3], [4], [5] could be used, but we consider that out of the scope of this work¹.

Traditionally, word spotting and recognition have focused on document images [6], [7], [8], [9], [10], [11], [12], [13], [14], where the main challenges come from differences in writing styles: the writing styles of different writers may be completely different for the same word. Recently, however, with the development of powerful computer vision techniques during the last decade, there has been an increased interest in performing word spotting and recognition on natural images [15], [16], [3], [4], [17], [5], which poses different challenges such as huge variations in illumination, point of view, typography, etc.

Word spotting can be seen as a particular case of semantic content based image retrieval (CBIR), where the classes are very fine-grained – we are interested in exactly one particular word, and a difference of only one character is considered a negative result – but also contain a very large intra-class variability – writing styles, illumination, typography, etc., can make the same word look very different.

1. One may argue that, when words are cropped, one is no longer performing word spotting but word ranking or word retrieval. However, word spotting is the commonly accepted term even when the word images are cropped, and we follow that convention in this work.


Figure 1: Overview of the proposed method. Images are first projected into an attributes space with the embedding function $\phi_I$ after being encoded into a base feature representation with $f$. At the same time, label strings such as "hotel" are embedded into a label space of the same dimensionality using the embedding function $\phi_Y$. These two spaces, although similar, are not strictly comparable. Therefore, we project the embedded labels and attributes into a learned common subspace by minimizing a dissimilarity function $F(\mathcal{I}, \mathcal{Y}; U, V) = \lVert U^T \phi_I(\mathcal{I}) - V^T \phi_Y(\mathcal{Y}) \rVert_2^2 = \lVert \psi_I(\mathcal{I}) - \psi_Y(\mathcal{Y}) \rVert_2^2$. In this common subspace representations are comparable and labels and images that are relevant to each other are brought together.

In the same way, word recognition can be seen as a special case of very fine-grained, zero-shot classification, where we are interested in classifying a word image into (potentially) hundreds of thousands of classes, for which we may not have seen any training example. The examples on Figs. 9 and 10 illustrate these issues.

In this work we propose to address the spotting and recognition tasks by learning a common representation for word images and text strings. Using this representation, spotting and recognition become simple nearest neighbor problems. We first propose a label embedding approach for text labels inspired by the bag of characters string kernels [18], [19] used for example in the machine learning and biocomputing communities. The proposed approach embeds text strings into a d-dimensional binary space. In a nutshell, this embedding – which we dubbed pyramidal histogram of characters or PHOC – encodes if a particular character appears in a particular spatial region of the string (cf. Fig. 2). Then, this embedding is used as a source of character attributes: we will project word images into another d-dimensional space, more discriminative, where each dimension encodes how likely it is that the word image contains a particular character in a particular region, in obvious parallelism with the PHOC descriptor. By learning character attributes independently, training data is better used (since the same training words are used to train several attributes) and out of vocabulary (OOV) spotting and recognition (i.e., spotting and recognition at test time of words never observed during training) is straightforward. However, due to some differences (PHOCs are binary, while the attribute scores are not), direct comparison is not optimal and some calibration is needed. We finally propose to learn a low-dimensional common subspace with an associated metric between the PHOC embedding and the attributes embedding.

The advantages of this are twofold. First, it makes direct comparison between word images and text strings meaningful. Second, attribute scores of images of the same word are brought together since they are guided by their shared PHOC representation. An overview of the method can be seen in Figure 1.

By having images and text strings share a common subspace with a defined metric, word spotting and recognition become a simple nearest neighbor problem in a low-dimensional space. We can perform QBE and QBS (or even a hybrid QBE+S, where both an image and its text label are provided as queries) using exactly the same retrieval framework. The recognition task simply becomes finding the nearest neighbor of the image word in a text dictionary embedded first into the PHOC space and then into the common subspace. Since we use compact vectors, compression and indexing techniques such as Product Quantization [20] could now be used to perform spotting in very large datasets.

Figure 2: PHOC histogram of a word (here, "beyond") at levels 1, 2, and 3. The final PHOC histogram is the concatenation of these partial histograms.


To the best of our knowledge, we are the first to provide a unified framework where we can perform out of vocabulary (OOV) QBE and QBS retrieval as well as word recognition using the same compact word representations.

The rest of the paper is organized as follows. In Section 2, we review the related work in word spotting and recognition. In Section 3 we describe how to encode our images into a low-dimensional attribute representation. In Section 4 we describe the proposed common subspace learning. Section 5 suggests some practices to learn the attributes space and the common subspace when training data is scarce. Section 6 deals with the experimental validation of our approach. Finally, Section 7 concludes the paper.

This paper is an extended version of the work initially published in ICCV 2013 [1]. Novel contributions include a better base feature representation tailored for word images (Section 3), a more detailed formulation of the common subspace problem (Section 4), a bagging approach to learn with scarce data (Section 5), and evaluation on recognition tasks, as well as new datasets, including two popular benchmarks based on natural images (Section 6).

2 RELATED WORK

Here we review the works most related to some key aspects of our proposed approach.

2.1 Word Spotting and Recognition in Document Images

Word spotting in document images has attracted attention in the document analysis community during the last two decades [6], [7], [8], [9], [10], [11], [12], and still poses lots of challenges due to the difficulties of historical documents, different scripts, noise, handwritten documents, etc.

Because of this complexity, the most popular techniques on document word spotting have been based on describing word images as sequences of features of variable length and using techniques such as Dynamic Time Warping (DTW) or Hidden Markov Models (HMM) to classify them. Variable-length features are more flexible than feature vectors and have been known to lead to superior results in difficult word-spotting tasks since they can adapt better to the different variations of style and word length [6], [7], [10], [11], [12], [14], [21].

Unfortunately, this leads to two unsatisfying outcomes. First, due to the difficulties of learning with sequences, many supervised methods cannot perform OOV spotting, i.e., only a limited number of keywords, which need to be known at training time, can be used as queries. Second, because the methods deal with sequences of features, computing distances between words is usually slow at test time, usually quadratic with respect to the number of features. With efficient implementations they may be fast enough for some practical purposes (e.g., making a search constrained to a particular book [14]), although dealing with very large volumes of data (e.g., millions of images) at testing time would be very inefficient.

Indeed, with the steady increase of dataset sizes there has been a renewed interest in compact, fast-to-compare word representations. Early examples of holistic representations are the works of Manmatha et al. [8] and Keaton et al. [22]. In [8], a distance between binary word images is defined based on the result of XORing the images. In [22], a set of features based on projections and profiles is extracted and used to compare the images. In both cases, the methods are limited to tiny datasets. A more recent work [23] exploits the Fisher kernel framework [24] to construct the Fisher vector of a HMM. This representation has a fixed length and can be used for efficient spotting tasks, although the paper focuses on only 10 different keywords. Finally, recent approaches that are not limited to keywords can be found in [25], [26], [27]. Gatos et al. [25] perform a template matching of block-based image descriptors, Rusiñol et al. [26] use an aggregation of SIFT descriptors into a bag of visual words to describe images, while Almazán et al. [27] use HOG descriptors [28] combined with an exemplar-SVM framework [29]. These fast-to-compare representations allow them to perform word spotting using a sliding window over the whole document without segmenting it into individual words. Although the results on simple datasets are encouraging, the authors argue that these fixed-length descriptors do not offer enough flexibility to perform well on more complex datasets and especially in a multi-writer scenario.

Throughout this paper we follow these recent works [27], [26] and focus on fixed-length representations, which are faster to compare and store and can be used in large-scale scenarios. Our proposed approach based on attributes directly addresses the aforementioned problems: our attributes framework very naturally deals with OOV query words at test time, while producing discriminative, compact signatures that are fast to compute, compare, and store.

Regarding word recognition, handwritten recognition still poses an important challenge for the same reasons. As in word spotting, a popular approach is to train HMMs based on grapheme probabilities [13]. A model is first trained using labeled training data. At test time, given an image word and a text word, the model computes the probability of that text word being produced by the model when fed with the image word. Recognition can then be addressed by computing the probabilities of all the lexicon words given the query image and retrieving the nearest neighbor. As in the word spotting case, the main drawback here is the comparison speed, since computing these probabilities is orders of magnitude slower than computing a Euclidean distance or a dot product between vectorial representations.

2.2 Word Spotting and Recognition in Natural Images

The increasing interest in extracting textual information from real scenes is related to the recent growth of image databases such as Google Images or Flickr. Some interesting tasks have been recently proposed, e.g., localization and recognition of text in Google Street View images [15] or recognition in signs harvested from Google Images [30]. The high complexity of these images when compared to documents, mainly due to the large appearance variability, makes it very difficult to apply traditional techniques of the document analysis field. However, with the recent development of powerful computer vision techniques some new approaches have been proposed.

Some methods have focused on the problem of end-to-end word recognition, which comprises the tasks of text localization and recognition. Wang et al. [15] address this problem by combining techniques commonly applied in object recognition, such as Random Ferns and Pictorial Structures. They first detect a set of possible character candidate windows using a sliding window approach and then each word in the lexicon is matched to these detections. Finally, the one with the highest score is reported as the predicted word. Neumann and Matas [3], [4] also address this problem of end-to-end word recognition. In [3] they pose the character detection problem as a sequential selection from the set of Extremal Regions. Then, the recognition of candidate regions is done in a separate OCR stage using synthetic fonts. In [4] they introduce a novel approach for character detection and recognition where they first detect candidate characters as image regions which contain strokes of specific orientations and then these characters are recognized using specific models trained for each character and grouped into lines to form words. Bissacco et al. [5] take advantage of recent progress in machine learning, concretely in deep neural networks, and large scale language modeling. They first perform a text detection process returning candidate regions containing individual lines of text, which are then processed for text recognition. This recognition is done by identifying candidate character regions and maximizing a score that combines the character classifier and language model likelihoods. Although they use the lexicon information in a post-process to correct some recognition errors, one important advantage of this method is that it does not require an available lexicon for a full recognition of the image words.

Different problems are also explored by Mishra et al. [30], [16], [31]. In [30] they focus only on the problem of recognition and present a framework that uses the n-gram information in the language by combining these priors into a higher order potential function in a Conditional Random Field model defined on the image. In [31] they propose a method to perform image retrieval using textual cues. Instead of relying on a perfect localization and recognition to retrieve images containing a given text query, they propose a query-driven search: they find approximate locations of the characters in the text query, and then impose spatial constraints. By contrast, [16] addresses this problem from a different point of view. Rather than pre-selecting a set of character detections, they define a global model that incorporates language priors and all potential characters. They present a framework that exploits bottom-up cues, derived from Conditional Random Field models on individual character detections, and top-down cues, obtained from a lexicon-based prior. Goel et al. [32] address the problem of recognition as a retrieval framework: lexicon words are transformed into a collection of synthetic images and the recognition is posed as retrieving the best match from the lexicon image set. They use gradient-based features to represent the images and a weighted Dynamic Time Warping to perform the matching.

In general, the main structure of these methods consists of a first step of probabilistic detection of character candidate regions in the image, and a second step of recognition using character models and grouping constraints. This leads to models tailored for recognition, but with a limited usability for other tasks such as comparing two word images (for QBE), or storing word images using a compact representation for indexing purposes. Our model, by contrast, addresses these issues in a natural way and is useful both for recognition and retrieval tasks.

2.3 Zero-Shot Learning and Label Embedding

To learn how to retrieve and recognize words that have not been seen during training, it is necessary to be able to transfer knowledge between the training and testing samples. One of the most popular approaches to perform this zero-shot learning in computer vision involves the use of visual attributes [33], [34], [35], [36]. In this work we use character attributes to transfer the information between training and testing samples. Although the idea of separating words into characters and learning at the character level has been used before (see, e.g., the character HMM models of [6], [37]), these approaches have been tied to particular HMM models with sequence features, and so their performance has been bounded by them. In our case, we propose a broader framework since we do not constrain the choice of features or the method to learn the attributes.

Our work can also be related to label embedding methods [38], [39], [17], where labels are embedded into a different space and a compatibility function between images and labels is defined. Of those, the work of Rodriguez and Perronnin [17] is the most related to our work, since it also deals with text recognition and presents a text embedding approach (spatial pyramid of characters or SPOC) very similar to ours². The main difference stems from how the embedding is used. While in our case we use it as a source of attributes, and only then we try to find a common subspace between the attributes and the PHOCs, Rodriguez and Perronnin try to find a common subspace directly between their image representation (Fisher vectors [40]) and their SPOCs using a structured SVM framework. Our approach can be seen as a more regularized version of theirs, since we enforce that the projection that embeds our images into the common subspace can be decomposed into a matrix that projects the images into an attributes space.

3 ATTRIBUTES BASED WORD REPRESENTATION

In this section we describe how we obtain the attributes-based representation of a word image. We start by motivating our pyramidal histogram of characters (PHOC) representation, which embeds label strings into a d-dimensional space. We then show how to use this PHOC representation to encode word images.

One of the most popular approaches to perform supervised learning for word spotting is to learn models for particular keywords. A pool of positive and negative samples is available for each keyword, and a model (usually a HMM) is learned for each of them. At test time, it is possible to compute the probability of a given word being generated by that keyword model, and that can be used as a score. Note that this approach restricts one to keywords that need to be learned offline, usually with large amounts of data. In [12], this problem is addressed by learning a semicontinuous HMM (SC-HMM). The parameters of the SC-HMM are learned on a pool of unsupervised samples. Then, given a query, this SC-HMM model can be adapted, online, to represent the query. This method is not restricted to keywords and can perform OOV spotting. However, the labels of the training words were not used during training.

One disadvantage of these approaches that learn at the word level is that information is not shared between similar words. For example, if learning an HMM for a "car" keyword, "cat" would be considered a negative sample, and the shared information between them would not be explicitly used. We believe that sharing information between words is extremely important to learn good discriminative representations, and that the use of attributes is one way to achieve this goal. Attributes are semantic properties that can be used to describe images and categories

2. Both the conference version of this paper [1] and the work of Rodriguez and Perronnin [17] appeared simultaneously.

[34], and have recently gained a lot of popularity for image retrieval and classification tasks [34], [33], [41], [42], [36]. Attributes have also shown ability to transfer information in zero-shot learning settings [33], [34], [35], [36] and have been used for feature compression since they usually provide compact descriptors. These properties make them particularly suited for our word representation task, since they can transfer information from different training words and lead to compact signatures. The selection of these attributes is commonly a task-dependent process, so for their application to word spotting we should define them as word-discriminative and appearance-independent properties. In the following subsection we describe our label embedding approach, which embeds text strings into a binary vectorial space, and later we will show how to use this embedding as a source of word-discriminative visual attributes.

3.1 Text Label Embedding with PHOCs

One straightforward approach to embed text strings is to construct a (binary) histogram of characters. When using digits plus the English alphabet, this leads to a histogram of 36 dimensions³, where each dimension represents whether the text string contains a particular character or not. In an attributes context, these can be understood as the labels for attributes defined as "word contains an x" or "word contains a y".

However, this label embedding is not word-discriminative: words such as "listen" and "silent" share the same representation. Therefore, we propose to use a pyramid version of this histogram of characters, which we dubbed PHOC (see Fig. 2). Instead of finding characters on the whole word, we focus on different regions of the word. At level 2, we define attributes such as "word contains character x on the first half of the word" and "word contains character x on the second half of the word". Level 3 splits the word in 3 parts, level 4 in 4, etc. In practice, we use levels 2, 3, 4, and 5, leading to a histogram of (2 + 3 + 4 + 5) × 36 = 504 dimensions. Finally, we also add the 50 most common English bigrams at level 2, leading to 100 extra dimensions for a total of 604 dimensions. These bigrams let us encode relations between adjacent characters, which may help to disambiguate when finding a low-dimensional common subspace (cf. Section 4). In this case, when using a pyramidal encoding and bigrams, "listen" and "silent" have significantly different representations.

Given a transcription of a word we need to determine the regions of the pyramid where we assign each character.

3. We do not make any distinction between lower-case and upper-case letters, which leads to a case-insensitive representation. It is trivial to modify it to be case-sensitive, at the cost of adding another 26 attributes. It is also straightforward to include other punctuation marks.


Figure 3: Training process for the i-th attribute model. An SVM classifier is trained using the Fisher vector representation of the images and the i-th value of the PHOC representation as label.

For that, we first define the normalized occupancy of the k-th character of a word of length n as the interval $Occ(k, n) = \left[\frac{k}{n}, \frac{k+1}{n}\right]$, where the position k is zero-based. Note that this information is extracted from the word transcription, not from the word image. We remark that we do not have access to the exact position of the characters on the images at training time; only their transcription is available. We use the same formula to obtain the occupancy of region r at level l. Then, we assign a character to a region if the overlap area between their occupancies is larger than or equal to 50% of the occupancy area of the character, i.e., if $\frac{|Occ(k,n) \cap Occ(r,l)|}{|Occ(k,n)|} \geq 0.5$, where $|[a, b]| = b - a$. This is trivially extended to bigrams or trigrams.
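As an illustration of the embedding just described, the following is a minimal sketch of building the unigram part of a binary PHOC vector (levels 2–5, digits plus lowercase letters; the 50 bigram attributes are omitted for brevity). The function names and structure are ours, not code from the authors:

```python
import numpy as np

VOCAB = "abcdefghijklmnopqrstuvwxyz0123456789"  # 36 case-insensitive symbols
LEVELS = (2, 3, 4, 5)                            # pyramid levels used in the paper

def occupancy(k, n):
    """Normalized occupancy interval of the k-th of n slots (zero-based)."""
    return (k / n, (k + 1) / n)

def overlap(a, b):
    """Length of the intersection of two intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def phoc_unigrams(word):
    """Binary PHOC (unigram part only): sum(LEVELS) * len(VOCAB) = 504 dimensions."""
    word = word.lower()
    n = len(word)
    vec = np.zeros(sum(LEVELS) * len(VOCAB))
    offset = 0
    for level in LEVELS:
        for r in range(level):                    # region r at this level
            region = occupancy(r, level)
            for k, ch in enumerate(word):
                idx = VOCAB.find(ch)
                if idx < 0:
                    continue                      # ignore characters outside the vocabulary
                char_occ = occupancy(k, n)
                # assign the character if >= 50% of its occupancy lies in the region
                if overlap(char_occ, region) >= 0.5 * (char_occ[1] - char_occ[0]):
                    vec[offset + r * len(VOCAB) + idx] = 1
        offset += level * len(VOCAB)
    return vec

# e.g. phoc_unigrams("beyond") yields a 504-dimensional binary vector
```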

3.2 Learning Attributes with PHOCs

As we mentioned, the PHOC histograms can be seen as labels of attributes asking questions such as "word contains character x on the first half of the word" or "word contains character y on the second half of the word". These attributes are word-discriminative, since they are based on the word-discriminative PHOC embedding. If the attributes are learned using data coming from different writers or sources, the resulting models will also be robust to changes in appearance, style, etc.

To learn these attributes we use linear SVMs. Word images are first encoded into feature vectors, and these feature vectors are used together with the PHOC labels to learn SVM-based attribute models. The approach is illustrated in Figure 3. To represent the images, we use Fisher vectors (FV) [40], a state-of-the-art encoding method [43] which works well with linear classifiers. The FV can be seen as a bag of visual words [44] that encodes not only word counts but also higher-order statistics. In a nutshell, at training time, low-level descriptors (SIFTs in our case) are extracted from the training images and used to learn a Gaussian mixture model (GMM) $\lambda = \{w_k, \mu_k, \Sigma_k\}, k = 1 \ldots K$, where $w$ are the mixing weights, $\mu$ are the means, $\Sigma$ the (diagonal) covariances, and K is the number of Gaussians.

Figure 4: Spatial pyramids on word images. The sizes and contents of each spatial region are very dependent on the length of the word.

Then, to compute the representation of one image, one densely extracts its low-level descriptors and aggregates the gradients of the GMM model with respect to its parameters (usually only means and variances are considered, since weights add little extra information) evaluated at those points. This leads to a highly discriminative, high-dimensional signature. We note however that there is absolutely no requirement to use Fisher vectors or SVMs. Any encoding method and classification algorithm that transforms the input image into attribute scores could be used to replace them. We chose SVMs and FVs for their simplicity and effectiveness.
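To make the training step concrete, the following is a minimal sketch of learning one linear SVM per PHOC dimension on top of precomputed Fisher vectors, with scikit-learn used here as a stand-in for the authors' SVM solver; the regularization parameter C is a placeholder, not a value from the paper:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_attribute_models(fisher_vectors, phoc_labels, C=1.0):
    """fisher_vectors: (N, D) image encodings; phoc_labels: (N, d) binary PHOCs.
    Returns a (D, d) matrix W such that fisher_vectors @ W gives attribute scores."""
    N, D = fisher_vectors.shape
    d = phoc_labels.shape[1]
    W = np.zeros((D, d))
    for i in range(d):
        y = phoc_labels[:, i]
        if y.min() == y.max():          # attribute never (or always) present in training: skip
            continue
        clf = LinearSVC(C=C).fit(fisher_vectors, y)
        W[:, i] = clf.coef_.ravel()
    return W

# attribute scores of a new image x (its Fisher vector): x @ W  ->  d-dimensional vector
```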

3.3 Adding Spatial Information

One problem with many image encoding methods, including the FV, is that they do not explicitly encode the position of the features, which is extremely important to describe word images. If the spatially-aware attributes allow one to ask more precise questions about the location of the characters, spatial information on the image representation is needed to be able to correctly answer those questions.

One well-established approach to add spatial information is to use spatial pyramids [45]. Instead of aggregating the descriptors of the whole image, the image is first divided in k regions, and the features of each region are aggregated independently. This produces k independent descriptors that are concatenated to produce the final descriptor. When dealing with word images, however, this poses a problem: words of different lengths will produce regions of very different sizes and contents, see Fig. 4.

A different approach that works well in combination with FVs was proposed by Sanchez et al. [46].


Figure 5: Word image and the automatically adjusted reference box that defines the coordinate system.

In a nutshell, the SIFT descriptors of the image are enriched by appending the normalized x and y coordinates and the scale they were extracted at. Then, the GMM is learned not on the original SIFT features but on these enriched ones. When computing the FV using these features and GMM, the representation implicitly encodes the position of the features inside the word. They showed that, for natural images, this achieved results comparable to spatial pyramids with much lower-dimensional descriptors. When dealing with natural images, the x and y coordinates were normalized between −0.5 and 0.5. In the case of word images we follow the same approach. However, cropping differences during annotation lead to changes of the word position inside the image, making these coordinates less robust. Because of this, instead of using the whole word image as a coordinate system, we automatically and approximately find the beginning and end, as well as the baseline and the median line of the word (by greedily finding the smallest region that contains 95% of the density of the binarized image) and use that for our reference coordinate system. The center of that box corresponds to the origin, and the limits of the box are at [−0.5, 0.5]. Pixels outside of that reference box are still used with their corresponding coordinates. See Figure 5 for an illustration.
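A minimal sketch of the enrichment step, assuming the PCA-reduced SIFT descriptors, their keypoint coordinates, and the estimated reference box are already available (names are ours):

```python
import numpy as np

def enrich_descriptors(sift_pca, xs, ys, box):
    """Append normalized x, y coordinates to each local descriptor.
    sift_pca: (M, 62) PCA-reduced SIFTs; xs, ys: (M,) keypoint coordinates;
    box: (x0, y0, x1, y1) reference box (origin at its center, limits at +/-0.5)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    nx = (xs - cx) / (x1 - x0)    # pixels outside the box simply fall outside [-0.5, 0.5]
    ny = (ys - cy) / (y1 - y0)
    return np.hstack([sift_pca, nx[:, None], ny[:, None]])    # (M, 64) enriched descriptors
```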

Either when using spatial pyramids or xy enriching, the GMM vocabulary is learned using the whole image. We propose to improve these representations by learning region-specific GMMs. At training time, we split the images in regions similar to spatial pyramids, and learn an independent, specialized vocabulary on each region. These GMMs are then merged together and their weights are renormalized to sum 1.
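A rough sketch of this region-specific vocabulary idea, using scikit-learn mixtures for illustration (the paper fits the GMMs on the enriched SIFT features; the number of regions and Gaussians per region are configuration choices discussed in Section 6):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_region_gmms(descriptors, region_ids, n_regions, k_per_region=16):
    """Fit one diagonal-covariance GMM per spatial region and merge them into one vocabulary.
    descriptors: (M, d) enriched local descriptors; region_ids: (M,) region index of each one."""
    means, covs, weights = [], [], []
    for r in range(n_regions):
        gmm = GaussianMixture(n_components=k_per_region,
                              covariance_type='diag').fit(descriptors[region_ids == r])
        means.append(gmm.means_)
        covs.append(gmm.covariances_)
        weights.append(gmm.weights_)
    # merge and renormalize the mixing weights so they sum to 1
    weights = np.concatenate(weights)
    return np.vstack(means), np.vstack(covs), weights / weights.sum()
```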

We evaluated these different representations on the IAM dataset (cf. Section 6 for more details regarding the dataset). The goal here is to find which representation leads to better results at predicting the attributes at the right location; correctly predicting the attributes is of paramount importance, since it is deeply correlated with the final performance at retrieval and recognition tasks. We used the training set of IAM to learn the attributes, and then evaluate the average precision of each attribute on the test set and report the mean average precision. Figure 6 shows the results. It is clear that some type of spatial information is needed, either xy enriching or spatial pyramids.

Figure 6: Results of the attribute classifiers for different Fisher vector configurations on the IAM dataset (mean average precision, %): FV no SP: 40.62; FV no SP + xy: 49.51; FV SP: 56.16; FV SP + xy: 57.29; FV spec.: 40.54; FV spec. + xy: 61.00.

The specialized GMMs do not work well when used directly, which is not surprising, since the distribution of characters in words is in general (close to) uniform, and so the specialized GMMs are actually very similar. However, when learning specialized GMMs on enriched SIFT features, the coordinates add some information about the position that the specialized GMM is able to exploit independently on each region. The final result is that the specialized GMMs on enriched SIFTs lead to the best performance, and this is the representation that we will use throughout the rest of our experiments.

4 ATTRIBUTES AND LABELS COMMON SUBSPACE

In the previous section we presented an attributes-based representation of the word images. Although this representation is robust to appearance changes, special care has to be put when comparing different words, since the scores of one attribute may dominate over the scores of other attributes. Directly comparing embedded attributes and embedded text labels is also not well defined: although both lie in a similar space of the same dimensionality, the embedded text labels are binary, while the attribute scores are not and have different ranges. Even if directly comparing those representations yields reasonable results due to their similarities, such direct comparison is not well principled. Therefore, some calibration of the attribute scores and PHOCs is necessary.

One popular approach to calibrate SVM scores is Platt's scaling. It consists of fitting a sigmoid over the output scores to obtain calibrated probabilities, $P(y = 1|s) = (1 + \exp(\alpha s + \beta))^{-1}$, where $\alpha$ and $\beta$ can be estimated using MLE. In the recent work of [47], Extreme Value Theory is used to fit better probabilities to the scores and to find a multi-attribute space similarity. After the calibration, all scores are in the range [0, 1], which makes them more comparable between themselves – useful for the QBE task – as well as more comparable to the binary PHOC representation – useful for the QBS and recognition tasks.
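For reference, Platt's scaling amounts to a one-dimensional logistic fit on held-out attribute scores. A minimal per-attribute sketch, using scikit-learn's logistic regression in place of a dedicated MLE routine (function names are ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_calibrate(scores, labels):
    """Fit P(y=1|s) = 1 / (1 + exp(alpha*s + beta)) on held-out scores and binary labels.
    Both classes must be present in `labels` for the fit to be defined."""
    lr = LogisticRegression().fit(scores.reshape(-1, 1), labels)
    # sklearn models P(y=1|s) = 1 / (1 + exp(-(w*s + b))), so alpha = -w and beta = -b
    alpha, beta = -lr.coef_[0, 0], -lr.intercept_[0]
    return alpha, beta

def platt_apply(scores, alpha, beta):
    return 1.0 / (1.0 + np.exp(alpha * scores + beta))
```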


One disadvantage of such approaches is that they do not take into account the correlation between the different attributes. In our case this is particularly important, since our attributes are very correlated due to the multilevel encoding and the bigrams. Here we propose to perform the calibration of the scores jointly, since this can better exploit the correlation between different attributes. To achieve this goal, we first propose to address it as a ridge regression problem. However, this only takes into account the correlation between the attribute scores, and ignores the correlations between the attributes themselves. Therefore, we also propose a common subspace regression (CSR) that leads to a formulation equivalent to Canonical Correlation Analysis.

Let $\mathcal{I} = \{I_n, n = 1, \ldots, N\}$ be a set of N images available for training purposes, and let $\mathcal{Y} = \{Y_n, n = 1, \ldots, N\}$ be their associated labels. Let also $A = \phi_I(\mathcal{I}) \in \mathbb{R}^{d \times N}$ be the N images embedded in the d-dimensional attribute space, and let $B = \phi_Y(\mathcal{Y}) \in \{0, 1\}^{d \times N}$ be the N labels embedded in the d-dimensional label space. Then, one straightforward way to relate the attribute scores of A to the embedded labels of B is to define a distance function $F(I_i, Y_i; P) = \lVert P^T \phi_I(I_i) - \phi_Y(Y_i) \rVert_2^2$, with $P \in \mathbb{R}^{d \times d}$, and to minimize the distance across all the samples and their labels,

$$\arg\min_{P} \; \sum_{i}^{N} \frac{1}{2} F(I_i, Y_i; P) + \frac{1}{2}\Omega(P) \;=\; \arg\min_{P} \; \frac{1}{2}\lVert P^T A - B \rVert_F^2 + \frac{1}{2}\Omega(P), \qquad (1)$$

where $\Omega(P) = \alpha \lVert P \rVert_F^2$ is a regularization term and $\alpha$ controls the weight of the regularization. In this case this is equivalent to a ridge regression problem and P has a closed-form solution $P = (A A^T + \alpha I)^{-1} A B^T$, where I is the identity matrix. Since d is the number of attributes, which is low, solving this problem (which needs to be solved only once, at training time) is extremely fast.
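A minimal sketch of this closed-form solution, with A and B as defined above (d × N matrices) and a linear solver used instead of an explicit matrix inverse:

```python
import numpy as np

def ridge_projection(A, B, alpha):
    """Closed-form ridge regression: P = (A A^T + alpha I)^-1 A B^T, with A, B of size d x N."""
    d = A.shape[0]
    P = np.linalg.solve(A @ A.T + alpha * np.eye(d), A @ B.T)
    return P    # calibrated scores of a new attribute vector a: P.T @ a
```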

As we mentioned, however, this formulation only exploits the correlation of the attribute scores and ignores the correlations between the attributes themselves. We therefore modify it to project both views into a common subspace of dimensionality d′ (see Fig. 7). We define a new distance function $F(I_i, Y_i; U, V) = \lVert \psi_I(I_i) - \psi_Y(Y_i) \rVert_2^2$, with $\psi_I(\mathcal{I}) = U^T \phi_I(\mathcal{I})$ and $\psi_Y(\mathcal{Y}) = V^T \phi_Y(\mathcal{Y})$ being two linear embedding functions that use projection matrices $U, V \in \mathbb{R}^{d \times d'}$ to embed $\phi_I(\mathcal{I})$ and $\phi_Y(\mathcal{Y})$ into a common subspace. Then, analogous to the previous case, the goal is to minimize the distance across all the samples and their labels,

$$\arg\min_{U,V} \; \sum_{i}^{N} \frac{1}{2} F(I_i, Y_i; U, V) + \frac{1}{2}\Omega(U) + \frac{1}{2}\Omega(V) \;=\; \arg\min_{U,V} \; \frac{1}{2}\lVert U^T A - V^T B \rVert_F^2 + \frac{1}{2}\Omega(U) + \frac{1}{2}\Omega(V),$$
$$\text{s.t.} \quad \psi_I(\mathcal{I})\,\psi_I(\mathcal{I})^T = I, \qquad \psi_Y(\mathcal{Y})\,\psi_Y(\mathcal{Y})^T = I, \qquad (2)$$

where the orthogonality constraints ensure that the solutions found are not trivial.

By using Lagrange multipliers, taking derivatives with respect to U and V, and setting them equal to zero, one arrives at the following equalities:

$$\lambda (A A^T + \alpha I) u_k = A B^T v_k, \qquad \lambda (B B^T + \alpha I) v_k = B A^T u_k, \qquad (3)$$

where $u_k$ and $v_k$ are the k-th columns of matrices U and V, and $\lambda$ appears due to the Lagrange multipliers. When solving for $u_k$, one arrives at the following generalized eigenvalue problem:

$$A B^T (B B^T + \alpha I)^{-1} B A^T u_k = \lambda^2 (A A^T + \alpha I) u_k \qquad (4)$$

The first k generalized eigenvectors form the first k columns of the projection matrix U. This allows one to choose the final dimensionality d′. An analogous process can be used to obtain the projection matrix V. We can observe how, in this case, we explicitly use more relations between the data than in the regression case, which leads to better models. This model also allows one to control the output dimensionality and perform dimensionality reduction. In some of our experiments we will reduce the final dimensionality of our representations down to 80 dimensions while still obtaining state-of-the-art results. As in the regression case, the matrices U and V are very fast to obtain since they depend on the dimensionality of the attribute space, which is low.
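In practice, U can be obtained directly from the generalized eigenproblem in Eq. (4), and one convenient way to obtain V is to recover it from Eq. (3) so that its columns stay paired with those of U. A minimal sketch with SciPy, under the same notation (this is essentially regularized CCA; `d_out` plays the role of d′ and A, B are assumed mean-centered):

```python
import numpy as np
from scipy.linalg import eigh

def csr_projections(A, B, alpha, d_out):
    """Common subspace regression: solve Eq. (4) for U, then recover V from Eq. (3).
    A: d x N attribute scores, B: d x N embedded labels."""
    d = A.shape[0]
    Ca = A @ A.T + alpha * np.eye(d)
    Cb = B @ B.T + alpha * np.eye(d)
    Cab = A @ B.T
    # Eq. (4):  Cab Cb^-1 Cab^T u = lambda^2 Ca u   (generalized symmetric eigenproblem)
    M = Cab @ np.linalg.solve(Cb, Cab.T)
    w, U = eigh(M, Ca)                      # eigenvalues returned in ascending order
    order = np.argsort(-w)[:d_out]          # keep the top d' eigenvectors
    U = U[:, order]
    lam = np.sqrt(np.maximum(w[order], 1e-12))
    # Eq. (3):  lambda (B B^T + alpha I) v = B A^T u   ->   v = Cb^-1 B A^T u / lambda
    V = np.linalg.solve(Cb, Cab.T @ U) / lam
    return U, V                              # embeddings: U.T @ phi_I(I), V.T @ phi_Y(Y)
```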

Interestingly, these equations are also the solution to the Canonical Correlation Analysis (CCA) problem, where one tries to find the projections that maximize the correlation in a common subspace [48]. CCA is a tool to exploit information available from different data sources, used for example in retrieval [49] and clustering [50]. In [51], CCA was used to correlate image descriptors and their labels, which brought significant benefits for retrieval tasks. We believe this is the most similar use of CCA to our approach. However, while [51] combined images and labels with the hope of bringing some semantic consistency to the image representations, our goal here is to bring the imperfect predicted scores closer to their perfect value in a common subspace.



Figure 7: Projection of predicted attribute scores and attributes ground truth into a more correlated subspace with CSR.

The optimization shown in Equation (2) aims at minimizing the distance between the images and their corresponding labels, but makes no effort in pushing apart negative labels and learning to rank. It is inviting to, instead, learn the parameters of F that optimize the ranking loss directly, as done for example in Weston et al. [52]. However, we found that the results of the CSR were similar or superior to those obtained optimizing the ranking loss directly. We believe this is due to the non-convexity of the optimization problem. Interestingly, similar results were obtained in the text embedding method of [17], where the structured SVM approach used to optimize their (similar) compatibility function barely improved over the initialization of the parameters based on regression.

One may also note that the relation between the attribute scores and the binary attributes may not be linear, and that a kernelized approach (KCSR) could yield larger improvements. In this case, we follow the approach of [51]: we explicitly embed the data using a random Fourier feature (RFF) mapping [53], so that the dot-product in the embedded space approximately corresponds to a Gaussian kernel $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$ in the original space, and then perform a linear projection on the embedded space. In this case, at testing time, a sample is first projected into the attribute space, then embedded using the RFF mapping, and finally projected into the common subspace using the learned projections.
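A minimal sketch of the RFF step, using scikit-learn's RBFSampler, which implements this approximation; the gamma value, number of components, and stand-in data below are illustrative placeholders, not the settings used in the paper:

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler

# Approximate K(x, y) = exp(-gamma ||x - y||^2) with an explicit random Fourier feature map.
train_scores = np.random.randn(1000, 604)     # stand-in for calibrated attribute scores
rff = RBFSampler(gamma=0.5, n_components=3000, random_state=0).fit(train_scores)

Z_train = rff.transform(train_scores)         # the linear CSR projections are learned on Z_train
# at test time: embed a new score vector the same way, then apply the learned projection
z_query = rff.transform(np.random.randn(1, 604))
```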

5 LEARNING WITH SCARCE DATA

One inconvenience of learning the attribute space and the common subspace in two independent steps is the need for sufficiently large amounts of training data. This is because the data used to learn the common subspace should be different than the data used to learn the attribute space. The reason is that, if we embed the same data used to train the attributes into the attributes space, the scores of the SVMs will be severely overfit (most of them will be very close to −1 or 1), and therefore the common subspace learned using that data will be extremely biased, leading to inferior results. If enough training data is available, one can construct two disjoint training sets, train the attributes on one of the sets, and train the common subspace using the other set. However, this does not fully exploit the training data, since each training sample is used only to learn the attributes or the common subspace, but not both.

To overcome this problem, we propose to use a variant of bagging. The training data is split in several folds of training and validation partitions. The training and validation data of each fold is disjoint, but different folds will have overlapping data. In each fold, a model is learned using the training data of that fold, and this model is used to score the validation data. Therefore, the scores on the validation data are (almost) unbiased. Through several folds, the validation scores are added, and, for each sample, a counter that indicates how many times it has been scored is kept. At the end of the process, a global model is produced by averaging all the local models. By normalizing the score of every sample by the number of times it was scored, we also produce unbiased scores of the training set, which can be used to learn the common subspace without problems. The process to learn the model of one attribute and score the training set is depicted in Algorithm 1 using a Matlab-like notation. Note that some care needs to be taken to ensure that all training samples appear at least once in the validation set so they can be scored.

6 EXPERIMENTS

We start by describing the datasets we use throughout our experiments. We then describe the most relevant implementation details of our approach. After that, we present our results and compare them with the published state-of-the-art.

6.1 Datasets

We evaluate our method on four public datasets: two datasets of handwritten text documents, and two datasets of text in natural scenes.

The IAM off-line dataset⁴ [54] is a large dataset comprised of 1,539 pages of modern handwritten English text written by 657 different writers. The document images are annotated at word and line level and contain the transcriptions of more than 13,000 lines and 115,000 words. There also exists an official partition for writer-independent text line recognition that splits the pages in three different sets: a training set containing 6,161 lines, a validation set containing 1,840 lines and a test set containing 1,861 lines. These sets are writer independent, i.e., each writer contributed to one and only one set. Throughout our spotting and recognition experiments we will use this official partition, since it is the one most widely used and eases comparison with other methods.

4. http://www.iam.unibe.ch/fki/databases/iam-handwriting-database


Algorithm 1 Learn attribute model with bagging

Input: Training data X ∈ R^{D×N}
Input: Training labels Y ∈ {0, 1}^N
Input: Number of folds F
Output: Model W ∈ R^D
Output: Training data embedded onto the attribute space A ∈ R^N

W = zeros(1, D)
A = zeros(1, N)
count = zeros(1, N)
for f = 1 to F do
    % Split data into train and validation partitions
    [TrainIdx, ValIdx] = split(N, f)
    TrainData = X(:, TrainIdx);  TrainLabels = Y(TrainIdx)
    ValData   = X(:, ValIdx);    ValLabels   = Y(ValIdx)
    % Learn the model using the training data; use the validation set to validate the parameters
    Wf = learnSVM(TrainData, TrainLabels, ValData, ValLabels)
    % Encode the validation set into the attribute space and keep track of the number of updates
    A(ValIdx) = A(ValIdx) + Wf * ValData     % Wf is a 1×D row of SVM weights
    count(ValIdx) = count(ValIdx) + 1
    % Add Wf to the global model W
    W = W + Wf
end for
% Normalize
W = W / F
A = A ./ count

The George Washington (GW) dataset⁵ [10] contains 20 pages of letters written by George Washington and his associates in 1755. The writing styles present only small variations and it can be considered a single-writer dataset. The dataset contains approximately 5,000 words annotated at word level. There is no official partition for the GW dataset. We follow the approach of [7], [55] and split the GW dataset in two sets at word level containing 75% and 25% of the words. The first set is used to learn the attributes representation and the calibration, as well as for validation purposes, and the second set is used for testing purposes. The experiments are repeated 4 times with different train and test partitions and the results are averaged.

The IIIT 5K-word (IIIT5K) dataset⁶ [30] contains 5,000 cropped word images from scene texts and born-digital images, obtained from Google Image engine search.

5. http://www.iam.unibe.ch/fki/databases/iam-historical-document-database/washington-database

6. http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html

This is the largest dataset for natural image word spotting and recognition currently available. The official partition of the dataset contains two subsets of 2,000 and 3,000 images for training and testing purposes. These are the partitions we use in our experiments. The dataset also provides a global lexicon of more than half a million dictionary words that can be used for word recognition. Each word is associated with two lexicon subsets: one of 50 words, and one of 1,000 words.

The Street View Text (SVT) dataset⁷ [15] is comprised of images harvested from Google Street View where text from business signs and names appears. It contains more than 900 words annotated in 350 different images. In our experiments we use the official partition that splits the images in a train set of 257 word images and a test set of 647 word images. This dataset also provides a lexicon of 50 words per image for recognition purposes.

6.2 Implementation Details

We use Fisher vectors [40] as our base image representation. SIFT features are densely extracted at 6 different patch sizes (bin sizes of 2, 4, 6, 8, 10, and 12 pixels) from the images and reduced to 62 dimensions with PCA. Then, the normalized x and y coordinates are appended to the projected SIFT descriptors. To normalize the coordinates, we use the automatically detected reference boxes on the IAM and GW datasets. On IIIT5K and SVT, we observed that the minibox approach did not perform as well due to the nature of the backgrounds, which makes fitting the bounding box difficult, so we use the whole image as reference system. These features are then aggregated into a FV that considers the gradients with respect to the means and variances of the GMM generative model.

To learn the GMM we use 1 million SIFT features extracted from words from the training sets. We use 16 Gaussians per GMM, and learn the GMM in a structured manner using a 2 × 6 grid, leading to a GMM of 192 Gaussians. This produces histograms of 2 × 64 × 192 = 24,576 dimensions. For efficiency reasons, however, on the IAM dataset, we use SIFT features reduced to 30 dimensions instead of 62. This produces histograms of 12,288 dimensions. This reduction improved the training speed and storage costs while barely affecting the final results. The descriptors are then power- and L2-normalized. Please cf. [56] for more details and best practices regarding the construction of FV representations.

When computing the attribute representation, we use levels 2, 3, 4 and 5, as well as 50 common bigrams at level 2, leading to 604 dimensions when considering digits plus the 26 characters of the English alphabet.

7. http://vision.ucsd.edu/∼kai/svt/


Table 1: Retrieval results on the IAM, GW, IIIT5K and SVT datasets. Accuracy measured in mean average precision.

                  IAM            GW             IIIT5K         SVT
                  QBE    QBS     QBE    QBS     QBE    QBS     QBE    QBS
FV                15.66  –       62.72  –       24.21  –       22.88  –
Att.              44.60  39.25   89.85  67.64   55.71  35.91   48.94  60.32
Att. + Platts     48.09  66.86   93.04  91.29   62.05  62.30   51.47  76.01
Att. + Reg.       46.59  60.95   90.54  87.02   61.06  58.12   52.51  71.88
Att. + CSR        52.61  73.54   92.46  90.81   63.79  66.24   55.86  79.65
Att. + KCSR       55.73  73.72   92.90  91.11   63.42  65.15   55.87  79.35

We learn the attributes using the bagging approach of Algorithm 1 with 10 folds on all datasets. Since the training set of SVT is very small (only 257 images), we augment it by using the whole IIIT5K dataset as training set. Note that other recent works that evaluate on SVT also augment the data, either by producing synthetic training data [32] or by using in-house datasets [5].

When learning and projecting with CSR and KCSR, the representations (both attribute scores and embedded labels) are first L2-normalized and mean centered. We use CSR to project to a subspace of 80 dimensions on all datasets. For KCSR, we project into 160 dimensions on all datasets. Then, once projected, the representations are L2-normalized once again. This L2 normalization is important to compensate for the loss of energy after the dimensionality reduction [57] and significantly improved the overall accuracy of the methods. After L2 normalization, the Euclidean distance and the dot product produce equivalent rankings, since the two measures are then proportional. We therefore use the dot product, since we observed it to be approximately 20 times faster than the Euclidean distance on our system.
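Since all embedded vectors are L2-normalized, ranking an entire database against a query reduces to a single matrix-vector product. The sketch below only illustrates this point; the names are ours and it is not the authors' implementation.

    import numpy as np

    def rank_database(query_emb, database_emb):
        # query_emb:    (d,) L2-normalized embedding in the common subspace
        # database_emb: (n, d) matrix of L2-normalized database embeddings
        scores = database_emb @ query_emb   # dot products; same ranking as the
                                            # Euclidean distance after L2 normalization
        return np.argsort(-scores)          # database indices, most similar first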

6.3 Word Spotting

6.3.1 Protocol

In the word spotting task, the goal is to retrieve all instances of the query words in a "database" partition. Given a query, the database elements are sorted with respect to their similarity to the query. We then use mean average precision as the accuracy measure. Mean average precision is a standard measure of performance for retrieval tasks, and is essentially equivalent to the area below the precision-recall curve. Note that, since our search is exhaustive, the recall is always 100%. We use the test partition of the datasets as the database, and use each of its individual words as a query in a leave-one-out style. When performing query-by-example, the query image is removed from the dataset, and queries that have no extra relevant words in the database are discarded. When performing query-by-string, only one instance of each string is considered as a query, i.e., words that appear several times in the dataset are only used as a query once. On the IAM dataset it is customary not to use stopwords as queries; however, they still appear in the dataset and act as distractors. We follow this approach and do not use stopwords as queries on the IAM dataset. The IAM dataset also contains a set of lines marked as "error", where the transcription of the line is dubious and may or may not be correct. We have filtered out those lines, and they are used neither at training nor at testing.
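To make the evaluation measure concrete, a standard way of computing the average precision of a single query from its ranked binary relevance labels is sketched below; mean average precision is then the mean over all queries. This is the textbook definition, not code taken from the authors' evaluation scripts.

    import numpy as np

    def average_precision(relevance_in_rank_order):
        # relevance_in_rank_order: 1/0 labels of the database items, sorted by
        # decreasing similarity to the query.
        rel = np.asarray(relevance_in_rank_order, dtype=float)
        if rel.sum() == 0:
            return 0.0
        precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        return float((precision_at_k * rel).sum() / rel.sum())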

Some methods in the literature have used slightly different protocols, which we adopt when comparing to them. In the QBS experiments of Table 2, we report results using only queries that also appear in the training set, as the other approaches do, even though all the methods are able to perform OOV spotting. An exception is Aldavert et al. [55], who on GW report results both using only queries that appear in training and using all the queries. We follow the same approach. The results in parentheses marked with an asterisk denote that all queries were used. Furthermore, in the QBS experiments on IAM, we follow the state-of-the-art approach of [7] and perform line spotting instead of word spotting, i.e., we retrieve whole lines, which are correct if they contain the query word. To do so we group all the words in each line as a single entity, and define the distance between a query and a line as the distance between the query and the closest word in the line. Frinken et al. [7] also use only a subset of the test set containing approximately half of the test lines. We obtained the exact lines after contacting the authors and use only those lines when comparing to them.
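For the line-spotting protocol on IAM, the query-to-line distance described above can be written as a minimum over the words of the line. A minimal sketch, assuming L2-normalized embeddings and illustrative names:

    import numpy as np

    def query_to_line_distance(query_emb, line_word_embs):
        # line_word_embs: (m, d) embeddings of the m words in the line
        dists = np.linalg.norm(line_word_embs - query_emb, axis=1)
        return float(dists.min())   # the line scores as well as its closest word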

For the spotting task on IIIT5K, we compare ourselves with the results of Rodríguez and Perronnin [17]. However, they use precision at 1 instead of mean average precision as the accuracy metric. Furthermore, instead of retrieving the test queries on the test partition, they retrieve test queries directly on the training partition. We follow the same approach when comparing to them in Table 3.

6.3.2 Results

The results of our approach on the word spotting task are shown in Table 1. For each dataset, we compare the FV baseline (which can only be used in QBE tasks), the uncalibrated attributes embedding (Att.), the attributes calibrated with Platts (Att. + Platts), the one-way regression (Att. + Reg.), the common subspace regression (Att. + CSR), and the kernelized common subspace regression (Att. + KCSR). We highlight the following points:

FV baseline vs attributes. It is clear how the use of attributes dramatically increases the performance, even when no calibration at all is performed. This is not surprising, since the attributes space has been learned using significant amounts of labeled training data. It reinforces the idea that exploiting labeled data during training is very important to obtain competitive results. The QBS results, however, are not particularly good, since the direct comparison between attribute scores and PHOCs is not well principled.

Platts vs reg vs CSR. By learning a non-linear calibration using Platts, the attribute results significantly improve for all datasets. Although Platts does not find a common subspace, it puts the attributes embedding and the label embedding in the same range of values, which clearly helps the performance, particularly in the QBS case. The results obtained with regression are unstable. Although they outperform the uncalibrated attributes, they only bring a slight improvement over Platts, and only in some cases. While regression exploits the correlation of attribute scores, the nonlinearity of Platts seems to give it an edge. However, when considering the common subspace regression, which exploits the correlations of both the attribute scores and the embedded labels, the results increase drastically, always outperforming Platts or regression except on the GW dataset, where they are very close.

CSR vs KCSR. The kernelized version of CSR obtains results very similar to CSR. Our intuition is that, due to the higher dimensionality of the Random Fourier Features, the method is more prone to overfit and requires more training data to show its benefits. Indeed, on the IAM dataset, which contains larger amounts of training data, KCSR clearly outperforms CSR.

Hybrid retrieval. We also explore a hybrid spotting approach, where both an embedded image ψI(Ii) and its embedded transcription ψY(Yi) are available as a query. Since both representations lie in the same space after the projection, we can create a new hybrid representation by a weighted sum, i.e., ψH(Ii, Yi) = αψI(Ii) + (1 − α)ψY(Yi), and use it as a query. Figure 8 shows the results of this hybrid approach on our datasets as a function of the weight α. We observe how the results improve when using both representations at query time.
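A minimal sketch of the hybrid query construction, assuming both embeddings are already projected into the common subspace; re-normalizing the combination before retrieval is our assumption, not stated in the text.

    import numpy as np

    def hybrid_query(psi_image, psi_string, alpha):
        # psi_image, psi_string: embedded image and embedded transcription of the
        # same word, both living in the common (K)CSR subspace.
        h = alpha * psi_image + (1.0 - alpha) * psi_string
        return h / np.linalg.norm(h)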

Figure 8: Hybrid spotting results with KCSR as a function of the weight α assigned to the visual part of the query. Panels: (a) IAM, (b) GW, (c) IIIT5K, (d) SVT.

Comparison with the state of the art. We compare our approach with recently published methods on document and natural images. For the document datasets (Table 2), we first focus on QBE and compare our approach with the FV baseline and a DTW approach based on Vinciarelli [58] features. On GW, we report the results of [12] on DTW as well as their results with a semi-continuous HMM (SC-HMM). Although the results of [12] are not exactly comparable, since the partitions are different (we followed [7], [55] instead of [12]), we provide both our DTW results and the ones reported in their paper to at least give an approximate idea of the expected differences due to the partition.

Table 2: Word spotting comparison with the state of the art on IAM and GW. QBS results only use queries that also appear in the training set, except those marked with an asterisk. QBS results on IAM perform line spotting instead of word spotting and use only half of the lines of the test set.

                             IAM       GW
QBE
  Baseline FV                15.66     62.72
  DTW                        12.30     60.63
  DTW [12]                   –         50.00
  SC-HMM [12]                –         53.00
  Proposed (Platts)          48.09     93.04
  Proposed (KCSR)            55.73     92.90
QBS
  cHMM [6], [7]              36.00     60.00
  Aldavert et al. [55]       –         76.20 (56.54*)
  Frinken et al. [7]         78.00     84.00
  Proposed (KCSR)            80.64     93.93 (91.11*)

We observe how the FV baseline is already comparable to or outperforms some popular methods on both datasets. This is in line with the findings of [23], where the FV of an HMM outperforms the standard HMM on keyword classification tasks. Also, despite the possible differences in partitions, the advantage of the proposed method over methods that do not perform supervised learning is clear. For the QBS case, we compare ourselves with the recent methods of [7] and [55], as well as with the character HMM approach of [6] as evaluated in [7]. All these works use labeled training data, which translates into more competitive results than in the QBE case. However, the expressiveness of the attributes, the use of discriminative image representations, and the learning of a common subspace give an edge to the proposed method. We also note how our results do not particularly suffer when using words that were not seen during training (93.93 vs 91.11 on GW), as opposed to the approach of [55] (76.20 vs 56.54), showing a much better adaptation to unseen words.

We also compare our approach on natural images, see Table 3. We compare ourselves with the recent label-embedding method of [17], as well as with their DTW baseline. Note that in this case the reported measure is precision at 1 instead of mean average precision. We observe how our approach clearly outperforms their results. Part of this improvement may, however, be due to using better Fisher vectors, since we use structured GMMs with enriched SIFT descriptors while Rodríguez and Perronnin use spatial pyramids.

Table 3: Word spotting comparison with the state of the art on the IIIT5K dataset for the QBE task.

Method                            Top-1 acc.
FV                                40.70
DTW [17]                          37.00
Rodríguez and Perronnin [17]      43.70
Proposed (KCSR)                   72.28

Qualitative results. We finally show qualitative results with samples from the IAM and IIIT5K datasets in Figure 9. We observe how some difficult words are correctly retrieved. Some common errors include words with common patterns (like a double tt) or different terminations ("window" vs "windows", "billboards" vs "billboard").

6.4 Word Recognition

6.4.1 Protocol

In word recognition, the goal is to find the transcription of the query word. In our experiments, the transcription is limited to words appearing in a lexicon. IIIT5K and SVT have officially associated lexicons. In SVT, each query word has an associated lexicon of 50 words, one of which corresponds to the transcription of the query. IIIT5K has two associated lexicons per word, one of 50 words and one of 1,000 words. For IAM, we use a closed lexicon that contains all the words that appear in the test set, as in one of the experiments of [13]. In our case, since we embed both images and strings into a common subspace, the transcription problem is equivalent to finding the nearest neighbor of the query image in a dataset containing the lexicon embedded into the common subspace. On IIIT5K and SVT, the standard evaluation metric is precision at 1, i.e., is the top retrieved transcription correct? On document datasets, more common measures are the word error rate (WER) and the character error rate (CER). The CER between two words is defined as the edit distance or Levenshtein distance between them, i.e., the minimum number of character insertions, deletions, and substitutions needed to transform one word into the other, normalized by the length of the words. We report the mean CER between the queries and their top retrieved transcription. The WER is defined similarly, but considering words in a text line instead of characters inside a word. This is typically done because many transcription methods work on whole lines at the same time. If words are already segmented, the WER is equivalent to a Hamming distance between words in a line, and the dataset WER is the mean percentage of words that are wrongly transcribed in each line.
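As a reference for the CER computation, a standard Levenshtein distance and its length normalization are sketched below. The choice of normalizer (the length of the ground-truth word) is our reading of the text, which does not spell out this detail.

    def levenshtein(a, b):
        # Minimum number of character insertions, deletions and substitutions
        # needed to turn string a into string b (standard dynamic programming).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def cer(ground_truth_word, retrieved_transcription):
        return levenshtein(ground_truth_word, retrieved_transcription) / max(len(ground_truth_word), 1)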

6.4.2 Results

The results on word recognition on IAM are shown in Table 4, where we compare with the state-of-the-art approach of [13], which uses a combination of HMMs and neural networks to clean and normalize the images and produce word models. We compare our results in terms of WER and CER, where a lower score is better. Although we do not match the results of [13], our results are competitive without performing any costly preprocessing on the images and with much faster recognition speeds.

Table 4: Recognition error rates on the IAM dataset.

Method                          WER     CER
España-Boquera et al. [13]      15.50   6.90
Proposed (KCSR)                 20.01   11.27

Table 5 shows results on the IIIT5K and SVT datasets. We observe how our approach clearly outperforms all published results on IIIT5K. On SVT, our method is only outperformed by Google's very recent PhotoOCR [5] approach. However, [5] uses 2.2 million training samples labeled at the character level, while we use only 5 thousand images labeled at the word level.

Finally, Figure 10 shows some qualitative results on word recognition with samples from the IAM, IIIT5K and SVT datasets. Common errors include very similar-looking words ("Heather" and "feather"), low-resolution images ("Mart" and "man"), or unorthodox font styles ("Neumos" vs "bimbos").


Figure 9: Qualitative results on word spotting on the IAM and IIIT5K datasets. Words relevant to the query are outlined in green.


Figure 10: Qualitative results on word recognition on the IAM, IIIT5K, and SVT datasets.

Table 5: Recognition results on the IIIT5K and SVT datasets. Accuracy measured as precision at 1.

Dataset   Method                     |y| = 50   |y| = 1000
IIIT5K    Mishra et al. [30]         64.10      57.50
          Rodríguez and P. [17]      76.10      57.40
          Proposed (KCSR)            88.57      75.60
SVT       ABBYY [32]                 35.00      –
          Mishra et al. [16]         73.26      –
          Goel et al. [32]           77.28      –
          PhotoOCR [5]               90.39      –
          Proposed (KCSR)            87.01      –

6.5 Computational Analysis

The improvements of our approach are not only in terms of accuracy and memory use. Our optimized DTW implementation in C took more than 2 hours to compare the 5,000 queries of IAM against the 16,000 dataset words on an 8-core Intel Xeon W3520 at 2.67 GHz with 16 GB of RAM, using one single core. By contrast, comparing the same queries using our attributes embedded with CSR involves only one matrix multiplication and took less than 1 second on the same machine, about 0.2 milliseconds per query. For recognition tasks we only need to compare the query with the given lexicon. Recognizing a query with a lexicon of 1,000 words on IIIT5K takes less than 0.02 milliseconds. At query time we also need to extract the FV representation of the query image, which involves the dense SIFT extraction and the FV encoding, and then embed it into the CSR/KCSR subspace. This process takes, on average, 0.77 seconds per image.

In general, these numbers compare favorably with other approaches. The method proposed in [7] takes a few milliseconds to process a single line of the IAM for the QBS task. PhotoOCR [5] reports times of around 1.4 seconds to recognize a cropped image using a setup tuned for accuracy. An unoptimized implementation of the method proposed in [4] takes 35 seconds to locate and recognize the words in an image, and the same task takes around 8 seconds with the method in [31].

Regarding the cost of learning an attribute model with an SVM, learning a single model on the IAM database using our SGD implementation on a single core with more than 60,000 training samples took, on average, 35 seconds, including the cross-validation of the SVM parameters. That is, the complete training process of the attributes, including the bagging, can be done in about 2 days on a single CPU. Since attributes and folds are independent, this is trivially parallelizable. Training the deep neural network proposed in [5] took 2 days on a cluster of 200 cores. Learning the CSR and KCSR projections is also very fast since the dimensionality of the attribute space is low: approximately 1.5 seconds for CSR and approximately 60 seconds for KCSR. Like the attribute models, these projections need to be learned only once, offline.

7 CONCLUSIONS AND FUTURE WORK

This paper proposes an approach to represent and compare word images, both in the document and in the natural image domains. We show how an attributes-based approach built on a pyramidal histogram of characters can be used to learn how to embed the word images and their textual transcriptions into a shared, more discriminative space, where the similarity between words is independent of the writing and font style, illumination, capture angle, etc. This attributes representation leads to a unified representation of word images and strings, resulting in a method that allows one to perform either query-by-example or query-by-string searches, as well as image transcription, in a unified framework. We test our method on four public datasets of documents and natural images, outperforming state-of-the-art approaches and showing that the proposed attribute-based representation is well suited for word searches, whether the queries are images or strings, in handwritten and natural images.

Regarding future work, we have observed empirically that the quality of the attribute models is quite dependent on the available number of training samples, and that the models for rare characters in rare positions were not particularly good. We believe that having larger training sets could improve the quality of those models and lead to better overall results. Towards this goal, we have experimented with augmenting the training sets by applying transformations such as changes in slant to the available images. Preliminary results indicate that this can significantly boost the results. As an example, we improved the QBE results on IAM from 55.73 to 59.62. Along the same line, our learning approach is currently based on whole word images and does not require segmenting the individual characters of the words during training or test, which we consider an advantage. However, it has been shown that learning on characters can lead to large improvements in accuracy [5]. We want to study how we could learn on individual segmented characters (using existing datasets such as Chars74K [59]) and transfer that information into our system without needing to actually segment the characters of the target dataset at any time. We are also interested in lifting the need to have cropped word images by integrating the current word representation into an efficient sliding-window framework.

ACKNOWLEDGEMENTS

J. Almazán, A. Fornés, and E. Valveny are partially supported by the Spanish projects TIN2011-24631, TIN2009-14633-C03-03 and TIN2012-37475-C02-02, by the EU project ERC-2010-AdG-20100407-269796, and by a research grant of the UAB (471-01-8/09).

REFERENCES

[1] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Handwritten word spotting with corrected attributes," in ICCV, 2013.
[2] R. Manmatha and J. Rothfeder, "A scale space approach for automatically segmenting words from historical handwritten documents," IEEE TPAMI, 2005.
[3] L. Neumann and J. Matas, "Real-Time Scene Text Localization and Recognition," in CVPR, 2012.
[4] L. Neumann and J. Matas, "Scene Text Localization and Recognition with Oriented Stroke Detection," in ICCV, 2013.
[5] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, "PhotoOCR: Reading Text in Uncontrolled Conditions," in ICCV, 2013.
[6] A. Fischer, A. Keller, V. Frinken, and H. Bunke, "HMM-based word spotting in handwritten documents using subword models," in ICPR, 2010.
[7] V. Frinken, A. Fischer, R. Manmatha, and H. Bunke, "A novel word spotting method based on recurrent neural networks," IEEE TPAMI, 2012.
[8] R. Manmatha, C. Han, and E. M. Riseman, "Word spotting: A new approach to indexing handwriting," in CVPR, 1996.
[9] T. Rath, R. Manmatha, and V. Lavrenko, "A search engine for historical manuscript images," in SIGIR, 2004.
[10] T. Rath and R. Manmatha, "Word spotting for historical documents," IJDAR, 2007.
[11] J. A. Rodríguez-Serrano and F. Perronnin, "Local gradient histogram features for word spotting in unconstrained handwritten documents," in ICFHR, 2008.
[12] J. A. Rodríguez-Serrano and F. Perronnin, "A model-based sequence similarity with application to handwritten word-spotting," IEEE TPAMI, 2012.
[13] S. España-Boquera, M. Castro-Bleda, J. Gorbe-Moya, and F. Zamora-Martinez, "Improving offline handwritten text recognition with hybrid HMM/ANN models," IEEE TPAMI, 2011.
[14] I. Yalniz and R. Manmatha, "An efficient framework for searching text in noisy documents," in DAS, 2012.
[15] K. Wang, B. Babenko, and S. Belongie, "End-to-end Scene Text Recognition," in ICCV, 2011.
[16] A. Mishra, K. Alahari, and C. V. Jawahar, "Top-down and bottom-up cues for scene text recognition," in CVPR, 2012.
[17] J. A. Rodríguez-Serrano and F. Perronnin, "Label embedding for text recognition," in BMVC, 2013.
[18] C. Leslie, E. Eskin, and W. Noble, "The spectrum kernel: A string kernel for SVM protein classification," in Pacific Symposium on Biocomputing, 2002.
[19] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins, "Text classification using string kernels," JMLR, 2002.
[20] H. Jégou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE TPAMI, 2011.


[21] S. Lu, L. Li, and C. L. Tan, "Document image retrieval through word shape coding," IEEE TPAMI, 2008.
[22] P. Keaton, H. Greenspan, and R. Goodman, "Keyword spotting for cursive document retrieval," in DIA, 1997.
[23] F. Perronnin and J. A. Rodríguez-Serrano, "Fisher kernels for handwritten word-spotting," in ICDAR, 2009.
[24] T. Jaakkola and D. Haussler, "Exploiting generative models in discriminative classifiers," in NIPS, 1999.
[25] B. Gatos and I. Pratikakis, "Segmentation-free word spotting in historical printed documents," in ICDAR, 2009.
[26] M. Rusiñol, D. Aldavert, R. Toledo, and J. Lladós, "Browsing heterogeneous document collections by a segmentation-free word spotting method," in ICDAR, 2011.
[27] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Efficient exemplar word spotting," in BMVC, 2012.
[28] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.
[29] T. Malisiewicz, A. Gupta, and A. Efros, "Ensemble of Exemplar-SVMs for object detection and beyond," in ICCV, 2011.
[30] A. Mishra, K. Alahari, and C. V. Jawahar, "Scene text recognition using higher order language priors," in BMVC, 2012.
[31] A. Mishra, K. Alahari, and C. V. Jawahar, "Image Retrieval using Textual Cues," in ICCV, 2013.
[32] V. Goel, A. Mishra, K. Alahari, and C. V. Jawahar, "Whole is Greater than Sum of Parts: Recognizing Scene Text Words," in ICDAR, 2013.
[33] C. H. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," in CVPR, 2009.
[34] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Describing objects by their attributes," in CVPR, 2009.
[35] M. Rohrbach, M. Stark, and B. Schiele, "Evaluating knowledge transfer and zero-shot learning in a large-scale setting," in CVPR, 2011.
[36] C. Wah and S. Belongie, "Attribute-Based Detection of Unfamiliar Classes with Humans in the Loop," in CVPR, 2013.
[37] A. Fischer, A. Keller, V. Frinken, and H. Bunke, "Lexicon-free handwritten word spotting using character HMMs," PRL, 2012.
[38] S. Bengio, J. Weston, and D. Grangier, "Label embedding trees for large multi-class tasks," in NIPS, 2010.
[39] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid, "Label-embedding for attribute-based classification," in CVPR, 2013.
[40] F. Perronnin, J. Sánchez, and T. Mensink, "Improving the Fisher kernel for large-scale image classification," in ECCV, 2010.
[41] B. Siddiquie, R. S. Feris, and L. S. Davis, "Image Ranking and Retrieval Based on Multi-Attribute Queries," in CVPR, 2011.
[42] L. Torresani, M. Szummer, and A. Fitzgibbon, "Efficient object category recognition using classemes," in ECCV, 2010.
[43] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, "The devil is in the details: an evaluation of recent feature encoding methods," in BMVC, 2011.
[44] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
[45] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: spatial pyramid matching for recognizing natural scene categories," in CVPR, 2006.
[46] J. Sánchez, F. Perronnin, and T. de Campos, "Modeling the spatial layout of images beyond spatial pyramids," PRL, 2012.
[47] W. J. Scheirer, N. Kumar, P. N. Belhumeur, and T. E. Boult, "Multi-attribute spaces: Calibration for attribute fusion and similarity search," in CVPR, 2012.
[48] H. Hotelling, "Relations between two sets of variates," Biometrika, 1936.
[49] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor, "Canonical correlation analysis: an overview with application to learning methods," tech. rep., Department of Computer Science, Royal Holloway, University of London, 2003.
[50] M. B. Blaschko and C. H. Lampert, "Correlational spectral clustering," in CVPR, 2008.
[51] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, "Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval," IEEE TPAMI, 2012.
[52] J. Weston, S. Bengio, and N. Usunier, "Large scale image annotation: Learning to rank with joint word-image embedding," in ECML, 2010.
[53] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in NIPS, 2007.
[54] U.-V. Marti and H. Bunke, "The IAM-database: An English sentence database for off-line handwriting recognition," IJDAR, 2002.
[55] D. Aldavert, M. Rusiñol, R. Toledo, and J. Lladós, "Integrating Visual and Textual Cues for Query-by-String Word Spotting," in ICDAR, 2013.
[56] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, "Image classification with the Fisher vector: Theory and practice," IJCV, 2013.
[57] H. Jégou and O. Chum, "Negative evidences and co-occurrences in image retrieval: the benefit of PCA and whitening," in ECCV, 2012.
[58] A. Vinciarelli and S. Bengio, "Offline cursive word recognition using continuous density hidden Markov models trained with PCA or ICA features," in ICPR, 2002.
[59] T. de Campos, B. Babu, and M. Varma, "Character recognition in natural images," in VISAPP, 2009.

Jon Almazán received his B.Sc. degree in Computer Science from the Universitat Jaume I (UJI) in 2008, and his M.Sc. degree in Computer Vision and Artificial Intelligence from the Universitat Autònoma de Barcelona (UAB) in 2010. He is currently a PhD student in the Computer Science Department and the Computer Vision Center at UAB, under the supervision of Ernest Valveny and Alicia Fornés. His research work is mainly focused on document image analysis, image retrieval, and word spotting and recognition.

Albert Gordo received his PhD in computer vision in 2013 from the Universitat Autònoma de Barcelona / Computer Vision Center, in collaboration with the Xerox Research Centre Europe in Grenoble, France. After that, he spent a year as a postdoc at the LEAR group at INRIA. He is currently a postdoc at XRCE. His research interests range from document analysis and word spotting and recognition to fine-grained object classification and detection, with a special focus on large-scale problems.

Alicia Fornés received the B.S. degree from the Universitat de les Illes Balears (UIB) in 2003 and the M.S. degree from the Universitat Autònoma de Barcelona (UAB) in 2005. She obtained the Ph.D. degree on writer identification of old music scores from the UAB in 2009. She was the recipient of the AERFAI (Image Analysis and Pattern Recognition Spanish Association) best thesis award 2009-2010. She is currently a postdoctoral researcher in the Computer Vision Center. Her research interests include document analysis, symbol recognition, optical music recognition, historical documents, handwriting recognition and writer identification.


Ernest Valveny is an Associate Professor at the Computer Science Department of the Universitat Autònoma de Barcelona (UAB), where he obtained his PhD degree in 1999. He is also a member of the Computer Vision Center (CVC) at UAB. His research work has mainly focused on document analysis and pattern recognition, and more specifically on the fields of shape representation, text recognition and retrieval, document classification and graph matching. He is currently a member of the IAPR and of the editorial board of the International Journal on Document Analysis and Recognition, and has served as a reviewer for many international journals and conferences. He has published more than 100 papers in international journals and conferences.