
Local Gradient Histogram For Word Spotting In Unconstrained Handwritten Documents


By Jose A. Rodriguez and Florent Perronnin, Xerox Research Centre Europe. Presented by Alon Tzeiri and Gennady Arhangorodsky.

Page 1: Local Gradient Histogram For Word Spotting  In Unconstrained Handwritten Documents

בס"ד

Page 2:

Introduction
The Article – Background & Contribution
The Article’s Features
Experiments
- Experiments with HMM & results
- Experiments with DTW & results

Page 3:

Handwritten Word Spotting (HWS) is the pattern classification task of detecting specific keywords in handwritten document images.

The main difficulty is the high intra-writer and inter-writer variability. The scenario of this work is an unconstrained handwritten word-spotting task: it includes a variety of writing styles, document layouts, spontaneous writing, artifacts, and spelling mistakes. For instance, a historical document to be indexed may have been written by various writers, so we must cope with the differences between their handwriting.

The goal is to detect keywords in realistic, unrestricted conditions.

Page 4:

Therefore, an important decision in the representation phase is the choice of a word descriptor.

A word is described either by a single feature vector or by a sequence of feature vectors.

The evaluation of such feature vectors is shown later on.

Two main word descriptors:

1. Holistic – a single feature vector is extracted for the whole image.
- Advantage: sufficient for digit recognition or small-vocabulary word recognition, and faster to evaluate.
- Disadvantage: limited performance.

2. Local/Sequential – the word is described as a 1-D sequence of feature vectors.
- Advantage: far more accurate at describing a word (as shown later on).
- Disadvantage: takes more time to evaluate.

Page 5:

Once we have spotted the desired keywords in the image we can:
* Route mail based on the presence of specified keywords
* Index historical documents
* Extract metadata from document images
* Categorize documents

Page 6:

Introduction
The Article – Background & Contribution
The Article’s Features
Experiments
- Experiments with HMM & results
- Experiments with DTW & results

Page 7:

i. The main contribution of this article is a new sequential feature set whose performance is well beyond that of the feature sets usually used in an unconstrained word-spotting task.

ii. The secondary contribution is an experimental comparison of local word descriptors for word spotting. To make the comparison independent of the classifier employed, results are reported with both DTW and HMM.

Page 8:

Let M be the original image and Mx = Gauss(M, x) the image after a Gaussian blur of strength x.

Let (a, b) be a point and M[a, b] the value of the pixel at this point in image M. A keypoint is a point that is an extremum of the function Fx(a, b) = |Mx[a, b] − M[a, b]|.

The Scale-Invariant Feature Transform (SIFT) is an algorithm that uses the orientation of the gradient at keypoints, across several blur strengths, to identify major objects inside the image.

This algorithm is the inspiration for this work because it bins the gradient orientations at each keypoint, which is similar to the features that will be used in this experiment.

Hence, we describe words by generating a sequence of such SIFT-like descriptors while moving a sliding window over the word image (elaborated later).

Find keypoints; discard those with low contrast and those too close to the edges.
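The keypoint rule above can be sketched in plain NumPy. This is a minimal sketch: the blur strength, contrast threshold, and border margin are illustrative choices, not values taken from the article.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur implemented with plain NumPy."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    # Convolve rows, then columns ("valid" removes the padding again).
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def find_keypoints(img, sigma=1.5, contrast_thresh=0.05, border=2):
    """Keypoints = local extrema of F(a, b) = |blurred[a, b] - img[a, b]|,
    discarding low-contrast responses and points too close to the edges."""
    response = np.abs(gaussian_blur(img, sigma) - img)
    H, W = img.shape
    keypoints = []
    for a in range(border, H - border):
        for b in range(border, W - border):
            patch = response[a - 1:a + 2, b - 1:b + 2]
            if response[a, b] == patch.max() and response[a, b] > contrast_thresh:
                keypoints.append((a, b))
    return keypoints
```

A single bright dot on a dark background, for example, yields exactly one keypoint at the dot.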

Page 9:

Introduction
The Article – Background & Contribution
The Article’s Features
Experiments
- Experiments with HMM & results
- Experiments with DTW & results

Page 10:

How does our HWS system work?

1. Segmentation – extract sub-images that potentially represent words, employing state-of-the-art techniques based on projection profiles and clustering of gap distances.

2. Fast Rejection & Pruning – a classifier using holistic features performs a first rejection pass, pruning about 90% of the segmented words while falsely rejecting about 5% of the keywords.

Page 11:

How does our HWS system work? (cont’d)

3. Normalization – non-pruned word images are normalized with respect to slant, skew, and text height.

4. Feature Vector Computation – for each normalized image (i.e. each non-pruned word), a sequence of feature vectors is computed (detailed later).

5. Scoring – for each sequence of feature vectors (i.e. each word), the word is considered a keyword if its score exceeds a predefined threshold.

The study focuses on step 4, the computation of the feature vector sequence, and more precisely on the choice of a feature set.

A few feature sets already exist, but because of their limited performance a new feature set is introduced and compared to the rest.

First, a brief look at the existing feature sets, which will later be compared to the new feature set using both HMM and DTW:

Page 12:

1. Column Features
These features are computed from the foreground pixels of each column, and the following values are concatenated:

1. Number of pixels of the word itself (i.e. the foreground)
2. Mean of the foreground pixel positions
3. Second-order moment, which can be one of the covariance matrix elements, or possibly the variance itself
4. Min/max of pixel positions
5. Difference of the min/max position from the previous column
6. Black/white transitions
7. Number of pixels between the upper line and the base line
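A sketch of how such per-column features might be computed. Feature 7, which requires estimated upper and base lines, is omitted here, and the exact layout of the vector is an assumption.

```python
import numpy as np

def column_features(binary_word):
    """Per-column features in the spirit of the list above.
    binary_word: 2-D array, 1 = foreground (ink), 0 = background."""
    H, W = binary_word.shape
    feats = []
    prev_min = prev_max = 0
    for x in range(W):
        col = binary_word[:, x]
        ys = np.nonzero(col)[0]
        n = len(ys)                                       # 1. foreground count
        mean = ys.mean() if n else 0.0                    # 2. mean position
        second = ((ys - mean) ** 2).mean() if n else 0.0  # 3. second-order moment
        ymin = ys.min() if n else 0                       # 4. min/max positions
        ymax = ys.max() if n else 0
        dmin = ymin - prev_min                            # 5. diff. from previous column
        dmax = ymax - prev_max
        trans = int(np.sum(col[1:] != col[:-1]))          # 6. black/white transitions
        feats.append([n, mean, second, ymin, ymax, dmin, dmax, trans])
        prev_min, prev_max = ymin, ymax
    return np.array(feats)
```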

Page 13: (figure slide)

Page 14:

Gaussian filtered :

Gradient vertical :

Gradient horizontal :

Page 15:

2. Pixel-Count Features
The window, which spans several columns, is divided into 4x4 cells after adjusting its height to the actual extent of the foreground pixels; the number of pixels in each cell is then counted.
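A minimal sketch of this idea: fit the vertical extent to the ink, split into a grid, and count foreground pixels per cell. The grid size is a parameter; everything else here is an assumption about implementation details.

```python
import numpy as np

def pixel_count_features(window, grid=(4, 4)):
    """Count foreground pixels in a grid of cells, after fitting the
    vertical extent to the actual ink content of the window."""
    ys, xs = np.nonzero(window)
    if len(ys) == 0:
        return np.zeros(grid[0] * grid[1])
    fitted = window[ys.min():ys.max() + 1, :]        # fit height to content
    M, N = grid
    h_edges = np.linspace(0, fitted.shape[0], M + 1).astype(int)
    w_edges = np.linspace(0, fitted.shape[1], N + 1).astype(int)
    counts = [fitted[h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1]].sum()
              for i in range(M) for j in range(N)]
    return np.array(counts, dtype=float)
```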

Page 16:

3. Gaussian Filter Features
Three features are computed per pixel and concatenated into a feature vector for each pixel in the window:

(a) Value after applying a vertical gradient filter
(b) Value after applying a horizontal gradient filter
(c) Value after applying a Gaussian filter.

Each window is 15 pixels high and 1 pixel wide, so the final vector has 3 × 15 = 45 features.
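A minimal sketch of the 45-dimensional vector, assuming a 15×3 patch (the centre column plus its horizontal neighbours, which a 1-pixel-wide window needs for the horizontal gradient) and simple stand-in kernels for the actual filters:

```python
import numpy as np

def gaussian_filter_features(patch):
    """Three per-pixel responses for the 15-pixel-high centre column of a
    15x3 patch, concatenated -> 3 * 15 = 45 features.
    Mean smoothing and central differences stand in for the real filters."""
    H = patch.shape[0]
    smooth, grad_v, grad_h = [], [], []
    for y in range(H):
        nb = patch[max(0, y - 1):y + 2, :]            # clipped neighbourhood
        smooth.append(nb.mean())                      # Gaussian-like smoothing
        grad_v.append(patch[min(H - 1, y + 1), 1] - patch[max(0, y - 1), 1])
        grad_h.append(patch[y, 2] - patch[y, 0])      # horizontal gradient
    return np.array(smooth + grad_v + grad_h)
```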

Page 17:

Local Gradient Histogram features!

The experiments will show that they lead to improved performance in a word spotting task.

Page 18:

1. Sliding Window:
- Given a word image I(x, y) with height H and width W,
- we center a window of width w < W and height H on each column of I.
- At each position, a feature vector is computed that depends only on the pixels inside that window.
- Hence, we obtain a sequence of W feature vectors.
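The column-centred windows above can be sketched as follows; zero-padding at the left and right borders is an assumption:

```python
import numpy as np

def sliding_windows(image, w):
    """Center a window of width w on every column, so a word image of
    width W yields a sequence of W sub-images (one per column)."""
    H, W = image.shape
    half = w // 2
    padded = np.pad(image, ((0, 0), (half, half)), mode="constant")
    return [padded[:, x:x + w] for x in range(W)]
```

Each sub-image is then turned into one feature vector, giving the W-long sequence.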

Page 19:

2. Division of the window into cells:
Each window is subdivided into rectangular cells employing one of the following schemes:
(i) split the window into M x N cells of identical dimensions;
(ii) same as above, but only over the window area that actually contains foreground pixels;
(iii) independently subdivide the three window regions determined by the lower and upper lines, resulting in a grid of (A+B+C) x N cells.

Page 20:

3. Gradient Histogram Computation:
- In general, at each cell a (local) gradient histogram feature is extracted.
- Gradients + smoothing: the image I(x, y) is smoothed to L(x, y) for de-noising, and the horizontal and vertical gradients are then computed from L.

Page 21:

3. Gradient Histogram Computation: cont’d
After computing the gradients, we obtain for each pixel (x, y) a magnitude m(x, y) and a direction θ(x, y).

Page 22:

3. Gradient Histogram Computation: cont’d

- The gradient angles are quantized into a number T of regularly spaced orientations.
- For each pixel (x, y) we determine which of the T orientations is closest to θ(x, y),
- and then add m(x, y) (the gradient magnitude) to the corresponding bin.

Page 23:

3. Gradient Histogram Computation: cont’d
Assigning each gradient only to the closest orientation may introduce aliasing noise. To reduce its impact, the gradient magnitude of a pixel can be shared between the two closest bins in proportion to the angular distances (see the previous page). The contribution of the pixel to each of the two bins is weighted accordingly.
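The soft assignment can be sketched as below. The central-difference gradients and the linear bin interpolation are standard HOG-style choices, assumed here rather than taken verbatim from the article:

```python
import numpy as np

def gradient_histogram(cell, T=8):
    """Orientation histogram for one cell, sharing each pixel's magnitude
    between the two closest of T orientation bins (linear interpolation)."""
    gx = np.zeros_like(cell)
    gy = np.zeros_like(cell)
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]        # horizontal central diff.
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]        # vertical central diff.
    mag = np.hypot(gx, gy)
    theta = np.mod(np.arctan2(gy, gx), 2 * np.pi)   # direction in [0, 2*pi)
    bin_width = 2 * np.pi / T
    pos = theta / bin_width                         # fractional bin position
    lo = np.floor(pos).astype(int) % T
    hi = (lo + 1) % T
    frac = pos - np.floor(pos)
    hist = np.zeros(T)
    np.add.at(hist, lo, mag * (1 - frac))           # share the magnitude between
    np.add.at(hist, hi, mag * frac)                 # the two closest bins
    return hist
```

For an image that increases linearly from left to right, all gradient mass lands in the bin for orientation 0.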

Page 24:

3. Gradient Histogram Computation: cont’d
Shown below is the calculation of one feature vector, computed for each window.

- As said earlier, there are W such M x N vectors; hence each word is characterized by a sequence of those W vectors.

Page 25:

4. Frame Normalization:

• The feature vector at one window position is called a frame.
• A frame is the concatenation of the gradient histograms computed in each cell.
• A performance gain is obtained by scaling each frame so that its components sum to 1.
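The concatenation-and-scaling step is small enough to show directly; skipping the division for an all-zero frame is an assumption to avoid dividing by zero:

```python
import numpy as np

def normalize_frame(cell_histograms):
    """Concatenate per-cell histograms into one frame and scale it so its
    components sum to 1 (left untouched if the frame is all zeros)."""
    frame = np.concatenate(cell_histograms).astype(float)
    total = frame.sum()
    return frame / total if total > 0 else frame
```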

Page 26:

5. Summary:

• If each window contains M x N cells (in the case of a regular split) and each cell is represented by a histogram of T bins,
• then each position of the sliding window is characterized by a feature vector of M x N x T components.
• The word is characterized by a sequence of W such vectors.
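As a quick dimensionality check, plugging in the grid and bin count reported in the experiments later (a regular 4x4 grid and 8 orientation bins):

```python
# Frame dimensionality for a regular M x N grid with T orientation bins.
M, N, T = 4, 4, 8                 # values used in the experiments below
frame_dim = M * N * T             # components per sliding-window position
print(frame_dim)                  # 128
```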

Page 27: (figure slide)

Page 28: (figure slide)

Page 29:

Introduction
The Article – Background & Contribution
The Article’s Features
Experiments
- Experiments with HMM & results
- Experiments with DTW & results

Page 30:

The data set is 630 letters in French from a customer department. The letters were processed by segmentation algorithms in order to extract word candidates, and the fast rejection depicted earlier was then applied to those candidates.

The experiment uses the 10 most frequent words, with 208–750 examples of each word; these words also serve as the keywords.

Page 31:

Before running the HMM, the keyword examples were divided into 5 groups.

Four groups were used for training (i.e. building) the model; the 5th group was used for testing, i.e. the experiment was actually run on it. These steps were repeated 5 times.

In the DTW test, 5 random images are used as queries. Those queries are the reference for comparing words, based on the distance of each word from the queries.

Page 32:

The negative distance to the closest of these queries is used as the similarity score.

In the DTW test, 5 repetitions were performed and the results were averaged across them.

Page 33:

DET (Detection Error Trade-off) curves compare one type of error against a complementary type, both depending on a detection threshold.

A point on the curve indicates the error rates for a specific threshold; the DET curve is the trajectory traced by that point as the threshold changes.

Usually the axes show the respective error rates in percent, so the closer the curve lies to the origin, the better the algorithm.

In all curves in this experiment, the false rejection rate is plotted against the false acceptance rate:
- A false rejection is an element that scored under the threshold but really does correspond to the keyword.
- A false acceptance is an element that passed the threshold but does not correspond to the keyword.
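A minimal sketch of how DET points could be computed from scores and ground-truth labels; the score and threshold values below are illustrative, not data from the experiment:

```python
import numpy as np

def det_points(scores, labels, thresholds):
    """False-acceptance / false-rejection rate pairs over a threshold sweep.
    labels: 1 for true keyword instances, 0 for non-keywords."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    points = []
    for t in thresholds:
        accepted = scores >= t
        frr = np.mean(~accepted[labels == 1])  # true keywords scored under t
        far = np.mean(accepted[labels == 0])   # non-keywords that passed t
        points.append((far, frr))
    return points
```

Sweeping the threshold over its whole range traces out the DET curve.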

Page 34:

To calibrate the experiment, the most advantageous grid configuration must be found. The precision was averaged across thresholds to produce the mean average precision for different grid layouts of the moving window, using HMM:

Grid type                                      Mean average precision
4x4 split, without fitting                     0.321
4x4 split, fitted to the foreground            0.717
4x4 split using the base and upper lines       0.655

Page 35:

The optimal number of orientation bins was found to be 8.

The fitted grid performs best because it is reasonably restricted to the actual content inside the window.

Page 36:

Mean average precision for the HMM method, and DET curves for the word “Résiliation”:

Feature                                        Mean average precision
Column Features (Marti & Bunke)                0.329
Pixel-Count Features (Vinciarelli)             0.336
Gaussian Filter Features (Rath & Manmatha)     0.135
The proposed features                          0.717

Page 37:

Again the mean average precision, this time using DTW:

Feature                                        Mean average precision
Column Features                                0.108
Pixel-Count Features                           0.117
Gaussian Filter Features                       0.092
The proposed features                          0.254