
Hidden Tree Markov Models for Document Image Classification

Michelangelo Diligenti∗  Paolo Frasconi†  Marco Gori∗

∗ Dept. of Information Engineering, Università di Siena, Italy
† Dept. of Systems and Computer Science, Università di Firenze, Italy

Abstract

Classification is an important problem in image document processing and is often a preliminary step towards recognition, understanding, and information extraction. In this paper, the problem is formulated in the framework of concept learning and each category corresponds to the set of image documents with similar physical structure. We propose a solution based on two algorithmic ideas. First, we obtain a structured representation of images based on labeled XY-trees (this representation informs the learner about important relationships between image sub-constituents). Second, we propose a probabilistic architecture that extends hidden Markov models for learning probability distributions defined on spaces of labeled trees. Finally, a successful application of this method to the categorization of commercial invoices is presented.

Keywords: Document classification, Machine Learning, Markovian models, Structured information.

1 Introduction

In spite of the explosive increase of electronic communication in recent years, a significant number of documents are still printed on paper. Image document processing aims at automating operations normally carried out by individuals on printed documents, including reading, understanding, extracting information, and organizing documents according to given criteria. Operations such as information extraction can be greatly simplified if some background knowledge about physical layout is available. Layout classification is therefore an important preliminary step for guiding the subsequent recognition task [6].

One important application is the automatic management of commercial invoice forms. Here a classifier can be employed to determine which company issued the invoice and/or which type of form was employed. Besides other obvious applications (e.g., organizing invoices in separate folders), classification can also help the construction of information extraction modules, since the presence, the position, and the type of various pieces of information vary significantly with the issuing company and with the specific form that was employed. Digital libraries are also a target application domain. In this case, layout classification can partition scanned pages according to useful criteria, such as the presence of titles or advertisements, or detect table-of-contents pages. Again, data organization or automatic metadata extraction can greatly benefit from knowing the class of the document under analysis.

In this paper we formulate layout classification as a multiclass classification problem. The choice of document representation is very important for this learning problem. Flat representations (e.g., pixel based) do not carry robust information about the position and the number of basic constituents of the image (such as logos, headers, and text fields). More appropriate representations in this domain involve either the extraction of global features or the use of recursive data structures. When global features are extracted, vector-based machine learning algorithms can be employed for classification. For example, the methods in [19, 20] collect several statistics from the bitmap image and subsequently use oblique decision trees or self-organizing maps for classification. On the other hand, the use of recursive representations makes it possible to preserve relationships among the image constituents. For example, in [6, 7] documents are represented as structured patterns, while in [3] both the physical and the logical structure are used to recognize documents. The structured representation suggested in this paper is obtained by extracting an XY-tree from the document image [17]. Nodes are associated with specific boxes of the document and edges represent a containment relation. For each box we run several feature extractors, in order to obtain a real vector of attributes. The resulting representation aims at capturing local features of constituents, along with structural relationships among constituents, and is therefore less sensitive to interclass variability.

A second important practical aspect in document classification is that many systems must deal with a relatively large number of classes, and classes normally vary over time. Discriminant approaches (including multilayer neural networks and support vector machines) can achieve very good accuracy on many learning tasks. However, if the set of classes changes, the classifier must be retrained from scratch. Generative probabilistic models avoid this problem by learning a class-conditional distribution $P(Y \mid C = c)$ for each class $c$. In this way, given an arbitrary set of classes $\mathcal{C}$ (the realizations that the class variable $C$ can take on) and a generative model for each class $c \in \mathcal{C}$, we can obtain

$$h(Y) = \arg\max_{c \in \mathcal{C}} P(Y \mid C = c)\, P(C = c).$$
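
As a minimal illustration of this decision rule (the interface below is hypothetical; models[c] stands for any per-class generative model, such as the HTMM introduced in Section 2):

```python
import math

def classify(y, models, priors):
    """models[c](y) returns log P(Y = y | C = c); priors[c] = P(C = c).

    Implements h(Y) = argmax_c P(Y|C=c) P(C=c) in log space. Adding or
    removing a class only requires adding or removing its model entry,
    without retraining the remaining models.
    """
    return max(models, key=lambda c: models[c](y) + math.log(priors[c]))
```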

Based on the above considerations, we suggest a classification method that (1) uses structured data representations and (2) is based on a generative model. The combination of these two features requires a new machine learning tool, since generative models are abundant in the literature for both attribute-value and sequential representations, but not for hierarchical data structures. On the other hand, compositional approaches to pattern recognition have been mainly proposed in conjunction with early structural grammars [10] (lacking sound methods for training) or with discriminant models, such as recursive neural networks [9].

The generative classifier we propose in this paper is based on a generalization of hidden Markov models (HMMs), a powerful tool for probabilistic sequence modeling. The recent view of the HMM as a particular case of Bayesian networks [21, 2, 15] has provided new theoretical insights and helped in conceiving extensions of the standard model in a sound and formally elegant framework. Whereas standard HMMs are commonly employed for learning in sequential domains, our extension can learn probability distributions on labeled trees. We call the novel architecture the Hidden Tree Markov Model (HTMM).¹ This extension was first suggested in [9] as one of the possible architectures for connectionist learning of data structures. In this paper, we develop the architectural details not included in [9] and we present a solution to a real commercial invoice classification problem.

¹ The model should not be confused with Hidden Markov Decision Trees [14], a graphical model for learning in sequential domains using a tree-structured architecture.

2 Hidden tree-Markov models

We now develop a general theoretical framework for modeling probability distributions defined over spaces of trees. In the following we assume that each instance, denoted $Y$, is a labeled $m$-ary tree. Each node, $v$, is labeled by a $K$-tuple of (discrete or continuous) random variables, denoted $Y(v)$.

2.1 Model definition

Compositionality in a generative stochastic model can be achieved using probabilistic independence. For each tree $Y$, we can write $P(Y) = P(Y(r), Y_1, \ldots, Y_m)$, where $Y(r)$ is the label at the root node $r$ and the $Y_i$ are random sub-trees. We say that the first-order tree-Markov property is satisfied for $Y$ if

$$P(Y_i \mid Y_j, Y(r)) = P(Y_i \mid Y(r)) \qquad \forall i, j = 1, \ldots, m. \tag{1}$$

From this property one can prove by induction that

$$P(Y) = P(Y(r)) \prod_{v \in V \setminus \{r\}} P(Y(v) \mid Y(\mathrm{pa}[v])), \tag{2}$$

where $\mathrm{pa}[v]$ denotes the parent of vertex $v$ and $V$ denotes the set of vertices of $Y$. Unfortunately, such a compositional formulation of a model has two main disadvantages. First, the conditional independence assumption (1) is very unrealistic and does not allow the model to capture correlations that might exist between any two non-adjacent nodes. Second, the parameterization of $P(Y(v) \mid Y(\mathrm{pa}[v]))$ might be problematic, since the number of parameters grows exponentially with the number of attributes in each label. As in HMMs for sequences, we thus introduce hidden state variables in charge of "storing" and "summarizing" distant information. More precisely, let $\mathcal{X} = \{x_1, \ldots, x_n\}$ be a finite set of states. We assume that each tree $Y$ is generated by an underlying hidden tree $X$, a data structure defined as follows. The skeleton of $X$ is identical to the skeleton of $Y$. Nodes of $X$ are labeled by hidden state variables, denoted $X(v)$, taking realizations on $\mathcal{X}$. In this way, $P(Y) = \sum_X P(Y, X)$, where the sum over $X$ indicates the marginalization over all the hidden trees.

The first-order tree-Markov property in the case of hidden tree models holds if the two following conditions are met: (1) the first-order tree-Markov property must hold for the hidden tree $X$, and (2) for every $v$ the observation $Y(v)$ is independent of the rest given $X(v)$. These conditions imply the following global factorization formula:

$$P(Y, X) = P(X(r))\, P(Y(r) \mid X(r)) \prod_{v \in V \setminus \{r\}} P(Y(v) \mid X(v))\, P(X(v) \mid X(\mathrm{pa}[v])), \tag{3}$$

which can be graphically represented by the belief network shown in Figure 1. The associated stochastic model will be called the hidden tree-Markov model (HTMM).


Figure 1: Belief network for a hidden tree-Markov model. White nodes belong to the underlying hidden tree X; shaded nodes are observations.
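
As a concrete reading of Eq. (3), the following sketch evaluates the joint probability of one (Y, X) pair for a fully stationary model with discrete labels; the Node class and all parameter names are illustrative, not the authors' implementation:

```python
import math

class Node:
    def __init__(self, label, children=None):
        self.label = label            # observed discrete label Y(v)
        self.children = children or []

def log_joint(root, hidden, prior, trans, emit):
    """log P(Y, X) of Eq. (3) for one hidden-state assignment.

    hidden : dict mapping node -> hidden state X(v)
    prior  : prior[x]     = P(X(r) = x)
    trans  : trans[xp][x] = P(X(v) = x | X(pa[v]) = xp)  (stationary)
    emit   : emit[x][y]   = P(Y(v) = y | X(v) = x)       (stationary)
    """
    x_r = hidden[root]
    logp = math.log(prior[x_r]) + math.log(emit[x_r][root.label])
    stack = [(root, x_r)]
    while stack:                      # one transition + one emission per
        v, x_v = stack.pop()          # non-root node, as in Eq. (3)
        for c in v.children:
            x_c = hidden[c]
            logp += math.log(trans[x_v][x_c]) + math.log(emit[x_c][c.label])
            stack.append((c, x_c))
    return logp
```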

2.2 Parameterization

According to Eq. (3), the parameters of an HTMM are $P(X(r))$, a prior on the root hidden state; $P(Y(v) \mid X(v))$, the emission parameters; and $P(X(v) \mid X(\mathrm{pa}[v]))$, the transition parameters. Unless sharing mechanisms are defined, these parameters vary with the node being considered, yielding a large total number of parameters that may quickly lead to overfitting problems. In HTMMs, several forms of stationarity can be assumed, each associated with a different form of parameter sharing. We say that an HTMM is fully stationary if it is both transition and emission stationary, i.e., neither $P(X(v) \mid X(\mathrm{pa}[v]))$ nor $P(Y(v) \mid X(v))$ depends on $v$. In this case the total number of parameters is $n(1 + n + K)$, where $K$ is the number of parameters in each emission model $P(Y \mid X = x)$. We say that the model is locally stationary if it is emission stationary and $P(X(v) \mid X(\mathrm{pa}[v])) = P(X(w) \mid X(\mathrm{pa}[w]))$ whenever $v$ and $w$ are both the $i$-th child of their parents. This assumption yields $n(1 + nm + K)$ parameters. Finally, we say the model is level stationary if $P(X(v) \mid X(\mathrm{pa}[v])) = P(X(w) \mid X(\mathrm{pa}[w]))$ and $P(Y(v) \mid X(v)) = P(Y(w) \mid X(w))$ whenever $v$ and $w$ belong to the same level of the tree. In this case the number of parameters is $n + Ln(n + K)$, where $L$ is the maximum allowed height of a tree.
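
As an illustrative check of these counts (only $n = 11$ matches the experiments of Section 4; $m$, $K$, and $L$ below are assumed values, not the paper's): with $n = 11$, $m = 2$, $K = 60$, $L = 5$, full stationarity gives $n(1+n+K) = 792$ parameters, local stationarity gives $n(1+nm+K) = 913$, and level stationarity gives $n + Ln(n+K) = 3916$, so full stationarity is the most parsimonious of the three.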

2.3 Inference and learning

Since the model is a special case of Bayesian network, the two main algorithms (inference and parameter estimation) can be derived as special cases of the corresponding algorithms for Bayesian networks. Inference consists of computing all the conditional probabilities of hidden states, given the evidence entered into the observation nodes (i.e., the labels of the tree). Since the network is singly connected, the inference problem can be solved either by π-λ propagation [18] or by the more general junction tree algorithm [12]. In our experiments we implemented a specialized version of the junction tree algorithm. Given the special structure of the HTMM network (see Figure 1), no moralization is required, and the cliques forming the junction tree are of two kinds: one containing $X(v)$ and $X(\mathrm{pa}[v])$, the other containing $Y(v)$ and $X(v)$. In both cases, only two variables are contained in a clique. It should be remarked that $P(Y)$ is simply obtained as the normalization factor in any clique after propagation in the junction tree. Inference in HTMMs is very efficient and runs in time proportional to the number of nodes in $Y$ and to the number of states in $\mathcal{X}$.
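
Because the network is singly connected, P(Y) can equivalently be computed by a direct upward ("inside") recursion; the sketch below illustrates this under the same assumptions as the earlier sketch (Node class, fully stationary discrete parameters) and omits the numerical scaling a practical implementation would need:

```python
import math

def log_likelihood(root, prior, trans, emit, n_states):
    """Marginal P(Y) = sum_X P(Y, X), computed bottom-up.

    beta(v)[x] = P(observed subtree of v | X(v) = x)
               = P(Y(v) | x) * prod_c sum_xc P(xc | x) * beta(c)[xc]
    The recursion visits each node once, so it is linear in the
    number of nodes of Y.
    """
    def beta(v):
        b = [emit[x][v.label] for x in range(n_states)]
        for c in v.children:
            bc = beta(c)
            for x in range(n_states):
                b[x] *= sum(trans[x][xc] * bc[xc] for xc in range(n_states))
        return b

    b_r = beta(root)
    return math.log(sum(prior[x] * b_r[x] for x in range(n_states)))
```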

Each model $\lambda_i$ is trained using examples belonging to class $c_i$. Learning is formulated in the maximum-likelihood framework, and the presence of hidden variables requires an iterative algorithm for solving the optimization problem. In our implementation we have chosen Expectation-Maximization (EM), a general family of algorithms for maximum-likelihood estimation in problems with missing data [5]. In the case of Bayesian networks (and thus also in the case of HTMMs), it can be shown that the E-step reduces to computing expected sufficient statistics for the parameters (i.e., expected transition and emission counts), and the M-step simply consists of updating the parameters with the normalized expected sufficient statistics (see, e.g., [11] for details).
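
The M-step is then a simple normalization of the expected counts; a minimal sketch, with all names assumed (the expected counts are whatever the E-step inference produces, accumulated over all training trees of one class):

```python
def m_step(exp_prior, exp_trans, exp_emit):
    """Re-estimate HTMM parameters from expected sufficient statistics.

    exp_prior[x], exp_trans[xp][x], exp_emit[x][y] are lists of expected
    counts; each parameter family is re-estimated by normalizing the
    corresponding row of counts.
    """
    def normalize(row):
        z = sum(row)
        return [c / z for c in row] if z > 0 else row

    prior = normalize(exp_prior)
    trans = [normalize(row) for row in exp_trans]
    emit = [normalize(row) for row in exp_emit]
    return prior, trans, emit
```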

3 Dataset preparation

3.1 Dataset

We chose commercial invoices as a test bed for document classification. We collected 889 commercial invoices, issued by 9 different companies. The class of an invoice corresponds to the issuing company. The number of available instances per class in our dataset was rather unbalanced, ranging from 46 to 191. Some instances are shown in Figure 2. Since the dataset contains classified information, the images in Figure 2 have been edited and critical fields (such as logos, addresses, and other elements that could lead to the identification of the involved companies) have been covered by black strips. It can be seen that invoices are complex patterns, composed of many semantic fields separated either by lines or by white spaces. Although the selected invoices contain both graphical and textual information, we made no attempt to extract text with optical character recognition systems. The purpose of our experiments is to demonstrate that document images can be discriminated using layout information only.

Figure 2: An invoice instance for each issuing company is shown. Invoices are highly structured patterns, composed of both graphical and textual fields.

3.2 XY-tree extraction

The XY-tree representation is a well-known approach for describing the physical layout of documents that can preserve relationships among single constituents [17, 16]. The root of an XY-tree is associated with the whole document image. The document is then recursively split into regions that are separated by cut lines. Horizontal and vertical white spaces are used alternately for cutting, so regions at even levels in the tree are cut along the X axis, while regions at odd levels are cut along the Y axis. Since the classic space-based version of the algorithm cannot properly handle lines across the formatted area, our implementation allows both space and line cuts [4]. Line-based splits are rejected if the cut line is shorter than a given fraction of the region size. Moreover, lines always have priority over spaces.

A region is not further divided if neither a horizontal nor a vertical white space can be found, or if the area of the region is smaller than a given resolution. In our experiments this resolution was set to 1% of the total document area, yielding an average tree size of 28 nodes on our dataset. Children are ordered left-to-right for vertical cuts and top-to-bottom for horizontal cuts. Since XY-tree extraction can be strongly affected by salt-and-pepper noise, document images were preprocessed in order to remove small connected components.
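
A minimal sketch of the recursive splitting using space cuts only (the line cuts, the line-over-space priority, and retrying the other cut direction before stopping are omitted; the binary-image layout is an assumption):

```python
import numpy as np

def xy_tree(ink, box, horizontal=True, min_area=0):
    """Recursive XY cuts on a binary page image (True = ink).

    Alternates the cut direction at each level: horizontal=True splits
    the region into horizontal bands at blank (ink-free) rows. Returns
    {'box': (r0, c0, r1, c1), 'children': [...]} with children ordered
    top-to-bottom / left-to-right, as in Section 3.2.
    """
    r0, c0, r1, c1 = box
    region = ink[r0:r1, c0:c1]
    if (r1 - r0) * (c1 - c0) <= min_area or not region.any():
        return {"box": box, "children": []}
    # A row (or column) can be cut through if it contains no ink at all.
    blank = ~region.any(axis=1 if horizontal else 0)
    # Maximal runs of non-blank rows/columns become the sub-regions.
    runs, start = [], None
    for i, b in enumerate(blank):
        if not b and start is None:
            start = i
        elif b and start is not None:
            runs.append((start, i)); start = None
    if start is not None:
        runs.append((start, len(blank)))
    if len(runs) <= 1:   # no white-space cut in this direction: stop
        return {"box": box, "children": []}
    children = [
        xy_tree(ink,
                (r0 + s, c0, r0 + e, c1) if horizontal
                else (r0, c0 + s, r1, c0 + e),
                not horizontal, min_area)
        for s, e in runs
    ]
    return {"box": box, "children": children}
```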

3.3 Local feature extraction

The shape of the XY-tree reflects the document layout, but it may not be discriminant enough for classification. A more informative representation can be obtained by labeling each node with a set of features (or attributes) extracted from the region associated with the node. For each region of the document we used a vector of six features:

1. a flag indicating whether the parent was cut using a line or a space;

2. the four coordinates of the bounding box (in normalized document space coordinates);

3. the average grey level of the region.


Real variables were quantized by running the maximum-entropy discretization algorithm described in [8], obtaining discrete attributes with 10 possible realizations each. Note that since the discretization algorithm collects statistics over the available data, it must be run on the training data only.
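
A hedged sketch of how such a six-component label could be computed for a region produced by the xy_tree sketch above (field names are assumptions, and the mean of a binary ink mask stands in for the average grey level before discretization):

```python
def node_features(ink, box, page_h, page_w, cut_by_line):
    """Six-component label for the region 'box' = (r0, c0, r1, c1)."""
    r0, c0, r1, c1 = box
    region = ink[r0:r1, c0:c1]
    return [
        1.0 if cut_by_line else 0.0,  # 1. parent cut by a line vs. a space
        c0 / page_w, r0 / page_h,     # 2. bounding box, in normalized
        c1 / page_w, r1 / page_h,     #    document coordinates
        float(region.mean()),         # 3. fraction of ink pixels, a proxy
                                      #    for the average grey level
    ]
```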

4 Implementation and results

We compared HTMMs to the document decision tree (DDT) algorithm described in [1]. DDT is an instance-based learner that extends the hierarchical classifier proposed by Dengel [6, 7]. A DDT stores the training documents (represented as XY-trees) in a tree data structure built to perform efficient classification. Each leaf of a DDT contains a training example, whereas each internal node contains the XY-subtrees shared by all the training examples that are dominated by that node. As in traditional decision trees, classification is carried out by descending the tree top-down and performing tests at each internal node. The tests in this case consist of computing similarities between the input XY-tree and the XY-subtrees contained in the children of the internal node. Once a leaf is reached, the learner outputs the class of the training example stored in the leaf.

We used one HTMM, $\lambda_k$, for each class, and each HTMM had 11 states $x_1, \ldots, x_{11}$. A document was classified as belonging to class $k$ if $P(Y \mid \lambda_k) > P(Y \mid \lambda_j)$ for all $j \neq k$. Transitions were restricted by a left-to-right topology, i.e., we forced $P(X(v) = x_i \mid X(\mathrm{pa}[v]) = x_j) = 0$ if $|i - j| > d$. In our experiments we set $d = 2$, which reduces the number of parameters to estimate. Moreover, we used a fully stationary model (see Section 2) as a form of parameter sharing.
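
The banded structure can be imposed at initialization and is then preserved by EM, since transitions with zero probability accumulate zero expected counts; a sketch (the random initialization is an assumption; only n = 11 and d = 2 come from the text):

```python
import numpy as np

def banded_transitions(n=11, d=2, rng=np.random.default_rng(0)):
    """Row-stochastic transition matrix with P(x_i | x_j) = 0 if |i-j| > d.

    Row j holds the distribution over child states given parent state j;
    entries outside the band are zeroed and the rest renormalized.
    """
    mask = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= d
    trans = rng.random((n, n)) * mask
    return trans / trans.sum(axis=1, keepdims=True)
```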

One practical difficulty in estimating classification accuracy is that datasets in the domain of commercial invoices are typically small. Moreover, we are interested in evaluating how accuracy decreases with the number of available training examples. In the case of small datasets, well-known approaches for obtaining good estimates of classification accuracy are leave-one-out and k-fold cross-validation. In our experiments, in order to take advantage of all the available instances and, at the same time, to evaluate accuracy as a function of the training set size, we opted for a modified version of the cross-validation algorithm. The dataset was divided into $M$ groups of instances, and for each $N < M$ we trained the classifier on $N$ groups and tested generalization on the remaining $M - N$ groups. These operations were iterated with $N$ varying between 1 and $M - 1$, and each time all the $\binom{M}{N}$ combinations of groups were used to carry out an experiment on a new training and test set. In this way the total number of single training/test experiments is $2^M - 2$.
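
The scheme can be enumerated directly; the sketch below reproduces the $2^M - 2$ count:

```python
from itertools import combinations

def cv_splits(groups):
    """Yield every (train, test) split of the modified cross-validation:
    each non-empty proper subset of group indices serves once as the
    training set, giving sum_{N=1}^{M-1} C(M, N) = 2^M - 2 experiments.
    """
    M = len(groups)
    for N in range(1, M):
        for train in combinations(range(M), N):
            test = tuple(i for i in range(M) if i not in train)
            yield train, test

assert sum(1 for _ in cv_splits(range(7))) == 2**7 - 2   # 126 experiments
```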

Figure 3: Classification accuracy as a function of the fraction of examples in the training set, for HTMM and DDT, when the available examples are divided into 7 and 10 sets ((a) and (b), respectively).

Two experimental sessions were performed. In the first one we set $M = 7$, while in the second one we set $M = 10$. The results of the first session are shown in Figure 3a and the results of the second one are shown in Figure 3b. HTMMs perform very well on the classification task: using 800 examples (about 90 examples per class) as a training set, the recognition percentage was as high as 99.28% (a 34% relative error reduction compared to the DDT algorithm). Accuracy does not decrease rapidly when a smaller number of examples is used to train the models. The 10-fold results show that when only 20% of the examples are used to train the models (and the remaining 80% of the examples are included in the test set), the accuracy is 95%. DDTs do not perform as well as HTMMs when the training set is large enough to cover the statistical variability of the patterns: if 800 examples are inserted into the DDT, the correct classification rate is 98.26%. On the other hand, when the training set is small, DDTs outperform HTMMs. DDTs can perform well even when the training set is small because they can focus their attention on small portions of the input patterns without trying to model the whole pattern. HTMMs, on the other hand, take into account the entire input pattern. This turns out to be advantageous if enough training data is provided, but may lead to overfitting if the training set is too small.

Figure 4: State activation maps for a trained HTMM. Maps are computed on the document image shown in the upper-left corner.

4.1 Analysis of hidden states

In the case of sequence modeling, HMM states often have an explicit interpretation in terms of temporal events, and the Viterbi algorithm can be used to "decode" these events by computing the most likely state sequence given the observed data. In HTMMs, a similar decoding algorithm can easily be obtained by running the max-calibration procedure [13] after inserting evidence on the observed variables $Y(v)$. This procedure yields the most likely state configuration given the observed XY-tree. By observing the decoded state trees, we noted that several states tend to activate on specific types of image regions. Thus max-calibration can be helpful for performing image segmentation.
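
A hedged sketch of this decoding as a Viterbi-style recursion on the tree (an upward max pass plus a downward backtrace, reusing the Node class and stationary parameters of the earlier sketches in place of the junction-tree max-calibration of [13]):

```python
def decode(root, prior, trans, emit, n_states):
    """Most likely hidden-state assignment x*(v) for the observed tree."""
    back = {}   # node -> {parent state x: best child state per child}
    def up(v):
        # score[x] = max over the subtree's hidden states of
        #            P(subtree observations, subtree states | X(v) = x)
        score = [emit[x][v.label] for x in range(n_states)]
        back[v] = {x: [] for x in range(n_states)}
        for c in v.children:
            sc = up(c)
            for x in range(n_states):
                xc = max(range(n_states), key=lambda k: trans[x][k] * sc[k])
                score[x] *= trans[x][xc] * sc[xc]
                back[v][x].append(xc)
        return score

    s = up(root)
    x_star = {}
    x_r = max(range(n_states), key=lambda x: prior[x] * s[x])
    stack = [(root, x_r)]
    while stack:                      # downward backtrace of the argmaxes
        v, x = stack.pop()
        x_star[v] = x
        for c, xc in zip(v.children, back[v][x]):
            stack.append((c, xc))
    return x_star
```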

In Figure 4 we show the state activation maps for a particular invoice image. The activation map for state $x_i$ is a two-dimensional image, superimposed on the document image, that reveals on which regions state $x_i$ is active. In this context a region $R$ is a rectangular box contained in an XY-tree leaf $u$. By definition of the XY-tree, $R$ is dominated by all the nodes on the path from the root to $R$. The maps were then obtained as follows. First, we ran the max-calibration algorithm, obtaining for each node $v$ the most likely state $x^*(v)$. Then, for each state $x_i$ and for each region $R$, the activity of $x_i$ in $R$ was defined as

$$a(x_i, R) = \left|\{v : x^*(v) = x_i \text{ and } v \text{ dominates } R\}\right|.$$

By inspecting Figure 4 we note how several states specialize to activate on specific invoice fields. For example, state 9 is mostly active on the region containing the customer address, state 2 on the invoice header (logo and information about the issuing company), states 10 and 11 on the purchases list, state 3 on some bottom boxes containing subtotals, and so on. So, besides classification, HTMMs can be used for image segmentation and automatic region labeling. Region labeling may be useful to guide subsequent information extraction algorithms.

Figure 5: Averaged state activation maps. The document image shown on the left is one instance of the class associated with the model. Activation maps are averaged over the test set. Arrows between states indicate the most likely state transitions.

In Figure 5 we show a set of activation maps averaged over all the test-set invoices of a given class. Again, it can be noted how certain states specialize on specific invoice fields. In this example we see that state 2 corresponds to the purchases list. The state activation decreases when descending the sub-region of its competence, reflecting the fact that the probability of having purchased at least $n$ items decreases with $n$, whereas at least one item is always bought. The maps also provide interesting insights into how classification is performed. The arrows in the diagram indicate the highest state transition probabilities after learning. These transitions can be interpreted as a sort of grammar indicating how a valid image should be parsed to show that it actually belongs to the class being modeled.

5 Conclusions

In this paper, we have presented a new document categorization system in which documents are represented by XY-trees. The documents are classified by generalized hidden Markov models that are able to deal with labeled trees. Experiments have been carried out on a dataset of commercial invoices, with very promising results. We have shown that the model learns a high-level semantic task (labeling single fields of the invoices) in order to achieve the final document recognition. This is the result of combining the XY-tree representation (which can effectively isolate and model the semantic relationships among document constituents) with the HTMM capability of learning hierarchical structure in the data.

References

[1] A. Appiani, F. Cesarini, A. Colla, M. Diligenti, M. Gori, S. Marinai, and G. Soda, "Automatic document classification and indexing in high-volume applications," International Journal on Document Analysis and Recognition, vol. 4, no. 2, pp. 69–83, 2002.

[2] Y. Bengio and P. Frasconi, "An input output HMM architecture," in Advances in Neural Information Processing Systems 7 (G. Tesauro, D. Touretzky, and T. Leen, eds.), pp. 427–434, The MIT Press, 1995.

[3] R. Brugger, A. Zramdini, and R. Ingold, "Modeling documents for structure recognition using generalized n-grams," in Proceedings of ICDAR, 1997.

[4] F. Cesarini, M. Gori, S. Marinai, and G. Soda, "Structured document segmentation and representation by the modified X-Y tree," in Proceedings of the International Conference on Document Analysis and Recognition, pp. 563–566, 1999.

[5] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society B, vol. 39, pp. 1–38, 1977.

[6] A. Dengel, "Initial learning of document structure," in Proceedings of ICDAR, pp. 86–90, 1993.

[7] A. Dengel and F. Dubiel, "Clustering and classification of document structure: A machine learning approach," in Proceedings of ICDAR, pp. 587–591, 1995.

[8] U. M. Fayyad and K. B. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," in Proc. 13th Int. Joint Conf. on Artificial Intelligence, pp. 1022–1027, Morgan Kaufmann, 1993.

[9] P. Frasconi, M. Gori, and A. Sperduti, "A general framework for adaptive processing of data structures," IEEE Trans. on Neural Networks, vol. 9, no. 5, pp. 768–786, 1998.

[10] R. C. Gonzalez and M. G. Thomason, Syntactic Pattern Recognition. Reading, Massachusetts: Addison Wesley, 1978.

[11] D. Heckerman, "Bayesian networks for data mining," Data Mining and Knowledge Discovery, vol. 1, no. 1, pp. 79–120, 1997.

[12] F. V. Jensen, S. L. Lauritzen, and K. G. Olesen, "Bayesian updating in recursive graphical models by local computations," Computational Statistical Quarterly, vol. 4, pp. 269–282, 1990.

[13] F. Jensen, An Introduction to Bayesian Networks. New York: Springer Verlag, 1996.

[14] M. I. Jordan, Z. Ghahramani, and L. K. Saul, "Hidden Markov decision trees," in Advances in Neural Information Processing Systems (M. C. Mozer, M. I. Jordan, and T. Petsche, eds.), p. 501, The MIT Press, 1997.

[15] H. Lucke, "Bayesian belief networks as a tool for stochastic parsing," Speech Communication, vol. 16, pp. 89–118, 1995.

[16] G. Nagy and M. Viswanathan, "Dual representation of segmented and technical documents," in Proceedings of the First International Conference on Document Analysis and Recognition, pp. 141–151, 1991.

[17] G. Nagy and S. Seth, "Hierarchical representation of optically scanned documents," in Proc. Int. Conf. on Pattern Recognition, pp. 347–349, 1984.

[18] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[19] C. Shin and D. Doermann, "Classification of document page images based on visual similarity of layout structures," in Proc. SPIE Document Recognition and Retrieval VII, vol. 3967, pp. 182–190, 2000.

[20] C. Shin, D. Doermann, and A. Rosenfeld, "Classification of document pages using structure-based features," International Journal on Document Analysis and Recognition, vol. 3, no. 4, pp. 232–247, 2001.

[21] P. Smyth, D. Heckerman, and M. I. Jordan, "Probabilistic independence networks for hidden Markov probability models," Neural Computation, vol. 9, no. 2, pp. 227–269, 1997.