Document Analysis: Structure Recognition
Prof. Rolf Ingold, University of Fribourg
Master course, spring semester 2008
© Prof. Rolf Ingold
Outline
Objectives
Physical and logical structures
Examples of applications
Methodologies for structure recognition
Microstructures vs. macrostructures
Model driven approaches
Interactive Systems
Importance of document structures
Document = Content + Structures
Structures convey abstract high level information
Structures are revealed by styles
Applications of document structure recognition
Information extraction
  form analysis (check readers, ...)
  business applications: mail distribution, invoice processing, ...
  analysis of museum & library notices
  analysis of bibliographical references
Document mining, content analysis
  business reports
  legal documents
  scientific publications
Intelligent indexing
  laws
  magazine & newspaper
Document restyling
  teaching material
  ...
Extended Processing Chain
[Figure: processing chain with the labels Image, Preprocessing, Segmentation, Blocks, OCR, Simple Text, OFR, Fonts, Layout analysis, Logical labeling, Postanalysis, Struct. Document]
Physical document structures
Reveal the publisher's view
Composed of a hierarchy of physical entities
  text blocks, text lines and tokens
  graphical primitives
Universal, i.e. independent of the document class
[Figure: tree of a physical structure with nodes labeled document, region, block, hr and frm]
Illustration of physical document structure
from A. Belaïd
Illustration of logical document structure
Logical structures
Reflect the author’s mind
Independent of presentation
  can be mapped on various physical structures
Composed of application dependent logical entities
Specific to the application and document class
[Figure: tree of a logical structure with nodes labeled document, article, hdln, title, author, p and link]
Relation between logical and physical structure
There is no 1:1 relation between physical and logical structure
There are some correspondences between the two, as shown below
[Figure: correspondences between physical and logical structures]
Role of style sheets
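[Figure: the stylesheet relates the logical structure to the physical structure; formatting leads from the logical to the physical structure and analysis leads back; further labels: edit, print, display]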
Document formatting is straightforward ...
But document analysis is a non-trivial task that generally cannot be fully automated
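The asymmetry between formatting and analysis can be sketched in a few lines. This is an illustrative sketch only: the stylesheet entries and the names `STYLESHEET`, `format_element` and `candidate_labels` are assumptions made for this example and do not come from the lecture. Formatting is a plain lookup, whereas analysis has to invert the stylesheet and may find several logical labels for one observed style.

```python
# Hypothetical stylesheet: logical element -> physical style
# (font family, size in pt, weight). All entries are invented for illustration.
STYLESHEET = {
    "title": ("Times", 14, "bold"),
    "author": ("Times", 11, "italic"),
    "heading": ("Times", 11, "bold"),
    "caption": ("Times", 11, "bold"),  # deliberately shares its style with "heading"
    "paragraph": ("Times", 11, "regular"),
}


def format_element(label):
    """Formatting: a direct lookup from logical label to physical style."""
    return STYLESHEET[label]


def candidate_labels(style):
    """Analysis: invert the stylesheet; several labels may share one style."""
    return sorted(label for label, s in STYLESHEET.items() if s == style)


if __name__ == "__main__":
    print(format_element("title"))
    print(candidate_labels(("Times", 11, "bold")))
```

When two logical elements share a style, the observed style alone cannot decide between them, which is one reason why analysis needs additional context.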
Methodologies
Document structural analysis can be
  data-driven: the recognition task is based on image analysis
  model-driven: the recognition task is ...
Methods of structural document analysis can be classified into
  geometrical approaches
  syntactic approaches based on formal grammars
  structural approaches based on graphs
  rule-based approaches
  expert systems (artificial intelligence)
  machine learning
Syntactic Document Recognition [Ingold89]
Fully model-driven approach
Formal document description language
  attributed grammar translated into an analysis graph
Top-down matching algorithm with backtracking for macro-structure as well as micro-structure recognition
Very generic approach
Sensitive to noise (no error recovery)
Theoretically exponential complexity
Document Description Language [Ingold89]
Document class specific formal description composed of
  composition rules (context-free grammar)
  typographical rules (attributes)
Act:DOC => ActNumber ActContent FootNotes Headings ;
ActNumber:FRG => {Number $ Period} ;
ActContent:PRT => ActTitle ActDate Otgan {Provis} Formul {Chapter} [Validity] ;
...
Chapter:PRT => ChTitle ({Section} | {Article}) ;
ChTitle.zone = Inherited
ChTitle.alignment = (Allowed, Centered, 0pt, 0pt, Undefined) ;
ChTitle.lineHeight = 11pt ;
ChTitle.spaceBefore = (Allowed, [6pt, 60pt]) ;
ChTitle.interSpace = (Forbidden, [2pt, 3pt]) ;
ChTitle.font = (Times, 11pt, Bold, Roman) ;
Article.spaceBefore = <FST: (Forbidden, [6pt, 30pt]), NXT: (Allowed, [6pt, 30pt])> ;
...
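A typographical rule such as `ChTitle.spaceBefore = (Allowed, [6pt, 60pt])` can be read as a constraint that a candidate text block has to satisfy. The Python sketch below is one possible reading and not the implementation of [Ingold89]: the classes `Block` and `TypoRule`, the inclusive interpretation of the interval, and the sample measurements are assumptions for illustration. The Allowed/Forbidden flags are left out because their semantics are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Measured attributes of a text block (hypothetical structure)."""
    font: tuple          # (family, size in pt, weight, slant)
    line_height: float   # in pt
    space_before: float  # in pt


@dataclass(frozen=True)
class TypoRule:
    """Subset of the ChTitle attributes; flags such as Allowed are omitted."""
    font: tuple
    line_height: float
    space_before: tuple  # (min, max) in pt, read here as an inclusive interval

    def accepts(self, block: Block) -> bool:
        """Crisp test: every attribute must match exactly or lie in range."""
        lo, hi = self.space_before
        return (
            block.font == self.font
            and block.line_height == self.line_height
            and lo <= block.space_before <= hi
        )


# Values copied from the ChTitle rules, units dropped.
CH_TITLE = TypoRule(
    font=("Times", 11, "Bold", "Roman"),
    line_height=11,
    space_before=(6, 60),
)

if __name__ == "__main__":
    candidate = Block(("Times", 11, "Bold", "Roman"), 11, 24)
    print(CH_TITLE.accepts(candidate))
```

Because the test is crisp, a single misrecognized attribute (for example a font weight wrongly reported by OFR) rejects the block, which illustrates why such matching is sensitive to noise.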
Analysis graph [Ingold89]
Analysis graph for syntactic analysis where each node has two links
  successor (in case of successful match)
  alternative (in case of unsuccessful match)
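The two-link structure lends itself to a compact recursive matcher. The following Python sketch shows one way such a graph could be traversed with backtracking; the `Node` class, the token names and the tiny example graph are assumptions for illustration and are not taken from [Ingold89].

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a hypothetical analysis graph."""
    expected: str                         # token type this node tries to match
    successor: Optional["Node"] = None    # followed after a successful match
    alternative: Optional["Node"] = None  # tried after an unsuccessful match


def match(node, tokens, pos=0):
    """Return True if the tokens from `pos` on are accepted starting at `node`.

    Running off the end of a successor chain (node is None) succeeds only
    if every token has been consumed.
    """
    if node is None:
        return pos == len(tokens)
    if pos < len(tokens) and tokens[pos] == node.expected:
        if match(node.successor, tokens, pos + 1):
            return True
        # The local match succeeded but the rest failed: backtrack.
    if node.alternative is not None:
        return match(node.alternative, tokens, pos)
    return False


# Graph for "title section | title article" (invented example).
GRAPH = Node(
    "title",
    successor=Node("section"),
    alternative=Node("title", successor=Node("article")),
)

if __name__ == "__main__":
    print(match(GRAPH, ["title", "section"]))
    print(match(GRAPH, ["title", "article"]))  # needs backtracking
    print(match(GRAPH, ["title", "figure"]))
```

In the second call the first branch matches `title`, fails on `section`, and the matcher returns to the alternative link. Repeating this at every node is what makes the worst case exponential.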
Fuzzy document structure recognition [Hu94]
The previous approach has been adapted to be less sensitive to matching errors
  matching uses fuzzy logic
Fuzzy document structure recognition [Hu94]
Pattern matching uses fuzzy logic
Parsing is expressed as a cost function to be optimized
  finding the shortest path in a graph (solved by linear programming)
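The idea of turning fuzzy matching into a shortest-path problem can be sketched as follows. Note the swap: the slide states that [Hu94] solves the problem by linear programming, whereas this sketch uses Dijkstra's algorithm because it is short and self-contained. The trapezoidal membership functions, their parameters in `LABELS`, and the edge cost `1 - membership` are assumptions for illustration.

```python
import heapq
from itertools import count


def trapezoid(x, a, b, c, d):
    """Fuzzy membership: 0 outside (a, d), 1 on [b, c], linear in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical fuzzy rules on the space before a block, in pt.
LABELS = {
    "ChTitle": (4.0, 6.0, 60.0, 62.0),
    "Paragraph": (0.0, 1.0, 3.0, 8.0),
}


def build_graph(space_before):
    """Layered graph: one layer per block, one node per candidate label."""
    graph = {}
    previous = ["start"]
    for i, x in enumerate(space_before):
        current = []
        for label, params in LABELS.items():
            node = f"{i}:{label}"
            current.append(node)
            cost = 1.0 - trapezoid(x, *params)  # poor match = expensive edge
            for p in previous:
                graph.setdefault(p, []).append((node, cost))
        previous = current
    for p in previous:
        graph.setdefault(p, []).append(("end", 0.0))
    return graph


def shortest_path(graph, source, target):
    """Dijkstra on {node: [(neighbor, cost), ...]} with non-negative costs."""
    tie = count()  # tie-breaker so the heap never has to compare paths
    queue = [(0.0, next(tie), source, [source])]
    done = set()
    while queue:
        cost, _, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for neighbor, edge_cost in graph.get(node, []):
            if neighbor not in done:
                heapq.heappush(
                    queue, (cost + edge_cost, next(tie), neighbor, path + [neighbor])
                )
    return float("inf"), []


if __name__ == "__main__":
    print(shortest_path(build_graph([24.0, 5.0]), "start", "end"))
```

A block whose measurement falls slightly outside a rule's ideal range is no longer rejected; it only makes the corresponding path more expensive.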
Graphein: Blackboard approach [Chenevoy92]
Model of Graphein [Chenevoy92]
Complex Layout Analysis [Azokly95]
Modeling of Scientific Journals [Azokly95]
Model for a Scientific Journal
<volume name="article" width="160" height="240">
  <page name="first"> ... </page>
  <page name="even">
    <hsep name="hs1" bloc="4 3 LEFT RIGHT" type="BLANK"/>
    <layer name="principle">
      <vsep name="vs1" bloc="40 65 TOP hs1" type="BLANK"/>
      <vsep name="vs2" bloc="[50,60] 4 hs1 BOTTOM" type="BLANK"/>
      <region name="center" bloc="vs2 RIGHT hs1 BOTTOM" content="ANY"/>
      <region name="margin" bloc="LEFT vs2 hs1 BOTTOM" content="TEXT"/>
      ...
    </layer>
    <layer name="secondary">
      <hsep name="hs2" bloc="[10,220] 2 LEFT RIGHT" type="BLANK">
        <subst value="hs1"/>
      </hsep>
      <hsep name="hs3" bloc="[10,220] 2 LEFT RIGHT" type="BLANK">
        <subst value="BOTTOM"/>
      </hsep>
      <region name="figure" bloc="LEFT RIGHT hs2 hs3" content="FIGS"/>
    </layer>
  </page>
  ...
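A model written this way can be loaded with a standard XML parser. The Python sketch below reads a trimmed, well-formed fragment of the model and splits each `bloc` attribute into fields. The interpretation is an assumption, since the slide does not define the attribute: `[a,b]` is read as a numeric range, a bare number as an exact value, and any other token as a reference to a page edge or a named separator. The function names `parse_bloc` and `regions` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed fragment of the journal model, without the elided parts.
MODEL = """
<volume name="article" width="160" height="240">
  <page name="even">
    <hsep name="hs1" bloc="4 3 LEFT RIGHT" type="BLANK"/>
    <layer name="principle">
      <vsep name="vs2" bloc="[50,60] 4 hs1 BOTTOM" type="BLANK"/>
      <region name="center" bloc="vs2 RIGHT hs1 BOTTOM" content="ANY"/>
      <region name="margin" bloc="LEFT vs2 hs1 BOTTOM" content="TEXT"/>
    </layer>
  </page>
</volume>
"""


def parse_bloc(value):
    """Split a bloc attribute into fields (assumed semantics, see lead-in)."""
    fields = []
    for token in value.split():
        m = re.fullmatch(r"\[(\d+),(\d+)\]", token)
        if m:
            fields.append((int(m.group(1)), int(m.group(2))))  # range
        elif token.isdigit():
            fields.append(int(token))                          # exact value
        else:
            fields.append(token)                               # reference
    return fields


def regions(xml_text):
    """Map each region name to its content type and parsed bloc fields."""
    root = ET.fromstring(xml_text)
    return {
        r.get("name"): (r.get("content"), parse_bloc(r.get("bloc")))
        for r in root.iter("region")
    }


if __name__ == "__main__":
    for name, (content, bloc) in regions(MODEL).items():
        print(name, content, bloc)
```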
Use of Document Recognition Models
There is no universal approach!
Document recognition systems must be tuned
  for specific applications
  for specific document classes
Contextual information is required
Models provide information like
  generic document structures (DTD or XML-schema)
  geometrical and typographical attributes (style information)
  semantic information (keywords, dictionaries, databases, ...)
  statistical information
Content of document models
Generic structure
  Document Type Definition (DTD) or XML-schema
Style information
  Absolute or relative positioning
  Typographical attributes & formatting rules
Semantics (if available)
  Linguistic information, keywords
  Application specific ontology
Probabilistic information
  Frequencies of items or sequences, co-occurrences
Trouble with document models
Document models are hard to produce and to maintain
  implicit models (hard coded in the application) => hard to modify, adapt, extend
  explicit models, written in a formal language => cumbersome to produce, need high expertise
  abstract models, learned automatically => need a lot of training data (with ground-truth!)
Need for more flexible tools:
  assisted environments with friendly user interfaces
  recognition improving with use
  models are learned incrementally
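What "recognition improving with use" could look like is sketched below with a deliberately simple learner. This is not one of the systems presented in the lecture: the 1-nearest-neighbour classifier, the two features (font size, bold flag) and the sample page are assumptions chosen to keep the example short. Each manual correction is stored as a new example, so later blocks of the same kind are labeled without user intervention.

```python
class IncrementalLabeler:
    """1-nearest-neighbour labeler that learns from each user correction."""

    def __init__(self):
        self.examples = []  # list of (feature tuple, label)

    def predict(self, features):
        """Return the label of the closest stored example, or None if empty."""
        if not self.examples:
            return None

        def squared_distance(example):
            return sum((a - b) ** 2 for a, b in zip(example[0], features))

        return min(self.examples, key=squared_distance)[1]

    def correct(self, features, label):
        """Store a user-supplied label as a new training example."""
        self.examples.append((features, label))


def label_with_user(labeler, blocks):
    """Label blocks in order; return the number of manual corrections."""
    clicks = 0
    for features, truth in blocks:
        if labeler.predict(features) != truth:
            labeler.correct(features, truth)  # simulated user click
            clicks += 1
    return clicks


# Hypothetical page: (font size in pt, bold flag) with the true label.
PAGE = [
    ((14, 1), "title"),
    ((11, 0), "paragraph"),
    ((11, 0), "paragraph"),
    ((11, 1), "heading"),
    ((11, 0), "paragraph"),
]

if __name__ == "__main__":
    labeler = IncrementalLabeler()
    print(label_with_user(labeler, PAGE))  # corrections on the first page
    print(label_with_user(labeler, PAGE))  # corrections on a second, similar page
```

No ground-truthed training set is needed up front; the model is built from the corrections the user makes anyway.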
Pattern Based Document Understanding [Robaday03]
Configurations consist of
  Set of vertices
    Labeled (type)
    Attributed (pos, typo, ...)
  Edges between vertices
    Labeled (neighborhood relation)
    Attributed (geom, ...)
Model consists of
  Extraction rules
  For each class
    Attribute selector
    List of patterns
[Figure: diagram with the labels document image, extraction, configuration, classification, model, rules, patt., selector, id]
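A configuration as described above is an attributed, labeled graph. The Python sketch below shows one possible in-memory representation together with a single, invented extraction rule that links vertically consecutive blocks. The class names, the `above` relation, the `gap` attribute and the input format are assumptions for illustration, not the data model of [Robaday03].

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    ident: int
    type: str                                       # label, e.g. "text"
    attributes: dict = field(default_factory=dict)  # pos, typo, ...


@dataclass
class Edge:
    source: int
    target: int
    relation: str                                   # neighborhood relation
    attributes: dict = field(default_factory=dict)  # geom, ...


@dataclass
class Configuration:
    vertices: dict = field(default_factory=dict)    # ident -> Vertex
    edges: list = field(default_factory=list)

    def neighbours(self, ident, relation):
        """Identifiers reachable from `ident` over edges with this relation."""
        return [
            e.target
            for e in self.edges
            if e.source == ident and e.relation == relation
        ]


def extract(blocks):
    """Invented extraction rule: link vertically consecutive blocks.

    `blocks` is a list of dicts with the keys "type", "top" and "bottom".
    """
    cfg = Configuration()
    ordered = sorted(blocks, key=lambda b: b["top"])
    for i, b in enumerate(ordered):
        cfg.vertices[i] = Vertex(i, b["type"], {"top": b["top"], "bottom": b["bottom"]})
    for i in range(len(ordered) - 1):
        gap = ordered[i + 1]["top"] - ordered[i]["bottom"]
        cfg.edges.append(Edge(i, i + 1, "above", {"gap": gap}))
    return cfg


if __name__ == "__main__":
    cfg = extract([
        {"type": "text", "top": 16, "bottom": 40},
        {"type": "text", "top": 0, "bottom": 10},
        {"type": "image", "top": 50, "bottom": 90},
    ])
    print(cfg.neighbours(0, "above"), cfg.edges[0].attributes)
```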
Evolution of 2-CREM performance
[Figure: plot of the improvement of correct labeling as a function of clicks used for correcting labels manually]
Conclusion
Structure recognition of documents is still an open issue
Solutions exist for specialized applications
Generic approaches are not mature
  models are hard to establish
  training data is missing
As an alternative: interactive systems with incremental model adaptation