. . . . . .
Configurable Table Structure Recognition in Untagged PDF Documents¹
Alexey Shigarov¹
Andrey Mikhailov¹
Andrey Altaev¹
[email protected]
¹ Matrosov Institute for System Dynamics and Control Theory, Siberian Branch of the Russian Academy of Sciences
16th ACM Symposium on Document Engineering
September 15, 2016, Vienna, Austria
¹ This work was financially supported by the Russian Foundation for Basic Research (grant 15-37-20042)
. . . . . .
Introduction
- Nganji² estimates that 95.5% of scientific articles published by four leading publishers are untagged PDF documents
- “Untagged” means the file marks up no tables or cells; it contains only printing instructions for text chunks and graphics
- So PDF table extraction is a challenging task
- Academic and commercial tools for it continue to appear and compete
- Our work is motivated by
  - defining a configurable part (parameters and ad-hoc heuristics) of the table structure recognition process
  - examining how text printing instructions appear in PDF files, in order to recover the human reading order
  - reaching high accuracy on the existing competition dataset
² J. Nganji, The Portable Document Format (PDF) accessibility practice of four journal publishers. Library & Information Science Research, 2015, 37(3), 254-262.
. . . . . .
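Table Structure Recognition: Preprocessing
[Figure: the sample line "A table is a collection of related data" as original text chunks (a), one-character chunks (b), and word chunks (c); a rectangle (d) and its four rulings (e); segments of one visual line (f) and the merged ruling (g)]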
- Splitting the original text chunks (a) into one-character chunks (b)
- Merging one-character chunks into word chunks and reindexing their order of appearance (c)
- Splitting each rectangle (d) into four rulings (e)
- Merging segments of one visual line (f) into one ruling (g)
- Heuristics: eliminating text chunks that contain only itemization or padding characters
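The ruling steps (d) to (g) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Segment` tuple layout, the function names, and the tolerance `tol` are assumptions made for this example, and only horizontal rulings are merged.

```python
from typing import List, Tuple

# Assumed representation of a ruling: (x1, y1, x2, y2)
Segment = Tuple[float, float, float, float]


def rect_to_rulings(x: float, y: float, w: float, h: float) -> List[Segment]:
    """Split a rectangle into its four border segments."""
    return [
        (x, y, x + w, y),          # first horizontal edge
        (x, y + h, x + w, y + h),  # second horizontal edge
        (x, y, x, y + h),          # left edge
        (x + w, y, x + w, y + h),  # right edge
    ]


def merge_horizontal(segments: List[Segment], tol: float = 1.0) -> List[Segment]:
    """Merge horizontal segments that lie on one visual line and touch or overlap.

    `tol` is a hypothetical tolerance (in page units) for both the line
    position and the allowed break between neighbouring segments.
    """
    horiz = [(y1, min(x1, x2), max(x1, x2))
             for x1, y1, x2, y2 in segments if abs(y1 - y2) <= tol]
    horiz.sort()
    merged: List[List[float]] = []  # entries are [y, x_start, x_end]
    for y, xs, xe in horiz:
        last = merged[-1] if merged else None
        if last is not None and abs(y - last[0]) <= tol and xs <= last[2] + tol:
            last[1] = min(last[1], xs)
            last[2] = max(last[2], xe)
        else:
            merged.append([y, xs, xe])
    return [(xs, y, xe, y) for y, xs, xe in merged]


# Two adjacent cell rectangles share one top line and one bottom line.
rulings = rect_to_rulings(0, 0, 10, 5) + rect_to_rulings(10, 0, 10, 5)
long_rulings = merge_horizontal(rulings)
```

Vertical rulings could be handled the same way with the roles of x and y swapped.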
. . . . . .
Table Structure Recognition: Text Block Recovering
[Figure: (a) word chunks of a sample table, numbered 1-18; (b) the same table as text blocks. Header cells: "Fiscal year", "R&D expenditures (bn yen)", "GDP2) (bn yen)", "Ratio of R&D expenditures to GDP"; data row: "1996", "a) 15.079", "506.480", "a) 2.98"]
- Merging word chunks (a) into text blocks (b)
- In the best case, each block is the textual content of one cell
- Heuristics: adjacency in the order of appearance, no rulings between chunks, identical fonts, word and line spacing, vertical and horizontal projections
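A rough Python sketch of how such heuristics might combine is given below. It is an illustration under stated assumptions, not the method from the talk: the `Chunk` class, the thresholds `max_word_gap` and `max_line_gap`, and their default values are hypothetical, and the "no rulings" check is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    x0: float  # left
    y0: float  # top (y grows downward in this sketch)
    x1: float  # right
    y1: float  # bottom
    font: str
    order: int  # position in the order of appearance


def overlaps(a0: float, a1: float, b0: float, b1: float) -> bool:
    """True if the 1-D projections [a0, a1] and [b0, b1] overlap."""
    return min(a1, b1) - max(a0, b0) > 0


def belongs(block: List[Chunk], c: Chunk,
            max_word_gap: float, max_line_gap: float) -> bool:
    """Decide whether word chunk `c` continues the text block `block`."""
    last = block[-1]
    # Adjacency in the order of appearance and identical fonts.
    if c.order != last.order + 1 or c.font != last.font:
        return False
    # Same text line: vertical projections overlap, word spacing is small.
    if overlaps(last.y0, last.y1, c.y0, c.y1):
        return 0 <= c.x0 - last.x1 <= max_word_gap
    # Next text line: horizontal projections overlap, line spacing is small.
    left = min(k.x0 for k in block)
    right = max(k.x1 for k in block)
    bottom = max(k.y1 for k in block)
    return overlaps(left, right, c.x0, c.x1) and 0 <= c.y0 - bottom <= max_line_gap


def merge_into_blocks(chunks: List[Chunk], max_word_gap: float = 4.0,
                      max_line_gap: float = 6.0) -> List[List[Chunk]]:
    blocks: List[List[Chunk]] = []
    for c in sorted(chunks, key=lambda k: k.order):
        if blocks and belongs(blocks[-1], c, max_word_gap, max_line_gap):
            blocks[-1].append(c)
        else:
            blocks.append([c])
    return blocks


# Made-up coordinates for two header cells of the sample table.
words = [
    Chunk("Fiscal", 0, 0, 30, 10, "F1", 0),
    Chunk("year", 0, 12, 22, 22, "F1", 1),
    Chunk("R&D", 60, 0, 80, 10, "F1", 2),
    Chunk("expenditures", 60, 12, 120, 22, "F1", 3),
]
blocks = merge_into_blocks(words)
block_texts = [" ".join(c.text for c in b) for b in blocks]
```

With these made-up coordinates the four words fall into two blocks, one per header cell.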
. . . . . .
Table Structure Recognition: Cell Recovering
[Figure: panels (a) and (b), titled "Bounding boxes of text blocks" and "Whitespace gaps"; axes x and y; text blocks labelled b_i]
- There are two ways to arrange text blocks into cells:
  - analysis of whitespace gaps between text blocks
  - analysis of connected text blocks (bounding boxes)
- Heuristics: a column containing only one non-empty cell is merged with the nearest column to its left
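The whitespace-gap idea can be sketched in Python as a projection of the block bounding boxes onto each axis. This is a simplified illustration, not the authors' algorithm: the `Box` layout, the function names, and the `min_gap` threshold are assumptions.

```python
from typing import List, Tuple

# Assumed bounding box layout: (x0, y0, x1, y1)
Box = Tuple[float, float, float, float]
Interval = Tuple[float, float]


def whitespace_gaps(intervals: List[Interval], min_gap: float = 0.0) -> List[Interval]:
    """Return the gaps wider than `min_gap` that no interval covers."""
    gaps: List[Interval] = []
    reach = None  # right end of the area covered so far
    for start, end in sorted(intervals):
        if reach is not None and start - reach > min_gap:
            gaps.append((reach, start))
        reach = end if reach is None else max(reach, end)
    return gaps


def column_gaps(boxes: List[Box], min_gap: float = 2.0) -> List[Interval]:
    """Vertical whitespace gaps: candidates for column separators."""
    return whitespace_gaps([(b[0], b[2]) for b in boxes], min_gap)


def row_gaps(boxes: List[Box], min_gap: float = 2.0) -> List[Interval]:
    """Horizontal whitespace gaps: candidates for row separators."""
    return whitespace_gaps([(b[1], b[3]) for b in boxes], min_gap)


# Four made-up text blocks arranged as a 2 x 2 table.
boxes = [(0, 0, 30, 10), (50, 0, 90, 10), (0, 20, 25, 30), (55, 20, 80, 30)]
cols = column_gaps(boxes)
rows = row_gaps(boxes)
```

Each gap then splits the table area into columns or rows, and a cell is the intersection of one column and one row. A full-page projection like this cannot recover spanning cells, which is one reason a bounding-box analysis may be preferable.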
. . . . . .
Configuring
[Figure: F-score as a function of the width factor kw and the height factor kh]
- Defining formulas to set up word and line spacing, as well as vertical and horizontal projections
- Choosing predefined ad-hoc heuristics
- Searching for “optimal” parameters on a target dataset
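One simple way to search for such parameters is an exhaustive grid search over the width factor kw and the height factor kh. The Python sketch below assumes this strategy; the slides do not say which search was used. The `evaluate` callback is hypothetical: in practice it would run recognition on the target dataset and return the F-score, while `toy_f_score` is only a made-up stand-in so that the example runs, and carries no real results.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def grid_search(evaluate: Callable[[float, float], float],
                kw_values: Iterable[float],
                kh_values: Iterable[float]) -> Tuple[float, float, float]:
    """Return (best_score, kw, kh) over all combinations of the two factors."""
    candidates = ((evaluate(kw, kh), kw, kh)
                  for kw, kh in product(kw_values, kh_values))
    # Compare by score only; the first best combination wins ties.
    return max(candidates, key=lambda t: t[0])


def toy_f_score(kw: float, kh: float) -> float:
    """Placeholder objective with a peak at kw=2, kh=3 (not real data)."""
    return 1.0 - 0.01 * ((kw - 2) ** 2 + (kh - 3) ** 2)


best_score, best_kw, best_kh = grid_search(toy_f_score, [1, 2, 3, 4], [1, 2, 3, 4])
```

Because the parameters are tuned on the target dataset itself, the resulting score describes that dataset and may not carry over to others.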
. . . . . .
Experimental Evaluation
- Two configurations were implemented. The better one achieves:
    Recall     0.9233
    Precision  0.9499
    F-score    0.9364
- The evaluation is based on
  - the methodology for evaluating algorithms for table understanding in PDF documents³
  - the “ICDAR 2013 Table Competition” dataset⁴
  - Nurminen’s Python scripts⁵ for comparing ground truth and results
³ M. Göbel, T. Hassan, E. Oro, G. Orsi. A methodology for evaluating algorithms for table understanding in PDF documents. In Proc. of DocEng’12, 2012, pp. 45-48.
⁴ M. Göbel, T. Hassan, E. Oro, G. Orsi. ICDAR 2013 Table Competition. In Proc. of the 12th ICDAR, Washington, DC, 2013, pp. 1449-1453.
⁵ http://tamirhassan.com/competition/dataset-tools.html
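The reported figures are consistent with the F-score being the harmonic mean of precision and recall (the usual F1 measure), which is assumed here. A short Python check:

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on this slide.
reported = round(f_score(0.9499, 0.9233), 4)
print(reported)  # 0.9364
```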
. . . . . .
Web-Application for PDF Table Extraction
- Our experimental web application is available at http://cells.icc.ru/pdfte
- Currently, tables must be selected manually, but their structure is recognized automatically
- Extracted tables are available in HTML and Excel formats
- They can then be transformed into a relational form through http://cells.icc.ru/ssdc