
Configurable Table Structure Recognition in Untagged PDF Documents



Configurable Table Structure Recognition in Untagged PDF Documents¹

Alexey Shigarov¹
[email protected]

Andrey Mikhailov¹
[email protected]

Andrey Altaev¹
[email protected]

¹ Matrosov Institute for System Dynamics and Control Theory, Siberian Branch of the Russian Academy of Sciences

16th ACM Symposium on Document Engineering
September 15, 2016, Vienna, Austria

¹ This work was financially supported by the Russian Foundation for Basic Research (grant 15-37-20042)


Introduction

- Nganji² estimates that 95.5% of scientific articles published by four leading publishers are untagged PDF documents

- “Untagged” means the file contains no table or cell markup, only printing instructions for text chunks and graphics

- This makes PDF table extraction a challenging task

- New academic and commercial tools continue to appear and compete

- Our work is motivated by
  - defining a configurable part (parameters and ad-hoc heuristics) in the process of table structure recognition
  - examining how text printing instructions appear in PDF files in order to recover the human reading order
  - reaching high accuracy on the existing competition dataset

² J. Nganji, The Portable Document Format (PDF) accessibility practice of four journal publishers. Library & Information Science Research, 2015, 37(3), 254-262.


Table Structure Recognition: Preprocessing

[Figure: the sample text “A table is a collection of related data” shown as original text chunks (a), one-character chunks (b), and word chunks (c); a rectangle (d) split into rulings (e); segments of one visual line (f) merged into one ruling (g)]

- Splitting original text chunks (a) into one-character chunks (b)

- Merging one-character chunks into word chunks and reindexing the order of their appearance (c)

- Splitting each rectangle (d) into four rulings (e)

- Merging segments of one visual line (f) into one ruling (g)

- Heuristics: eliminating text chunks that contain only itemization or padding characters
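The two ruling steps, (d) to (e) and (f) to (g), can be illustrated with a minimal Python sketch. The `Ruling` class, the function names, and the tolerance `tol` are hypothetical and not taken from the presented system. The sketch assumes axis-aligned segments with `x1 <= x2` and handles only horizontal rulings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ruling:
    """An axis-aligned line segment from (x1, y1) to (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float


def split_rectangle(left, top, right, bottom):
    """Split one rectangle into its four border rulings."""
    return [
        Ruling(left, top, right, top),        # top edge
        Ruling(left, bottom, right, bottom),  # bottom edge
        Ruling(left, top, left, bottom),      # left edge
        Ruling(right, top, right, bottom),    # right edge
    ]


def merge_horizontal(segments, tol=1.0):
    """Merge horizontal segments that belong to one visual line.

    Assumption: segments whose y falls into the same bucket of size `tol`
    lie on one line, and they are joined when the gap between them along x
    is at most `tol`.
    """
    def line_key(seg):
        return round(seg.y1 / tol)

    merged = []
    for seg in sorted(segments, key=lambda s: (line_key(s), s.x1)):
        last = merged[-1] if merged else None
        if (last is not None
                and line_key(last) == line_key(seg)
                and seg.x1 - last.x2 <= tol):
            # Extend the previous ruling instead of starting a new one.
            merged[-1] = Ruling(last.x1, last.y1, max(last.x2, seg.x2), last.y1)
        else:
            merged.append(seg)
    return merged


segments = [
    Ruling(0, 10, 5, 10),
    Ruling(5, 10, 20, 10),
    Ruling(40, 10, 50, 10),
    Ruling(0, 30, 10, 30),
]
rulings = merge_horizontal(segments)
```

Vertical rulings could be handled the same way with the roles of x and y swapped.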


Table Structure Recognition: Text Block Recovering

[Figure: a sample table with the header cells “Fiscal year”, “R&D expenditures (bn yen)”, “GDP2) (bn yen)”, “Ratio of R&D expenditures to GDP” and the data row “1996”, “a) 15.079”, “506.480”, “a) 2.98”, shown as numbered word chunks (a) and as recovered text blocks (b)]

- Merging word chunks (a) into text blocks (b)

- In the best case, each block is the textual content of one cell

- Heuristics: adjacency in the order of appearance, no rulings between chunks, identical fonts, word and line spacing, vertical and horizontal projections
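A minimal Python sketch of how some of these heuristics could combine is given below. It covers adjacency in the order of appearance, identical fonts, and word and line spacing. The “no rulings” and projection checks are left out. The `Chunk` fields, the function names, and the way the factors `kw` and `kh` scale the font size into spacing thresholds are assumptions made for illustration, not the presented system's definitions. Coordinates are assumed to grow downward along y.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    left: float
    top: float
    right: float
    bottom: float
    font: str
    size: float


def can_merge(block, chunk, kw=1.0, kh=1.0):
    """Decide whether `chunk` continues `block` (a list of chunks)."""
    last = block[-1]
    if (last.font, last.size) != (chunk.font, chunk.size):
        return False  # identical fonts are required
    if abs(last.top - chunk.top) < 0.5 * last.size:
        # Same text line: the horizontal gap must look like word spacing.
        gap = chunk.left - last.right
        return 0 <= gap <= kw * last.size
    # Next text line: bounded vertical gap and horizontal overlap with the block.
    block_left = min(c.left for c in block)
    block_right = max(c.right for c in block)
    block_bottom = max(c.bottom for c in block)
    v_gap = chunk.top - block_bottom
    overlaps = chunk.left < block_right and block_left < chunk.right
    return 0 <= v_gap <= kh * last.size and overlaps


def merge_into_blocks(chunks, kw=1.0, kh=1.0):
    """Group word chunks, taken in their order of appearance, into blocks."""
    blocks = []
    for chunk in chunks:
        if blocks and can_merge(blocks[-1], chunk, kw, kh):
            blocks[-1].append(chunk)
        else:
            blocks.append([chunk])
    return blocks


words = [
    Chunk("Fiscal", 0, 0, 30, 10, "Arial", 10),
    Chunk("year", 5, 12, 25, 22, "Arial", 10),
    Chunk("R&D", 100, 0, 125, 10, "Arial", 10),
    Chunk("expenditures", 80, 12, 145, 22, "Arial", 10),
]
blocks = [[c.text for c in b] for b in merge_into_blocks(words)]
```

With these made-up coordinates the four words fall into two blocks, one per header cell.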


Table Structure Recognition: Cell Recovering

[Figure: two panels (a, b) in x-y page coordinates, titled “Bounding boxes of text blocks” and “Whitespace gaps”, with text blocks labelled b_i]

- There are two ways to arrange text blocks into cells
  - Analysis of whitespace gaps between text blocks
  - Analysis of connected text blocks (bounding boxes)

- Heuristics: a column containing only one non-empty cell is merged with the nearest column to the left
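The whitespace-gap analysis and the column-merging heuristic can be sketched in Python as follows. Both functions are hypothetical illustrations: `column_intervals` projects block bounding boxes onto the x axis so that the gaps between the resulting intervals act as column separators, and `merge_sparse_columns` applies the heuristic to a grid of cell strings in which an empty string marks an empty cell.

```python
def column_intervals(boxes):
    """Project (left, right) extents onto the x axis and merge overlaps.

    The whitespace gaps between the returned intervals separate columns.
    """
    intervals = []
    for left, right in sorted(boxes):
        if intervals and left <= intervals[-1][1]:
            intervals[-1][1] = max(intervals[-1][1], right)
        else:
            intervals.append([left, right])
    return [tuple(interval) for interval in intervals]


def merge_sparse_columns(grid):
    """Merge each column with exactly one non-empty cell into the nearest
    kept column to its left. The first column is always kept."""
    rows = [list(row) for row in grid]
    keep = []  # indices of the columns that survive
    for col in range(len(rows[0])):
        non_empty = sum(1 for row in rows if row[col])
        if keep and non_empty == 1:
            target = keep[-1]
            for row in rows:
                if row[col]:
                    row[target] = (row[target] + " " + row[col]).strip()
        else:
            keep.append(col)
    return [[row[col] for col in keep] for row in rows]


columns = column_intervals([(0, 30), (5, 25), (80, 145), (100, 125), (200, 240)])
table = merge_sparse_columns([
    ["A", "B", ""],
    ["1", "2", "x"],
    ["3", "4", ""],
])
```

Rows could be separated the same way by projecting the bounding boxes onto the y axis.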


Configuring

[Figure: F-score plotted against the width factor kw and the height factor kh (axis ticks from 1 to 4)]

- Defining formulas to set up word and line spacing, as well as vertical and horizontal projections

- Choosing predefined ad-hoc heuristics

- Searching for “optimal” parameters on a target dataset
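One simple way to search for such parameters is an exhaustive grid search over the two factors, sketched below in Python. The slides do not say which search procedure was used, so this is only an assumption. `toy_f_score` is a made-up stand-in for running the recognizer on a dataset and scoring the output.

```python
from itertools import product


def grid_search(evaluate, kw_values, kh_values):
    """Return (score, kw, kh) for the best-scoring pair of factors."""
    best = None
    for kw, kh in product(kw_values, kh_values):
        score = evaluate(kw, kh)
        if best is None or score > best[0]:
            best = (score, kw, kh)
    return best


def toy_f_score(kw, kh):
    # Hypothetical scoring function with a single peak at kw=2, kh=3.
    # A real one would run recognition on the dataset and compute the F-score.
    return 1.0 - 0.05 * abs(kw - 2.0) - 0.05 * abs(kh - 3.0)


steps = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
best_score, best_kw, best_kh = grid_search(toy_f_score, steps, steps)
```

Parameters tuned this way are optimal only for the dataset they were tuned on.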


Experimental Evaluation

- Two configurations were implemented. The better one achieves:

  Recall     0.9233
  Precision  0.9499
  F-score    0.9364

- The evaluation is based on
  - the methodology for evaluating algorithms for table understanding in PDF documents³
  - the “ICDAR 2013 Table Competition” dataset⁴
  - Nurminen’s Python scripts⁵ for comparing ground truth and results

³ M. Göbel, T. Hassan, E. Oro, G. Orsi. A methodology for evaluating algorithms for table understanding in PDF documents. In Proc. of DocEng’12, 2012, pp. 45-48.

⁴ M. Göbel, T. Hassan, E. Oro, G. Orsi. ICDAR 2013 Table Competition. In Proc. of the 12th ICDAR, Washington, DC, 2013, pp. 1449-1453.

⁵ http://tamirhassan.com/competition/dataset-tools.html
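The reported F-score agrees with the harmonic mean of the reported precision and recall, as the short Python check below shows. The `precision_recall` helper assumes that ground truth and result are compared as sets of cell adjacency relations. It is an illustrative sketch, not the competition scripts.

```python
def precision_recall(truth, result):
    """Compare two sets of adjacency relations (assumed representation)."""
    correct = len(truth & result)
    precision = correct / len(result) if result else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Check the reported figures against each other.
reported = round(f_score(0.9499, 0.9233), 4)
```

Here `reported` evaluates to 0.9364, the F-score given above.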


Web-Application for PDF Table Extraction

- Our experimental web application is available at http://cells.icc.ru/pdfte

- Currently, tables must be selected manually, but their structure is recognized automatically

- Extracted tables are available in HTML and Excel formats

- They can then be transformed into a relational form through http://cells.icc.ru/ssdc