IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

The Functional Extension Parser – a rule-based system for flexible structural analysis

Lukas Gander, University of Innsbruck
Bratislava, 07.05.2010



Overview
– Objectives of the Functional Extension Parser
– Concepts of the FEP
– Workflow
– FEP Core
– Current status
– Expected benefits
– Vision


Objectives of the FEP

The Functional Extension Parser (FEP) is a software tool capable of detecting and reconstructing some of the main features of a digitised book.

These features are:
– Page numbers
– Print space
– Logical structural elements such as
  – Footnotes
  – Headlines
  – Running titles
  – Marginalia
  – Signature marks
– Table of contents


Concepts of the FEP

Human beings are able to identify the logical structure elements of a book simply by looking at the layout, without understanding the language. A person intuitively applies a set of rules.

OCR output provides much more than simple full text:
– Coordinates of lines, blocks and strings
– Style information such as bold or italic
– Font size and font type
– Almost everything a user can see on the image is available in some form within the OCR output
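To make the idea of a layout rule applied to OCR output concrete, here is a minimal Python sketch. The `OcrLine` record, its field names and the thresholds are illustrative assumptions, not the FEP's actual data model or rule set. It only shows how text and coordinates from OCR output could drive a rule such as "a short, purely numeric line close to the top or bottom edge of the page is a page number candidate".

```python
import re
from dataclasses import dataclass


@dataclass
class OcrLine:
    """Hypothetical OCR line record (field names are assumptions)."""
    text: str
    top: int          # y coordinate of the upper edge, in pixels
    bottom: int       # y coordinate of the lower edge, in pixels
    font_size: float
    bold: bool = False


def is_page_number_candidate(line, page_height, edge_fraction=0.12):
    """Hypothetical rule: a line of 1 to 4 digits inside the top or bottom band."""
    if not re.fullmatch(r"\d{1,4}", line.text.strip()):
        return False
    band = page_height * edge_fraction
    in_top_band = line.bottom <= band
    in_bottom_band = line.top >= page_height - band
    return in_top_band or in_bottom_band
```

A real rule set would presumably combine several such features (font size, alignment, consistency across neighbouring pages); this sketch deliberately uses only position and text.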


FEP Workflow


FEP Core Architecture


Current Status

During the last year the whole infrastructure was set up. This includes:
– The Visualizer and Editor application, which is available online at http://dea-gulliver.uibk.ac.at/org.dea.impact.FEP_Prototype.FEP_Prototype/FEP_Prototype.html
– The FEP Core module, which uses a rule-based approach

First rule sets were developed for page number detection and print space reconstruction:
– 98.34% correctly detected page numbers
– 91.77% correctly reconstructed print spaces


Expected benefits

Page number detection
– Results of page number detection can be used for quality assurance of the whole digitisation process: pages that were lost during scanning can be identified, and duplicated pages can be detected.
– Page numbers are a prerequisite for users browsing through the book in a digital library application.
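The quality assurance idea can be sketched in a few lines of Python. The function name and the input format (one detected page number per scanned image, in scan order, with `None` where nothing was detected) are assumptions made for this illustration, not part of the FEP.

```python
from collections import Counter


def check_page_sequence(detected):
    """Return (missing, duplicated) page numbers for one scanned book.

    detected: list with one entry per image in scan order; None means
    no page number was detected on that image.
    """
    numbers = [n for n in detected if n is not None]
    counts = Counter(numbers)
    duplicated = sorted(n for n, c in counts.items() if c > 1)
    missing = []
    if numbers:
        present = set(numbers)
        # Every number between the smallest and largest one should occur.
        missing = [n for n in range(min(numbers), max(numbers) + 1)
                   if n not in present]
    return missing, duplicated


missing, duplicated = check_page_sequence([1, 2, 3, 3, 4, 6, 7])
print(missing, duplicated)  # [5] [3]
```

This simple version treats every gap as a lost page. Books often contain unnumbered pages (plates, chapter openings), so a practical check would need to compare gaps against the number of images between two detected numbers.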

Print space reconstruction
– The size of the page was always calculated on the basis of the print space. During digitisation, information about the margins of the document is lost. The margins needed for a reprint can be calculated from the print space using well-known reconstruction schemes.
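The slide does not name the reconstruction schemes it has in mind. As one assumed example, the classical page construction by division into ninths places the print space on 6/9 of the page width and height, with an inner margin of 1/9 and an outer margin of 2/9 of the page width, and a top margin of 1/9 and a bottom margin of 2/9 of the page height. Under that assumption, page size and margins follow directly from the print space; the function name below is hypothetical.

```python
def page_from_print_space(print_width, print_height):
    """Reconstruct page size and margins from the print space.

    Assumes the division-into-ninths scheme (print space = 6/9 of the
    page in both directions). Other schemes use other ratios.
    """
    page_width = print_width * 9 / 6
    page_height = print_height * 9 / 6
    return {
        "page_width": page_width,
        "page_height": page_height,
        "inner": page_width / 9,
        "outer": 2 * page_width / 9,
        "top": page_height / 9,
        "bottom": 2 * page_height / 9,
    }


layout = page_from_print_space(120, 180)  # print space in millimetres
print(layout["page_width"], layout["page_height"])  # 180.0 270.0
```

For a print space of 120 by 180 units this gives margins of 20 (inner), 40 (outer), 30 (top) and 60 (bottom), which add up to the reconstructed page of 180 by 270.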


Expected benefits (2)

Print space reconstruction
– All images can be cropped to the same size, which gives a pleasant look and feel (with the content centred) in digital repositories (e.g. Google Books).
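A uniform crop with centred content can be derived from the reconstructed print space alone. The sketch below is an illustration under assumed conventions (pixel coordinates, boxes given as left, top, right, bottom); the function name is hypothetical and not part of the FEP.

```python
def centered_crop_box(print_space, crop_width, crop_height):
    """Return a crop box of fixed size centred on the print space.

    print_space: (left, top, right, bottom) in pixels.
    The result may extend beyond the image borders; the caller is
    expected to pad the image in that case.
    """
    left, top, right, bottom = print_space
    crop_left = (left + right - crop_width) // 2
    crop_top = (top + bottom - crop_height) // 2
    return (crop_left, crop_top,
            crop_left + crop_width, crop_top + crop_height)


print(centered_crop_box((200, 300, 1400, 2100), 1600, 2200))
# (0, 100, 1600, 2300)
```

Using the same `crop_width` and `crop_height` for every page of a book yields images of identical size in which the print space sits at the same position.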

Logical structure reconstruction
– Improved knowledge discovery in digital repositories. Headlines, for example, are more important than normal text or footnotes. A reliable logical structure analysis allows these elements to be handled appropriately during indexing (e.g. headlines should be boosted; running titles and signature marks should be ignored).
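How structure labels could influence indexing is sketched below. The element type names and the boost values are assumptions chosen for illustration; a real search engine would apply such weights through its own field or boost configuration.

```python
# Assumed boost factors per logical element type (illustrative values).
BOOST = {"headline": 3.0, "paragraph": 1.0, "footnote": 0.5}
# Element types that should not reach the index at all.
IGNORED = {"running_title", "signature_mark"}


def index_terms(blocks):
    """Accumulate a weight per term from (element_type, text) pairs."""
    weights = {}
    for element_type, text in blocks:
        if element_type in IGNORED:
            continue
        boost = BOOST.get(element_type, 1.0)
        for term in text.lower().split():
            weights[term] = weights.get(term, 0.0) + boost
    return weights


blocks = [
    ("headline", "Printing History"),
    ("running_title", "Printing History"),
    ("paragraph", "history of printing"),
]
print(index_terms(blocks))
# {'printing': 4.0, 'history': 4.0, 'of': 1.0}
```

The running title repeats the headline on every page; ignoring it keeps the repeated words from distorting term weights.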


Expected benefits (3)

Reconstruction of the TOC eases navigation in
– PDF
– EPUB
– Online repositories

It is a very challenging task:
– Google Books shows good but not perfect results
– Microsoft Serbia won the INEX book structure 2008 competition with a precision of 53%


Vision
