Mathematical formulae recognition and logical structure analysis of mathematical papers

Masakazu Suzuki
Kyushu University

InftyProject (http://www.inftyproject.org)
Science Accessibility Net (http://www.sciaccess.net)

DML 2010
July 7, 2010, Paris


Plan of the talk

- About InftyProject
- Making Rich Digital Mathematical Libraries: Process Flow and Technical Components
- Formulae Recognition
- Adaptive Method: Character and Symbol Recognition
- Logical Structure Analysis

Section 1: About InftyProject

InftyProject

The beginning:
- Started in 1995 as a research project to help visually impaired people in scientific fields.
- Digitization of mathematical journals, books, etc.

Current research subjects:
- Recognition and understanding of math documents,
- User interface and data conversion, etc.

Policy:
- Priority on practical system development.

InftyProject

Main system development:
- InftyReader: Math OCR software
- InftyEditor: Editor of math documents
- Data conversion (XML, LaTeX, HTML, PDF, etc.)
- ChattyInfty: InftyEditor + speech output

URL: http://www.inftyproject.org

sAccessNet

InftyReader is used for:
- Helping people with visual handicaps working in scientific fields,
- Digitization of mathematical/scientific journals in Japan, e.g. J. Math. Soc. Japan, Japanese J. Math., Tokyo J. Math., etc. (11 journals of mathematics published in Japan),

by the not-for-profit organization “Science Accessibility Net”

http://www.sciaccess.net/

http://infty.kyushu-u.ac.jp

“InftyReader”: OCR software for math documents

Demonstration:
- Original image
- Recognition result

Sample: a sample of a math journal digitized using InftyReader

Section 2: Toward Rich DML

Digitization of Math Journals

Different levels:
- Level 1: Bitmap (scanned) images of printed materials, e.g. GIF, TIFF
- Level 2: Searchable digitized document, e.g. PDF with hidden text
- Level 3: Structured document with links, e.g. XML, HTML (+MathML), LaTeX, …
- Level 4: (Partially) executable document, e.g. Mathematica, Maple
- Level 5: Formally presented document, e.g. Mizar, OMDoc

Infty: Level 1 → Level 3

Process Flow of Digitization

Input: PDF, Image File (TIF), Texts
1. Layout Analysis: Segmentation of Areas (Text, Table, Figure)
2. Recognition per line (Character recognition, Math/Text segmentation, Math. Structure analysis)
3. Document Structure analysis (Chapter, Section, Itemize, Theorem description, References, etc.)
4. XML
Outputs: LaTeX, HTML+MathML, PDF, Braille codes, etc.

Layout Analysis

Input: PDF, Image File (TIF) (Pre processing)
1. Segmentation of Areas (Text, Table, Figure), Table Analysis
2. Recognition per line (Character recognition, Math. Structure analysis)
3. Document Structure analysis (Title, Chapter, Section, Itemize, Theorem, Bib, etc.)
4. XML
Outputs: LaTeX, HTML, Human readable TeX, Braille codes, Speak data, etc.


Line Segmentation

Line Segmentation (Sample) [figures]


A Method of Line Segmentation


Math/Text Segmentation

The number of characters in math areas is about 20% of all the characters in pure math journals.

Math. structure

No math. structure

Math/Text Segmentation

Recognition of Ordinary Texts & Math/Text Area Segmentation = Simultaneous Process using DP
1. Combination of different OCR engines
2. Score using relative position check

Current version: Infty + two commercial OCRs (Toshiba + Media Drive)

New version: FineReader engine will be added.
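The slide does not give the DP formulation, so the following is only a minimal sketch of how a DP can label the characters of one line as text or math in a single pass. It assumes each character already carries a text cost and a math cost (hypothetical stand-ins for the combined OCR and relative-position scores) and adds a hypothetical penalty for every switch between the two labels. The function name `segment_math_text` and all numbers are illustrative.

```python
def segment_math_text(costs, switch_penalty=0.5):
    """Label each character 'T' (text) or 'M' (math).

    costs[i] = (text_cost, math_cost) for the i-th character of a line.
    The labelling minimises the sum of the chosen costs plus
    switch_penalty for every change of label (two-state Viterbi DP).
    """
    if not costs:
        return ""
    best = list(costs[0])   # best[s]: cheapest labelling ending in state s
    back = []               # back[i][s]: previous state chosen for position i + 1
    for step in costs[1:]:
        new_best, pointers = [], []
        for s in (0, 1):
            stay = best[s]
            switch = best[1 - s] + switch_penalty
            if stay <= switch:
                new_best.append(stay + step[s])
                pointers.append(s)
            else:
                new_best.append(switch + step[s])
                pointers.append(1 - s)
        best = new_best
        back.append(pointers)
    # Trace the cheapest path backwards.
    state = 0 if best[0] <= best[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join("TM"[s] for s in reversed(path))


# Hypothetical costs for five characters: the third and fourth look like math.
example = [(0.1, 2.0), (0.2, 2.0), (1.5, 0.3), (1.4, 0.2), (0.1, 2.0)]
print(segment_math_text(example))
```

The switch penalty discourages isolated single-character math areas, which is one simple way to make the segmentation stable against local recognition errors.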

Methods for further improvement of ordinary text recognition + multilingual recognition

Introduction of the ABBYY FineReader engine.

Method and effect …

Method 1 (Recognition of words)

F: Nühama-gun,
E: Niiharna-gun,
I: Nii/ιama-gun,
Result: Niihama-gun,


Method 2 (Use character sizes and positions)



Formulae Recognition

- Recognition of a variety of rare symbols
- Distinction of fonts (Italic, Bold, Bbb, Calligraphic, etc.)
- Segmentation of touched/broken characters in math areas is still a difficult problem.
- Stable structure analysis of math formulae against the misrecognition of characters
- Distinction of noise and small symbols


Document Structure Analysis

Detection of: Title, Author, Section, Subsection, Itemization, BibItem, Theorem, Lemma, etc.

- A naïve method: line classification using a combination of features such as character size, font information (Bold, Italic, Small Capital), keywords, indentation, starting with numbers or a special pattern (e.g. “[Num]”), etc.
- A stronger method is required in actual digitization.

Hyperlinks inside the document.
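The naïve line classification described above can be sketched as a small rule table. This is an illustration only, not Infty's implementation: the `Line` record, the regular expressions, the 1.4 size factor and the rule order are all hypothetical, and a real journal would need its own tuning.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    font_size: float        # in points
    bold: bool = False
    italic: bool = False
    indent: float = 0.0     # left offset relative to the text block


# Hypothetical patterns for the "keywords" and "special pattern" features.
BIB_ITEM = re.compile(r"^\[\d+\]\s")                  # e.g. "[12] M. Suzuki, ..."
SUBSECTION = re.compile(r"^\d+\.\d+\.?\s+\S")         # e.g. "2.1 Layout analysis"
SECTION = re.compile(r"^\d+\.\s+\S")                  # e.g. "2. Toward Rich DML"
THEOREM_LIKE = re.compile(r"^(Theorem|Lemma|Proposition|Corollary)\b")
ITEM = re.compile(r"^(\(\w+\)|[-•])\s")               # e.g. "(ii) ..." or "- ..."


def classify_line(line: Line, body_size: float = 10.0) -> str:
    """Return a logical label for one recognised line."""
    text = line.text.strip()
    if line.font_size >= 1.4 * body_size:
        return "Title"
    if BIB_ITEM.match(text):
        return "BibItem"
    keyword = THEOREM_LIKE.match(text)
    if keyword and (line.bold or line.italic):
        return keyword.group(1)
    if SUBSECTION.match(text) and line.bold:
        return "Subsection"
    if SECTION.match(text) and line.bold:
        return "Section"
    if ITEM.match(text) and line.indent > 0:
        return "Itemization"
    return "Text"


print(classify_line(Line("Lemma 2.3. Let f be a map.", 10.0, bold=True)))
```

Rules of this kind break as soon as a journal changes its style, which is consistent with the remark that a stronger method is required in actual digitization.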

Section 3: Formulae Recognition by Infty

Formulae recognition

• R. Anderson, Syntax directed recognition of hand-printed two dimensional mathematics, Interactive Systems for Experimental Applied Mathematics, M. Klerer and J. Reinfelds, Eds., Academic Press, 1968, pp. 436-459.
• M. Okamoto and H. Twaakyondo, Structure analysis and recognition of mathematical expressions, 3rd ICDAR, Montreal, 1995, pp. 430-437.
• R. J. Fateman, T. Tokuyasu, B. P. Berman and N. Mitchell, Optical Character Recognition and Parsing of Typeset Mathematics, Journal of Visual Communication and Image Representation, vol. 7, no. 1, 1996, pp. 2-15.
• Y. Eto and M. Suzuki, Mathematical formula recognition using virtual link network, 6th ICDAR, Seattle, 2001, IEEE Computer Society Press, pp. 430-437.

Infty OCR engine

- Developed in Suzuki Lab., using more than 1,500,000 sample images of characters and symbols from various math. books/journals.
- Recognizes more than 500 categories:
  - Various math symbols
  - Various fonts: Roman, Italic, Calligraphic, Bbb, some German fonts, etc.
- High speed: three step classification, “rough” classification → “strict” classification

3 step classifications

[Figure: input image → classification by sectional features (3 dim) → classification by directional element features (36 dim) → classification by peripheral feature + density feature (128 dim) → recognition result (candidates, e.g. a, α, d). Each classification step refers to a DB.]
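A coarse-to-fine cascade of this kind can be sketched as follows. This is a toy illustration under stated assumptions, not the Infty engine: it uses plain nearest-neighbour distances, tiny 2-dimensional stand-ins for the 3, 36 and 128 dimensional features, and hypothetical names (`cascade_classify`, `database`, `keep`).

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cascade_classify(sample, database, keep=(3, 2, 2)):
    """Rough-to-strict classification.

    sample:   one feature vector per stage, cheapest feature first.
    database: {label: [vector_stage0, vector_stage1, vector_stage2]}.
    keep[k]:  how many candidates survive stage k.
    Expensive features are only compared for candidates that survived
    the cheaper stages, which is where the speed-up comes from.
    """
    candidates = list(database)
    for stage, limit in enumerate(keep):
        candidates.sort(
            key=lambda label: squared_distance(sample[stage], database[label][stage])
        )
        candidates = candidates[:limit]
    return candidates


# Hypothetical toy database with 2-dimensional features at every stage.
database = {
    "a": [(0.5, 0.5), (0.2, 0.8), (0.1, 0.9)],
    "α": [(0.5, 0.6), (0.3, 0.7), (0.6, 0.4)],
    "d": [(0.4, 0.9), (0.2, 0.9), (0.2, 0.8)],
    "x": [(0.9, 0.1), (0.2, 0.8), (0.1, 0.9)],
}
sample = [(0.5, 0.52), (0.22, 0.8), (0.12, 0.88)]
print(cascade_classify(sample, database))
```

In the toy data, "x" is discarded at the first stage even though its later features match the sample, which shows the usual trade-off of a cascade: speed against the risk of pruning too early.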

Voting method

Three classifiers each rank their candidates:
- Directional element feature (36 dim)
- Peripheral feature (64 dim)
- Density feature (64 dim)

Order of candidates (scores): 1st (5), 2nd (4), 3rd (3), 4th (2), 5th (1)

Candidates (total scores): a: 1st (14), α: 2nd (9), d: 3rd (7)

Voting → Normalization of the score of symbol recognition
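The voting rule can be sketched in a few lines. The rank scores 5, 4, 3, 2, 1 are taken from the slide; the three candidate rankings below are hypothetical, chosen only so that the totals reproduce the 14, 9 and 7 shown for a, α and d, since the rankings in the original figure are not recoverable.

```python
from collections import defaultdict

RANK_SCORES = (5, 4, 3, 2, 1)   # score for 1st .. 5th place, as on the slide


def vote(rankings):
    """Sum the rank scores each candidate gets from every classifier.

    rankings: one list of up to five candidates per classifier, best first.
    Returns (candidate, total_score) pairs, highest total first.
    """
    totals = defaultdict(int)
    for ranking in rankings:
        for score, candidate in zip(RANK_SCORES, ranking):
            totals[candidate] += score
    return sorted(totals.items(), key=lambda item: -item[1])


# Hypothetical rankings from the three feature classifiers.
directional = ["a", "α", "σ", "d", "e"]
peripheral = ["a", "θ", "d", "α", "e"]
density = ["o", "a", "α", "d", "e"]

print(vote([directional, peripheral, density])[:3])
# [('a', 14), ('α', 9), ('d', 7)]
```

Because every classifier contributes scores on the same 1 to 5 scale, the totals are comparable across symbols, which is the normalization effect the slide points to.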

Structure Analysis of Formulae

[Figure: input (image) → output (tree structure), with links labeled Horizontal and RSubScript]

Structure Analysis of Formulae

$\sum_{i=0}^{n} a_i$

Tree structure: Σ with “i = 0” under it and “n” above it, followed by “a” with subscript “i”.
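To make the tree output concrete, here is a minimal sketch that stores such a structure tree and prints it as LaTeX. The `Node` class and `to_latex` function are hypothetical; the link labels (Horizontal, Upper, Under, Rsup, Rsub) are the ones named in this talk, the left-side labels Lsup and Lsub are left out, and a node is assumed to carry at most one lower and one upper link.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str                                  # LaTeX for one recognised character
    links: dict = field(default_factory=dict)    # link label -> child Node


def to_latex(node):
    """Serialise a structure tree to LaTeX (Lsup/Lsub not handled)."""
    out = node.symbol
    # Under and Rsub both become a LaTeX subscript, Upper and Rsup a superscript.
    for labels, mark in ((("Under", "Rsub"), "_"), (("Upper", "Rsup"), "^")):
        for label in labels:
            if label in node.links:
                out += mark + "{" + to_latex(node.links[label]) + "}"
    if "Horizontal" in node.links:
        out += " " + to_latex(node.links["Horizontal"])
    return out


# The sum from the slide: Σ with "i = 0" under, "n" upper, then a with subscript i.
lower = Node("i", {"Horizontal": Node("=", {"Horizontal": Node("0")})})
term = Node("a", {"Rsub": Node("i")})
sigma = Node(r"\sum", {"Under": lower, "Upper": Node("n"), "Horizontal": term})

print(to_latex(sigma))
# \sum_{i = 0}^{n} a_{i}
```

The same tree could be walked to emit MathML or Braille codes instead, which is why a tree is a convenient intermediate form.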

Structure Analysis of Formulae

Some difficult cases: [figures]

Link possibilities: [figure]

Similar characters (e.g. S, s, Z, z): which candidates are appropriate?

Structure Analysis of Formulae

[Figure: virtual link network for $\int_a^b x^2\,dx$, where a character may have several candidates, e.g. x and χ]

Search for the correct spanning tree

Virtual link network

Input image → Virtual link network

- Nodes = candidates of character recognition
- Each link has a label and a link cost.
- Link labels: Horizontal, Upper, Under, Rsup, Rsub, Lsup, Lsub

Link Cost

Definitions of normalized size (NSize) and normalized center (NCenter)

x:y:z = 28:51:21 (default value)

Link Cost

[Figure: distribution map of character pairs in the (H, D)-plane, showing the horizontal, superscript and subscript positions]
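The definitions of NSize, NCenter, H and D were on figures that did not survive, so the sketch below is one plausible reading rather than Infty's definition. It assumes that x:y:z = 28:51:21 are the shares of the ascender zone, the x-height zone and the descender zone in a full line box, that NSize is the height of that full box and NCenter its vertical center, and that a pair is judged by relative size and relative center offset. The glyph classes and the thresholds 0.8 and 0.2 are hypothetical.

```python
ASCENDER, XHEIGHT, DESCENDER = 0.28, 0.51, 0.21   # x:y:z = 28:51:21

# Hypothetical glyph classes: (uses ascender zone, uses x-height zone, uses descender zone).
ZONES = {
    "x-height": (False, True, False),    # a, c, e, x, ...
    "ascender": (True, True, False),     # b, d, k, digits, capitals
    "descender": (False, True, True),    # g, p, q, y
    "full": (True, True, True),          # brackets, j
}


def normalize(top, bottom, glyph_class):
    """Return (NSize, NCenter) of the full line box implied by a glyph box.

    Image coordinates: y grows downward, so top < bottom.
    """
    has_ascender, _, has_descender = ZONES[glyph_class]
    covered = XHEIGHT
    covered += ASCENDER if has_ascender else 0.0
    covered += DESCENDER if has_descender else 0.0
    nsize = (bottom - top) / covered
    line_top = top - (0.0 if has_ascender else ASCENDER * nsize)
    return nsize, line_top + nsize / 2


def relation(parent, child, size_limit=0.8, offset_limit=0.2):
    """Guess the link label between two glyphs given as (top, bottom, glyph_class)."""
    parent_size, parent_center = normalize(*parent)
    child_size, child_center = normalize(*child)
    h = child_size / parent_size                        # relative size
    d = (parent_center - child_center) / parent_size    # > 0: child is raised
    if h <= size_limit and d >= offset_limit:
        return "Rsup"
    if h <= size_limit and d <= -offset_limit:
        return "Rsub"
    return "Horizontal"


x = (49, 100, "x-height")                    # base character "x"
print(relation(x, (10, 65.3, "ascender")))   # a small raised "2"
```

Normalizing first means that "x" next to "y" and "x" next to "b" are both seen as same-size, same-center pairs even though their raw bounding boxes differ. In the talk the decision is not a hard threshold as here but a cost learned from the distribution of real character pairs.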

[Figures: distribution maps in the (H, D)-plane by pair type: Big Operators-Alphabets, Alphabets-Alphabets, Alphabets-Operators, Integrals-Alphabets]

Extraction of Structure Tree

Network → Structure Tree: optimization under constraints
• Minimum total cost
• Link restrictions

Extraction of the minimum cost spanning tree under these constraints is NP-hard!
↓
Strategy of the current version: Beam search

Extraction of Structure Tree

[Figure: the network is linearized]

Search of spanning tree

Each path corresponds to a spanning tree!

Beam search: at each step, we hold a fixed number (= beam) of paths with the lowest costs, and use them at the next step.
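The beam search described above can be sketched as follows. This is a simplified model, not the Infty code: symbols are assumed to be already linearized, each symbol picks one parent among the earlier symbols together with a link label, and the only link restriction modelled is that a parent cannot have two children with the same label. The function name, the `candidate_links` layout and all costs are hypothetical.

```python
import heapq


def beam_search_tree(n, candidate_links, beam_width=3):
    """Search a low-cost spanning tree over n linearised symbols.

    candidate_links[k]: list of (parent, label, cost) options for symbol k,
    with parent < k. A path is one choice per symbol, so every complete
    path is a spanning tree rooted at symbol 0.
    Returns (total_cost, links) with links[k - 1] = (parent, label) for
    symbol k, or None if no path satisfies the restriction.
    """
    beam = [(0.0, [])]                        # (cost so far, links chosen so far)
    for k in range(1, n):
        extended = []
        for cost, links in beam:
            used = set(links)
            for parent, label, link_cost in candidate_links[k]:
                if (parent, label) in used:   # link restriction
                    continue
                extended.append((cost + link_cost, links + [(parent, label)]))
        if not extended:
            return None
        # Keep only the beam_width cheapest partial paths for the next step.
        beam = heapq.nsmallest(beam_width, extended, key=lambda path: path[0])
    return min(beam, key=lambda path: path[0])


# Hypothetical costs for three symbols "x", "2", "y" (intended reading: x^2 y).
links = {
    1: [(0, "Horizontal", 0.5), (0, "Rsup", 1.0)],
    2: [(0, "Horizontal", 0.5), (1, "Horizontal", 2.0)],
}
print(beam_search_tree(3, links, beam_width=1))
print(beam_search_tree(3, links, beam_width=2))
```

With a beam of 1 the search commits to the locally cheapest link and ends with total cost 2.5, while a beam of 2 finds the tree with cost 1.5. The same effect at larger scale is how a beam search can miss the optimal solution.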

Extraction of Structure Tree

- Extraction of the minimum cost spanning tree is NP-hard.
- Beam search sometimes fails to find the optimal solution.
- Some other better strategy?

Section 4: Large Volume Recognition

Large Volume Digitization

- Retro-digitization of journals,
- Reproduction of old books/series of books,
- Translation to different languages,
- Braille transcription, DAISY talking books,
- etc.

Large Volume Digitization

The adaptive method is efficient:

Get information from the target document:
- Character features,
- Math formula parameters,
- Layout parameters, etc.

→ Recognition, either directly or after manual checking (semi-automatic)

“InftySystem” for large scale digitization

Applications:
1. InftyReader, downloadable from our web site: http://www.sciaccess.net
2. InftyReader Pro (professional version)
3. BatchInfty
4. CharImageManager

“InftySystem” for large scale digitization

Process flow using BatchInfty & InftyReader Pro:
1. Noise reduction, centering, etc.
2. Trial recognition
3. Extraction of features:
   - Document style → Logical structure analysis
   - Character cluster images → OCR engine
4. Recognition & verification
5. PDF output

Problems

Full automation of the adaptive method:
- From the target documents: extraction of character features / layout parameters
- Improvement of character recognition, formulae recognition and logical structure analysis
- Without manual correction

Problems

Further improvement of character/symbol recognition and structure analysis of math expressions:
- Touched characters and broken characters in math areas
- Low resolution images
- Different typefaces (old books, typewriter prints, etc.)
- Bold character detection in math areas

Problems

Logical structure analysis (automatic detection and manual correction):
- Title, Author, Section, Subsection, Itemization, BibItem, Theorem, Lemma, etc.
- Hyperlinks inside the document.

Problems

Detection/analysis of figures and tables:
- Detection of characters in figures
- Table structure analysis
- Graphs → Tables

Challenge

Is it possible to realize OCR with higher accuracy than manual input/correction by humans?

(I hope a student who takes on this difficult problem will appear in the near future!)

Conclusion

- InftyProject: research group of math information processing.
- Demo (InftyReader).
- A brief sketch of the methods used in Infty and the current state of the art.
- There are many unsolved problems, especially in a practical sense.
- Proposed some problems to be attacked.

“INFTY”: an integrated OCR for mathematical documents

Thank you!

Masakazu Suzuki
Graduate School / Faculty of Math.
Kyushu University
E-mail: [email protected]

InftyProject: http://www.inftyproject.org
Science Accessibility Net: http://www.sciaccess.net