Document Analysis: Segmentation & Layout Analysis. Prof. Rolf Ingold, University of Fribourg. Master course, spring semester 2008.

Document Analysis: Segmentation & Layout Analysis




© Prof. Rolf Ingold

Outline

  • Objectives of layout analysis
  • Classification of layout analysis methods
  • Splitting methods
  • Grouping methods
  • Text-Graphics-Image Separation
  • Text line segmentation
  • Word and character segmentation
  • Field extraction from forms


Objectives of layout analysis and segmentation

The role of segmentation is to split a document image into regions of interest

Regions of interest may be of different granularity levels: graphics or text blocks, text lines, words, characters

The goal of layout analysis is to get a hierarchical description of segmented objects


Segmentation strategies

Segmentation produces a hierarchy of physical objects

Two strategies can be used:

  • top-down segmentation: starting with the entire image, split it recursively down to elementary shapes
  • bottom-up segmentation: starting at pixel level, detect connected components and group them hierarchically

Hybrid methods combine both strategies

Segmentation methods can be:

  • data-driven, i.e., using only data properties (without contextual knowledge)
  • model-driven, i.e., using contextual knowledge


Top-down methods

Top-down methods decompose the entire page into a hierarchy of rectangular regions

Top-down approaches include:

  • recursive XY-cuts
  • horizontal and vertical projection profile analysis
  • white streams (spaces) analysis
  • run length smearing algorithm (RLSA)


Recursive XY-Cut

The page is cut alternately horizontally and vertically according to white spaces

  • Robust for most modern printed documents
  • Supposes page images to be unskewed
  • Does not work for all kinds of layouts:
      • Non-rectangular formatting
      • Complex mosaics (illustration next)

Resulting hierarchy may not reflect the natural structure (illustration below)
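The recursive XY-cut can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: it assumes a deskewed binary image stored as a list of rows (1 = black), and the names `xy_cut` and `min_gap` are hypothetical. At each level it cuts at the widest white gap, trying a horizontal cut first, and it returns only the leaf rectangles instead of the full hierarchy.

```python
def xy_cut(img, top, bottom, left, right, min_gap=2):
    """Split region rows [top, bottom) x columns [left, right) recursively.

    Returns the leaf boxes as (top, bottom, left, right) tuples.
    """
    # Rows and columns of the region that contain at least one black pixel.
    rows = [y for y in range(top, bottom) if any(img[y][left:right])]
    cols = [x for x in range(left, right)
            if any(img[y][x] for y in range(top, bottom))]
    if not rows:
        return []  # completely white region
    # Shrink the region to the bounding box of its black pixels.
    top, bottom, left, right = rows[0], rows[-1] + 1, cols[0], cols[-1] + 1

    # Try a horizontal cut (gap between inked rows), then a vertical one.
    for axis, inked in (("y", rows), ("x", cols)):
        best = None
        for a, b in zip(inked, inked[1:]):
            gap = b - a - 1  # number of white lines between a and b
            if gap >= min_gap and (best is None or gap > best[1] - best[0] - 1):
                best = (a, b)
        if best:
            a, b = best
            if axis == "y":
                return (xy_cut(img, top, a + 1, left, right, min_gap)
                        + xy_cut(img, b, bottom, left, right, min_gap))
            return (xy_cut(img, top, bottom, left, a + 1, min_gap)
                    + xy_cut(img, top, bottom, b, right, min_gap))
    return [(top, bottom, left, right)]  # no gap wide enough: leaf block
```

Trying the horizontal cut first is an arbitrary choice in this sketch; a different cut order can produce a different hierarchy.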


Top-Down Segmentation

Recursive splitting can be performed by horizontal and vertical profile analysis

Images need to be "unskewed"!


Top-Down Segmentation (2)

Order in which X-Y cuts are performed is critical


White streams analysis

Principle:

  • detect maximal rectangular white blocks
  • split regions recursively according to thresholds
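As a rough illustration of the first step, the sketch below finds the single largest all-white rectangle in a binary image (list of rows, 1 = black). It is only a building block under stated assumptions: a white streams analysis would enumerate all maximal white rectangles and filter them by thresholds, and the function name `largest_white_rectangle` is hypothetical.

```python
def largest_white_rectangle(img):
    """Return (area, top, left, bottom, right) of the largest white rectangle.

    Bounds are half-open: rows [top, bottom), columns [left, right).
    """
    width = len(img[0])
    heights = [0] * width
    best = (0, 0, 0, 0, 0)
    for y, row in enumerate(img):
        # heights[x] = number of consecutive white pixels ending at row y.
        heights = [0 if v else h + 1 for v, h in zip(row, heights)]
        stack = []  # column indices whose heights are strictly increasing
        for x in range(width + 1):
            h = heights[x] if x < width else 0  # sentinel flushes the stack
            while stack and heights[stack[-1]] >= h:
                top_h = heights[stack.pop()]
                left = stack[-1] + 1 if stack else 0
                area = top_h * (x - left)
                if area > best[0]:
                    best = (area, y - top_h + 1, left, y + 1, x)
            stack.append(x)
    return best
```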


Run Length Smearing Algorithm (RLSA)

The Run Length Smearing Algorithm (RLSA) is a morphological operator

  • it replaces white runs that are shorter than or equal to a given threshold by black runs
  • it can be applied horizontally as well as vertically
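A minimal Python sketch of the operator on a single scan line (1 = black, 0 = white). One assumption is made here that the slide does not state: only white runs enclosed by black pixels on both sides are filled, so margins touching the line ends stay white. The name `rlsa_row` is illustrative.

```python
def rlsa_row(row, threshold):
    """Fill white runs of length <= threshold lying between two black pixels."""
    out = list(row)
    last_black = None  # index of the most recent black pixel seen
    for x, v in enumerate(row):
        if v:
            if last_black is not None and 0 < x - last_black - 1 <= threshold:
                for i in range(last_black + 1, x):
                    out[i] = 1  # smear the short white run
            last_black = x
    return out
```

With a threshold of 2, the run of two white pixels in `[1, 0, 0, 1, 0, 0, 0, 0, 1, 0]` is filled, while the run of four and the trailing white pixel are left untouched. Applying the same function to the columns of an image gives the vertical variant.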


RLSA based segmentation

RLSA can be used to segment a page into blocks in three steps:

  • applied horizontally
  • applied vertically
  • both results combined by a logical AND operator

Threshold values are critical and have to be chosen

  • according to the document class
  • using statistical white space analysis
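The three steps can be sketched as below, assuming a binary image stored as a list of rows (1 = black) and thresholds of at least 1. The helper names `smear` and `rlsa_blocks` are hypothetical, and the regular expression is just one compact way to fill short white runs enclosed by black pixels.

```python
import re

def smear(line, t):
    """Fill white runs of length 1..t that are enclosed by black pixels."""
    s = "".join("1" if v else "0" for v in line)
    # (?<=1) and (?=1) require a black pixel on each side of the white run.
    s = re.sub(r"(?<=1)0{1,%d}(?=1)" % t, lambda m: "1" * len(m.group()), s)
    return [int(c) for c in s]

def rlsa_blocks(img, t_h, t_v):
    """Horizontal RLSA, vertical RLSA, then pixel-wise logical AND."""
    horiz = [smear(row, t_h) for row in img]
    cols = [smear(col, t_v) for col in zip(*img)]  # smear each column
    vert = [list(r) for r in zip(*cols)]           # transpose back to rows
    return [[a & b for a, b in zip(hr, vr)] for hr, vr in zip(horiz, vert)]
```

Because of the AND, a white pixel turns black only if it is smeared in both directions, which is why the horizontal and vertical thresholds have to be tuned together.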


Bottom-up methods

Bottom-up methods start at pixel level and group pixels together into a hierarchy of

  • multi-rectangular regions (shapes delimited by horizontal and vertical segments)
  • arbitrary shapes

Bottom-up methods use

  • connected component extraction
  • region grouping


Connected components

In a binary image, a connected component is a set of black pixels connected by 4- or 8-adjacency

Example (illustration): five 4-connected components, two 8-connected components
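The effect of the adjacency choice can be checked with a small flood-fill counter. This is a sketch on a made-up image, not the slide's figure: two diagonal strokes were chosen so that the counts come out as five under 4-adjacency and two under 8-adjacency. The name `count_components` is illustrative.

```python
from collections import deque

def count_components(img, k=4):
    """Count connected components of black pixels under k-adjacency (k = 4 or 8)."""
    offsets = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    if k == 8:
        offsets += [(1, 1), (1, -1), (-1, 1), (-1, -1)]  # add diagonal neighbors
    h, w = len(img), len(img[0])
    seen = set()
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] and (x, y) not in seen:
                count += 1  # unvisited black pixel starts a new component
                seen.add((x, y))
                queue = deque([(x, y)])
                while queue:
                    cx, cy = queue.popleft()
                    for dx, dy in offsets:
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and img[ny][nx] and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            queue.append((nx, ny))
    return count

# Hypothetical test image: a 3-pixel and a 2-pixel diagonal stroke.
DIAGONALS = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
```

Diagonal neighbors only count under 8-adjacency, so each stroke is one component there but falls apart into single pixels under 4-adjacency.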


Extraction of connected components

Connected components can be extracted by different algorithms:

  • by a one-pass full image scanning process, from top to bottom and from left to right
  • by a border following algorithm, using as first pixel a border pixel supposed to be known


Scanning based CC Extraction

for each scan line ly
    for each black run r
        if on line ly-1 there is no run k-adjacent to r
            create a new component containing r
        else if on line ly-1 there exists exactly one run r’ k-adjacent to r
            add r to the component containing r’
        else if on line ly-1 there exist several runs ri k-adjacent to r
            merge all components containing such an ri
            add r to that component
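The scanning procedure above can be made runnable with a union-find structure that handles the merge case. The sketch assumes a binary image as a list of rows (1 = black); runs are half-open column intervals, and the function names are illustrative rather than taken from the course.

```python
def extract_runs(row):
    """Return the black runs of a row as (start, end) pairs, end exclusive."""
    runs, start = [], None
    for x, v in enumerate(row):
        if v and start is None:
            start = x
        elif not v and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(row)))
    return runs

def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def connected_components(image, k=8):
    """Group runs into k-connected components in one top-to-bottom scan."""
    runs, parent, prev = [], [], []   # prev: run indices on line y-1
    slack = 1 if k == 8 else 0        # 8-adjacency also accepts diagonal contact
    for y, row in enumerate(image):
        cur = []
        for s, e in extract_runs(row):
            idx = len(runs)
            runs.append((y, s, e))
            parent.append(idx)        # new component containing only this run
            for j in prev:
                _, ps, pe = runs[j]
                if ps < e + slack and s < pe + slack:  # runs are k-adjacent
                    parent[find(parent, j)] = find(parent, idx)  # add or merge
            cur.append(idx)
        prev = cur
    groups = {}
    for i, run in enumerate(runs):
        groups.setdefault(find(parent, i), []).append(run)
    return list(groups.values())
```

The three branches of the pseudocode collapse into a single loop here: with no adjacent run the new component stays alone, with one adjacent run the two are united, and with several adjacent runs the successive unions merge all of them.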


Border following algorithm

(S denotes the set of black pixels of the component)

consider P0 ∈ S having a 4-neighbor Q0 ∉ S

P ← P0 ; Q ← Q0 ; d ← direction of Q relative to P ;

repeat
    let Ri be the neighbor of P in direction (d+i) mod 8
    if R2 ∉ S then Q ← R2 ; d ← (d+2) mod 8 ;
    else
        if R1 ∉ S then P ← R2 ; Q ← R1 ;
        else P ← R1 ; d ← (d−2) mod 8 ;
    add P to the contour
until P = P0 and Q = Q0

(Illustration: pixel P, its neighbor Q in direction d, and the neighbors R1 and R2)
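A Python sketch of the border following loop, under the assumption that the set-membership symbols lost in the slide text read as follows: Q is always a background pixel, so R2 and R1 are tested for not belonging to S before Q is moved onto them. S is represented as a set of (x, y) tuples, directions are numbered counter-clockwise with y pointing down, and the name `follow_border` is illustrative.

```python
# Direction index -> (dx, dy); even indices are the four 4-neighbor directions.
DIRS = [(1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1)]

def follow_border(S, p0, q0):
    """Trace the border of S starting at p0 in S with 4-neighbor q0 not in S."""
    p, q = p0, q0
    d = DIRS.index((q0[0] - p0[0], q0[1] - p0[1]))  # direction of Q seen from P
    contour = []
    while True:
        dx1, dy1 = DIRS[(d + 1) % 8]
        dx2, dy2 = DIRS[(d + 2) % 8]
        r1 = (p[0] + dx1, p[1] + dy1)   # diagonal neighbor
        r2 = (p[0] + dx2, p[1] + dy2)   # next 4-neighbor
        if r2 not in S:
            q, d = r2, (d + 2) % 8      # turn around P, P stays in place
        elif r1 not in S:
            p, q = r2, r1               # step to R2, direction unchanged
        else:
            p, d = r1, (d - 2) % 8      # step diagonally to R1, Q unchanged
        contour.append(p)               # as in the pseudocode, P may repeat
        if p == p0 and q == q0:
            return contour
```

Because P is appended at every iteration, including those where only Q moves, the returned list contains repeated pixels; `set(follow_border(...))` gives the distinct border pixels.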


Illustration of connected components


Connected components from RLSA

Connected components can be used to detect characters

Words can be located using RLSA


Grouping components

Grouping connected components is nontrivial

Grouping rules are based on

  • relative positioning
  • distances and thresholds
  • component classification

Parameters can be estimated statistically


Allen's relations in 2D space

The relative positioning of two rectangles generates 169 configurations (13 Allen interval relations on each axis, 13 × 13 = 169)!
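The count of 169 can be reproduced with a short sketch: Allen's interval algebra distinguishes 13 relations between two intervals, and a pair of axis-parallel rectangles gets one relation per axis. Intervals are (start, end) pairs with start < end, rectangles are (x1, y1, x2, y2); the function names are illustrative.

```python
def allen(a, b):
    """Return the Allen relation of interval a with respect to interval b."""
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if b2 < a1:
        return "after"
    if b2 == a1:
        return "met-by"
    # From here on the interiors of the two intervals intersect.
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if a1 < b1:
        return "overlaps" if a2 < b2 else "contains"
    return "during" if a2 < b2 else "overlapped-by"

def rectangle_relation(r, s):
    """Relative position of rectangles: one Allen relation per axis."""
    return (allen((r[0], r[2]), (s[0], s[2])), allen((r[1], r[3]), (s[1], s[3])))

# Enumerate all intervals with endpoints in 0..3 to observe every relation.
INTERVALS = [(i, j) for i in range(4) for j in range(i + 1, 4)]
RELATIONS_1D = {allen(a, b) for a in INTERVALS for b in INTERVALS}
RECTS = [(x1, y1, x2, y2) for x1, x2 in INTERVALS for y1, y2 in INTERVALS]
RELATIONS_2D = {rectangle_relation(r, s) for r in RECTS for s in RECTS}
```

Four distinct endpoint values are enough to realize every relation, so the enumeration yields 13 relations in one dimension and 169 pairs in two.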


Threshold estimation

Thresholds can be estimated from statistical distributions of

  • horizontal spaces, for grouping characters into words and words into text lines
  • vertical spaces, for grouping text lines into text blocks
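One simple way to derive such a threshold, shown here only as an assumed heuristic and not as the method used in the course, is to sort the observed gap widths and split them at the widest jump, which separates small intra-word gaps from larger inter-word gaps when the distribution is clearly bimodal. The name `gap_threshold` and the sample values are made up for illustration.

```python
def gap_threshold(gaps):
    """Return a threshold halfway across the widest jump in the sorted gap values."""
    values = sorted(set(gaps))
    if len(values) < 2:
        raise ValueError("need at least two distinct gap widths")
    # Each candidate is (jump size, lower value, upper value).
    jump, lo, hi = max((b - a, a, b) for a, b in zip(values, values[1:]))
    return (lo + hi) / 2

# Hypothetical horizontal gaps (in pixels) measured along one text line.
SAMPLE_GAPS = [2, 3, 2, 3, 2, 12, 3, 2, 11, 2, 3, 13]
```

For the sample above the widest jump lies between 3 and 11, so gaps below 7 pixels would be treated as character spacing and larger ones as word spacing. On real pages a histogram-based method is more robust than this single-jump rule.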


Distributions of component sizes

Components can be classified according to their size into

  • symbols
  • letters
  • hairlines
  • punctuation


Region grouping


Docstrum

The docstrum method [O'Gorman] uses a graph that connects each connected component to its k closest neighbors
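The neighbor graph can be sketched as follows, assuming each connected component is reduced to its centroid. For every component the k nearest neighbors are found by brute force, and each edge is described by its length and angle, which are the quantities the docstrum (document spectrum) is built from. The function name `docstrum_pairs` is illustrative.

```python
import math

def docstrum_pairs(centroids, k=5):
    """Return (distance, angle in degrees) for each of the k nearest neighbors
    of every centroid; centroids are (x, y) tuples."""
    pairs = []
    for i, (x, y) in enumerate(centroids):
        neighbors = sorted(
            (math.hypot(ox - x, oy - y),
             math.degrees(math.atan2(oy - y, ox - x)))
            for j, (ox, oy) in enumerate(centroids) if j != i
        )
        pairs.extend(neighbors[:k])  # keep the k closest
    return pairs
```

For centroids lying on one horizontal text line, all angles are 0 or 180 degrees and the distances cluster around the character spacing, which is what makes the distribution usable for estimating orientation and spacing.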


Model-driven layout analysis [Azokly95]


Generic macrostructures

In a model-driven approach, generic macrostructures are used

  • a formal language describes margins and separators


Formal description of macrostructures

VOLUME Article IS
    WIDTH = 160; HEIGHT = 240;
    PAGE Garde IS ... END;
    PAGE Paire IS
        HSEP hs1 = (4, 3, LEFT, RIGHT, BLANK);
        LAYER Principal IS
            VSEP vs1 = (40, 65, TOP, hs1, BLANK);
            VSEP vs2 = ([50,60], 4, hs1, BOTTOM, BLANK);
            REGION Centre = (vs2, RIGHT, hs1, BOTTOM, ANY, NORMAL);
            REGION Marge = (LEFT, vs2, hs1, BOTTOM, TEXT, SMALL);
            ...
        END;
        LAYER Secondaire IS
            HSEP hs2 = ([10,220], 2, LEFT, RIGHT, BLANK) SUBST hs1;
            HSEP hs3 = ([20,240], 2, LEFT, RIGHT, BLANK) SUBST BOTTOM;
            REGION Figure = (LEFT, RIGHT, hs2, hs3, {TABLE, GRAPHICS});
        END;
    END;
    PAGE Impaire IS ... END;
END;


Evaluation of segmentation results

Segmentation is rarely perfect; it generates

  • undersegmentation: real components are merged
  • oversegmentation: a single component is split
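The two error types can be counted with a simple overlap test between ground-truth and detected regions. This is a hypothetical sketch, not one of the metrics referred to in the course: boxes are (left, top, right, bottom) tuples, any nonzero overlap counts as a match, and the function names are illustrative.

```python
def overlap(a, b):
    """Area of the intersection of two boxes (left, top, right, bottom)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def segmentation_errors(truth, detected):
    """Return (oversegmented, undersegmented) counts.

    A ground-truth box overlapped by several detected boxes was split;
    a detected box overlapping several ground-truth boxes merged them.
    """
    over = sum(1 for t in truth
               if sum(overlap(t, d) > 0 for d in detected) > 1)
    under = sum(1 for d in detected
                if sum(overlap(t, d) > 0 for t in truth) > 1)
    return over, under
```

A practical metric would also require a minimum overlap ratio before counting a match, so that boxes touching by a few pixels are not reported as errors.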

Special metrics have been developed to evaluate a segmentation result

Scientific contests were organized at ICDAR'03 and ICDAR'05


Conclusion

Segmentation is a crucial step in document analysis

Segmentation is almost solved for

  • printed documents with regular layout
  • form analysis

Results are rarely perfect

  • Contextual knowledge may improve the results
  • Advanced pattern recognition methods are required

Segmentation remains an open problem for uncontrolled handwriting and graphical documents


Component hierarchy