Survey of Text Line Segmentation Methods of Historical Documents
Article written by Laurence Likforman-Sulem, Abderrazak Zahour, Bruno Taconet (2006)


• Presenting: Erez Lefel and Koby Israel

1. Introduction

Text line extraction is generally seen as a preprocessing step for tasks such as:
• Document structure extraction
• Printed character or handwriting recognition

Text line extraction is most common in ancient and historical documents – printed or handwritten.

Introduction cont.

2. Characteristics and representation of text lines

Some definitions:
• Baseline: fictitious line which follows and joins the lower part of the character bodies in a text line.
• Median line: fictitious line which follows and joins the upper part of the character bodies in a text line.
• Upper line: fictitious line which joins the top of ascenders.
• Lower line: fictitious line which joins the bottom of descenders.

Characteristics and representation of text lines cont.

Overlapping components: components which are descenders and ascenders located in the region of an adjacent line

Touching components: ascenders and descenders belonging to consecutive lines which are connected to each other.


Text line segmentation: labeling process which consists in assigning the same label to spatially aligned units


Influence of author style:
• Baseline fluctuation: the baseline may vary due to writer movement. It may be straight, straight by segments, or curved.
• Line spacing: widely spaced lines are easy to find; problems start when the line spacing is very small, if it exists at all.
• Insertions: words or short text lines may appear between the principal text lines or in the margins.


Influence of poor image quality:
• Imperfect preprocessing: smudges, the presence of seeping ink or variable background intensity make image preprocessing difficult and produce binarization errors.
• Stroke fragmentation and merging: dots and broken strokes due to low-quality images and/or binarization may produce many connected components.


Three main axes of document complexity for text line segmentation

3. Text Line Segmentation

3.1 Preprocessing

In an ideal situation, text line extraction would be performed on a clean document: without background noise or non-textual elements, with well-contrasted writing and as little fragmentation as possible.

In reality, preprocessing is often necessary.
• The preprocessing methods have to be tailored to each document.
• Background noise and non-textual elements have to be removed before using any text line extraction method.

Preprocessing –cont.

Non-textual elements such as:
• book bindings
• book sides
• thumb marks from someone holding the book open
can be removed based on criteria such as position and intensity level.


Other non-textual elements such as:
• Stamps
• Seals
• Ornamentation (קישוט, עיטור)
• Decorated initials
can be removed using knowledge about the shape, the color or the position of these elements.


Extracting text from figures can also be performed using texture or morphological filters

Linear graphical elements such as big crosses (called “St Andre’s crosses”) appear in some of Flaubert’s manuscripts.

Removing these elements is performed through a GUI using Kalman filtering.

Kalman filter (from Wikipedia, the free encyclopedia)

In statistics, the Kalman filter is a mathematical method named after Rudolf E. Kalman. Its purpose is to use measurements that are observed over time that contain noise (random variations) and other inaccuracies, and produce values that tend to be closer to the true values of the measurements and their associated calculated values. The Kalman filter has many applications in technology, and is an essential part of the development of space and military technology.
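To make the definition above concrete, here is a minimal sketch of a scalar Kalman filter in Python. It assumes the true value is (nearly) constant and that each measurement carries independent noise; the function name `kalman_1d` and the parameters `process_var` and `measurement_var` are illustrative choices, and this is not the interactive cross-removal procedure used on the Flaubert manuscripts.

```python
def kalman_1d(measurements, process_var=1e-3, measurement_var=0.25):
    """Scalar Kalman filter for a nearly constant value observed with noise."""
    estimate = measurements[0]   # initial state estimate (assumption: first reading)
    error_var = 1.0              # initial uncertainty of the estimate (assumption)
    estimates = []
    for z in measurements:
        # Predict: the state is assumed constant, uncertainty grows slightly.
        error_var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= (1.0 - gain)
        estimates.append(estimate)
    return estimates


# Noisy readings alternating around a true value of 5.0.
readings = [5.0 + (0.5 if i % 2 == 0 else -0.5) for i in range(50)]
filtered = kalman_1d(readings)
print(round(filtered[-1], 2))
```

The filtered values settle much closer to 5.0 than the raw readings, which is the "closer to the true values" behaviour described above.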


Textual but unwanted elements such as bleed-through text can be removed by:
• Filtering
• Combining the back side image with the front side image


Binarization using global thresholding usually does not work with historical documents, because the background is not uniform.

Binarization using local thresholding determines the threshold value from the local properties of the image, e.g. pixel by pixel or region by region.
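The difference between the two approaches can be sketched in a few lines of Python. This is a toy illustration, assuming a grey-level image stored as a list of rows with dark ink on a brighter background; the local rule used here (pixel darker than its neighborhood mean minus an `offset`) is only one simple choice among many, and the function names are hypothetical.

```python
def global_threshold(image, threshold):
    """Mark a pixel as ink (1) when it is darker than one page-wide threshold."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


def local_mean_threshold(image, radius=1, offset=10):
    """Mark a pixel as ink (1) when it is darker than its neighborhood mean minus offset."""
    height, width = len(image), len(image[0])
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            local_mean = sum(window) / len(window)
            result[y][x] = 1 if image[y][x] < local_mean - offset else 0
    return result


# Background fades from 200 to 110; two vertical strokes sit at columns 2 and 7.
row = [200, 190, 120, 170, 160, 150, 140, 70, 120, 110]
image = [row[:] for _ in range(3)]
print(global_threshold(image, 115)[1])   # [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
print(local_mean_threshold(image)[1])    # [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
```

With the non-uniform background, the global threshold misses the stroke in the bright area and marks dark background as ink, while the local rule recovers both strokes.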


Writing may be faint so that over-segmentation or under-segmentation may occur.

3.2 Projection-based methods

Projection-profiles are commonly used for printed document segmentation.

The vertical projection-profile is obtained by summing pixel values along the horizontal axis for each y value.

The gaps between the text lines in the vertical direction can be observed.

$\mathrm{Profile}(y) = \sum_{x} f(x, y)$, where $f(x, y)$ is the pixel value at column $x$ and row $y$.
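As a small illustration, the profile can be computed in Python as follows. The sketch assumes a binarized image stored as a list of rows, with 1 for ink and 0 for background; the function name is illustrative.

```python
def projection_profile(binary_image):
    """Return Profile(y): the number of ink pixels (value 1) in each row y."""
    return [sum(row) for row in binary_image]


page = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],   # first text line
    [1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],   # gap between the two lines
    [0, 1, 1, 1, 0],   # second text line
    [1, 1, 0, 1, 1],
]
print(projection_profile(page))  # [0, 4, 3, 0, 3, 4]
```

The zero at row 3 is the gap between the two text lines.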

Projection–based methods – cont.

The vertical profile is not sensitive to writing fragmentation.

Other ways of obtaining a profile curve:
• Counting connected components
• Projecting black/white transitions



The profile curve can be smoothed by a median filter or a Gaussian filter to eliminate spurious local maxima.

The profile curve is then analyzed to find its maxima and minima

Cuts are made at significant minima.
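The smoothing and cutting steps can be sketched as follows in Python. This is a simplified illustration: it uses a moving median for smoothing and treats a minimum as "significant" when the profile stays at or below a hypothetical `max_ink` level between two text lines. Real systems use more elaborate criteria.

```python
def smooth(profile, radius=1):
    """Moving-median smoothing of a profile curve."""
    smoothed = []
    for y in range(len(profile)):
        window = sorted(profile[max(0, y - radius):y + radius + 1])
        smoothed.append(window[len(window) // 2])
    return smoothed


def find_cuts(profile, max_ink=0):
    """Return one cut row per valley (run of rows with profile <= max_ink)
    that lies between two text lines; page margins are ignored."""
    cuts, start = [], None
    for y, value in enumerate(profile):
        if value <= max_ink:
            if start is None:
                start = y
        else:
            if start is not None and start > 0:
                cuts.append((start + y - 1) // 2)   # middle row of the valley
            start = None
    return cuts


# Two text lines separated by a gap that contains one noisy row (value 1).
profile = [0, 5, 7, 6, 0, 1, 0, 0, 6, 8, 5, 0]
print(find_cuts(profile))          # [4, 6]
print(find_cuts(smooth(profile)))  # [6]
```

Without smoothing, the small local maximum inside the gap splits it into two valleys and produces two cuts; after smoothing, a single cut remains.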


Drawbacks:
• Short lines will produce low peaks
• Narrow lines will not produce significant peaks
• In its naïve form, the method cannot handle skew in the text


In Shapiro’s work, the global orientation (skew angle) of a handwritten page is first estimated by applying a Hough transform to the entire image. Once this skew angle is obtained, projections are computed along this angle.

3.3 Smearing methods

For printed and binarized documents, smearing methods can be applied.

Consecutive black pixels along the horizontal direction are smeared: the white space between them is filled with black pixels if their distance is within a predefined threshold.

The bounding boxes of the connected components in the smeared image enclose text lines.
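The horizontal smearing step can be sketched as follows in Python, assuming a binarized image stored as a list of rows (1 for black, 0 for white). The function names and the `max_gap` parameter are illustrative, and the connected component and bounding box extraction that follows smearing is not shown.

```python
def smear_row(row, max_gap):
    """Fill white runs of length <= max_gap that lie between two black pixels."""
    smeared = list(row)
    last_black = None
    for x, pixel in enumerate(row):
        if pixel == 1:
            if last_black is not None and 0 < x - last_black - 1 <= max_gap:
                for i in range(last_black + 1, x):
                    smeared[i] = 1
            last_black = x
    return smeared


def smear_horizontally(binary_image, max_gap):
    """Apply the smearing rule to every row of the image."""
    return [smear_row(row, max_gap) for row in binary_image]


print(smear_row([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], max_gap=2))
# [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
```

The gap of two white pixels is filled, while the gap of four pixels (wider than the threshold) and the trailing white pixel are left untouched.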

Smearing methods – cont.

3.4 Grouping methods

These methods build alignments by aggregating units in a bottom-up strategy.

The units may be pixels or higher level, such as connected components, blocks etc.

Units are joined together to form alignments.

The joining scheme relies on both local and global criteria.

Grouping methods – cont.

Every method has to address the following:
• Initiating alignments: one or several seeds for each alignment.
• Defining a unit’s neighborhood for reaching the next unit (it is generally a rectangular or angular area).
• Solving conflicts: as one unit may belong to several alignments under construction, a choice has to be made: discard one alignment or keep both alignments.


Defining a unit’s neighborhood for reaching the next unit:


Contrary to printed documents, a simple nearest-neighbor joining scheme would often fail to group complex handwritten units, as the nearest neighbor often belongs to another line.

Grouping methods cont.

When a conflict occurs, a choice has to be made:
• The decision can be based on the quality measures of the competing alignments.
• The decision can be based on comparing the quality measures of the competing units in the neighborhood at the next iteration.

Quality measures generally include the strength of the alignment (number of units included)

Other quality elements may concern component size, component spacing etc.
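The three ingredients (seeds, neighborhood, conflict resolution) can be illustrated with a deliberately simple Python sketch. It assumes the units are connected component centroids given as (x, y) pairs, uses a rectangular neighborhood to the right of the last unit of each alignment, and resolves conflicts by alignment strength (number of units). All names and thresholds are hypothetical, and this is not any specific published method.

```python
def group_units(centroids, max_dx=30, max_dy=5):
    """Greedy left-to-right grouping of centroids into alignments."""
    alignments = []
    for x, y in sorted(centroids):
        # Neighborhood test: the unit must fall in a rectangle to the
        # right of the last unit of an alignment under construction.
        candidates = [a for a in alignments
                      if 0 <= x - a[-1][0] <= max_dx and abs(y - a[-1][1]) <= max_dy]
        if candidates:
            # Conflict resolution: prefer the strongest alignment (most units).
            max(candidates, key=len).append((x, y))
        else:
            # No alignment reaches this unit: it becomes a new seed.
            alignments.append([(x, y)])
    return alignments


units = [(0, 10), (20, 11), (40, 9), (60, 10),   # upper text line
         (5, 30), (25, 31), (45, 29)]            # lower text line
for alignment in group_units(units):
    print(alignment)
# [(0, 10), (20, 11), (40, 9), (60, 10)]
# [(5, 30), (25, 31), (45, 29)]
```

On dense handwriting with small line spacing, such a fixed rectangle would often capture units from adjacent lines, which is why richer quality measures are needed.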


Example of text lines extracted on church registers


Likforman-Sulem and Faure have developed an iterative method based on perceptual grouping for forming alignments, which has been applied to handwritten pages.

Anchors are detected by selecting connected components elongated in specific directions (0°, 45°, 90°, 135°).

Each of these anchors becomes the seed of an alignment. First, each anchor, then each alignment, is extended to the left and to the right according to given rules.

A penalty is given when the alignment includes anchors of different directions.


3.5 Methods based on the Hough transform

The Hough transform is a very popular technique for finding straight lines in images

This method can extract oriented text lines and sloped annotations under the assumption that such lines are almost straight

Methods based on the Hough transform – cont.

The centroids of the connected components are the units for the Hough transform.

A set of aligned units in the image along a line with parameters (ρ, θ) is included in the corresponding cell (ρ, θ) of the Hough domain
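The voting scheme can be sketched as follows in Python. The sketch assumes the normal parameterization ρ = x·cos θ + y·sin θ (so a horizontal line has θ = 90°), centroids given as (x, y) pairs, and a coarse search over near-horizontal angles only. Function names, the cell size `rho_step` and the angle range are illustrative assumptions.

```python
import math
from collections import defaultdict


def hough_cells(centroids, rho_step=1.0, theta_degrees=range(80, 101)):
    """Vote each centroid into the (rho, theta) cells of the lines passing through it."""
    cells = defaultdict(list)
    for x, y in centroids:
        for theta_deg in theta_degrees:
            theta = math.radians(theta_deg)
            rho = x * math.cos(theta) + y * math.sin(theta)
            cells[(round(rho / rho_step), theta_deg)].append((x, y))
    return cells


def strongest_alignment(centroids, rho_step=1.0):
    """Return (rho, theta in degrees, units) for the cell holding the most units."""
    cells = hough_cells(centroids, rho_step)
    (rho_bin, theta_deg), units = max(cells.items(), key=lambda item: len(item[1]))
    return rho_bin * rho_step, theta_deg, units


# Eleven centroids on the horizontal line y = 10, plus two stray components.
centroids = [(x, 10) for x in range(0, 201, 20)] + [(5, 40), (130, 70)]
rho, theta_deg, units = strongest_alignment(centroids)
print(rho, theta_deg, len(units))  # 10.0 90 11
```

The aligned centroids all vote for the same cell, so that cell stands out in the Hough domain while the stray components do not contribute to it.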

Methods based on the Hough transform – cont.

3.6 Repulsive-Attractive network method

This method is based on Repulsive-Attractive forces.

The method works directly on grey-level images and consists of iteratively adapting the y-position of a predefined number of baseline units.

This method has been applied to ancient Ottoman document archives and Latin texts.

Repulsive-Attractive network method: how does it work?

Baselines are constructed one by one from the top of the image to bottom.

Pixels of the image act as attractive forces for baselines.

Already extracted baselines act as repulsive forces.

The baseline to be extracted is initialized just under the previously examined one, in order to be repelled by it and attracted by the pixels of the line below.

The lines must have similar lengths.

The result is a set of baselines, each one passing through word bodies.

Repulsive-Attractive network method cont.

Pseudo baselines extracted by a Repulsive-Attractive network on Ancient Ottoman text

3.7 Processing of overlapping and touching components

Overlapping and touching components are the main challenges for text line extraction, since no white space is left between lines.

Processing of overlapping and touching components cont.

Detection of ambiguous components can be done in several ways:
• Component size.
• The component belongs to several alignments.
• The component belongs to no alignment.


Once a component is detected as ambiguous, it must be classified into one of two categories:
• The component is an overlapping component (it belongs to the upper or lower alignment)
• The component is a touching component

In grouping methods (seen in 3.4) it is common to use the component ambiguity attribute to decide whether or not to add the component to the group.



In Likforman-Sulem’s method, touching and overlapping components are detected after the text line extraction process described in 3.5 (Methods based on the Hough transform).

These components are those which are intersected by at least two different lines (ρ, θ) corresponding to primary cells of validated alignments.


Zahour’s method for detecting touching and overlapping components:
• Cut the text into 8 columns.
• Compute a projection profile on each column.
• In each histogram, two consecutive minima delimit a text block.
• Classify the text blocks into 3 categories (small, average, big) using the k-means algorithm.
• Overlapping components necessarily belong to big physical blocks.
• Use the average text block of the average and small groups to decide into how many pieces each big text block should be cut.
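The first steps of this procedure can be sketched as follows in Python. This is a simplified illustration, not the published implementation: it assumes a binarized image stored as a list of rows, treats empty rows as the minima that delimit blocks, and uses a tiny one-dimensional k-means on block heights with evenly spread initial centroids. All names are hypothetical, and the final splitting of big blocks is not shown.

```python
def block_heights(binary_image, n_columns=8):
    """Heights of the text blocks found in each vertical strip of the image."""
    width = len(binary_image[0])
    heights = []
    for c in range(n_columns):
        left, right = c * width // n_columns, (c + 1) * width // n_columns
        profile = [sum(row[left:right]) for row in binary_image]
        run = 0
        for value in profile + [0]:   # trailing 0 closes a block touching the bottom
            if value > 0:
                run += 1
            elif run:
                heights.append(run)
                run = 0
    return heights


def classify_blocks(heights, iterations=20):
    """Label each block height as 'small', 'average' or 'big' with 1-D k-means (k = 3)."""
    low, high = min(heights), max(heights)
    centroids = [low, (low + high) / 2, high]
    labels = [0] * len(heights)
    for _ in range(iterations):
        labels = [min(range(3), key=lambda c: abs(h - centroids[c])) for h in heights]
        for c in range(3):
            members = [h for h, label in zip(heights, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    names = ["small", "average", "big"]
    return [names[label] for label in labels]


print(classify_blocks([4, 5, 20, 22, 21, 19, 44, 61]))
# ['small', 'small', 'average', 'average', 'average', 'average', 'big', 'big']
```

In this toy example the blocks of height 44 and 61 fall in the big class, which is where overlapping or touching components would be looked for.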

3.8 Non Latin documents

The inter-line space in Latin documents is filled with single dots, ascenders and descenders.

The Arabic script is connected and cursive.

Ancient Arabic documents include diacritical points (ث ك).

Ancient Hebrew documents can include decorated words.

Non Latin documents- cont.

In the alphabets of some Indian scripts, many basic characters have a horizontal line (the head line) in the upper part.

3.8.1 Ancient Arabic documents

The writing in these documents is very dense, and the line spacing is quite small.

The method developed in Zahour et al. begins with the detection of overlapping and touching components presented in 3.7

3.8.2 Ancient Hebrew documents

The manuscripts studied in Likforman-Sulem et al. are written in Hebrew, using “Dfus” letters, as most characters are made of horizontal and vertical strokes.

The Scrolls, intended to be used in the synagogue, do not include diacritics, so there is no separation between words or sentences.

3.8.2 Ancient Hebrew documents –cont.

Cases of overlapping components occur with characters such as Lamed (ל), Kaf (כ), and the final letters (ך, ן, ף, ץ).

Since the majority of characters are composed of one connected component, it is more convenient to perform text line segmentation from connected component units (see 3.5, Hough transform).

Summary

Analysis of historical document images is a relatively new domain.

These methods have been developed within several projects which perform transcript mapping, authentication, word mapping or word recognition.

As the need for recognition and mapping of handwritten material increases, text line segmentation will be used more and more.

Summary – cont.

Contrary to modern printed documents, a historical document has unique characteristics due to style, artistic effect and writer skills.

There is no universal segmentation method which can fit all these documents.

Questions?