1st Seminar Version 3

1

DEVELOPING AN ALGORITHM FOR ARABIC DOCUMENT IMAGE UNDERSTANDING

M. Sc. Proposal Seminar

Presented by: Ibrahim Amer

TA at Computer Science Dept.

Supervised by: Prof. Dr. Mostafa Gadal-Haqq and Dr. Salma Hamdy

2

Agenda

Introduction
Objectives
Problem Statement
Related Work
Proposed Work
Time Plan
References

3

Introduction

Arabic OCR research started three decades ago, yet no reliable, fully automated Arabic document analysis system is commercially available. Many open issues remain to be resolved; layout analysis and font size and style determination are some of them.

4

Introduction

A hierarchy of document processing

5

Introduction

Layout Analysis

6

Introduction

OCR

7

Objective

Develop an algorithm for document image analysis of Arabic documents. Algorithm functionality:

Layout analysis: extraction of text lines and their separation from images, halftones, or graphs.

Font (multi-font) style and size determination.

8

Problem Statement

Given an image of an Arabic document that contains text in different font styles and sizes, as well as images and symbols, the algorithm should produce a computer-editable version of the document.

Arabic DIA System

9

Motivation

A lot of work has been done on this research topic for other languages.

Very little research has addressed Arabic documents.

This is due to the cursive nature of Arabic script.

10

Document layout analysis

There are three classes for document layout analysis: top-down, bottom-up and hybrid methods.

11

Top-Down Approach

The recursive X-Y cut is a top-down page segmentation technique that decomposes a document image recursively into a set of rectangular blocks.

The X-Y cut segmentation algorithm is a tree-based top-down algorithm.

The root of the tree represents the entire document page.
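The recursive decomposition can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows (1 = black, 0 = white) and splits a block at the widest empty gap in its projection profile; the names `xy_cut`, `widest_gap`, `leaves` and the `min_gap` threshold are hypothetical choices for illustration, not part of any cited implementation.

```python
def widest_gap(profile, min_gap):
    """Return (start, end) of the longest run of zeros with length >= min_gap, else None."""
    best, start = None, None
    for i, value in enumerate(profile + [1]):  # sentinel 1 closes a trailing run
        if value == 0:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_gap and (best is None or i - start > best[1] - best[0]):
                best = (start, i)
            start = None
    return best


def xy_cut(img, top, bottom, left, right, min_gap=2):
    """Recursively split the block img[top:bottom][left:right] into a tree of boxes."""
    # Shrink the block to the bounding box of its black pixels.
    rows = [r for r in range(top, bottom) if any(img[r][left:right])]
    if not rows:
        return None  # the block is entirely white
    top, bottom = rows[0], rows[-1] + 1
    cols = [c for c in range(left, right) if any(img[r][c] for r in range(top, bottom))]
    left, right = cols[0], cols[-1] + 1

    node = {"box": (top, bottom, left, right), "children": []}

    # Projection profiles: black-pixel counts per row and per column.
    row_profile = [sum(img[r][left:right]) for r in range(top, bottom)]
    col_profile = [sum(img[r][c] for r in range(top, bottom)) for c in range(left, right)]

    # Try a horizontal cut first, then a vertical one.
    for profile, horizontal in ((row_profile, True), (col_profile, False)):
        gap = widest_gap(profile, min_gap)
        if gap is None:
            continue
        start, end = gap
        if horizontal:
            parts = [(top, top + start, left, right), (top + end, bottom, left, right)]
        else:
            parts = [(top, bottom, left, left + start), (top, bottom, left + end, right)]
        children = [xy_cut(img, *part, min_gap=min_gap) for part in parts]
        node["children"] = [child for child in children if child is not None]
        break
    return node


def leaves(node):
    """Collect the boxes of the leaf nodes (the final rectangular blocks)."""
    if not node["children"]:
        return [node["box"]]
    return [box for child in node["children"] for box in leaves(child)]


# Toy page: two small blocks side by side above one full-width block.
page = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
tree = xy_cut(page, 0, len(page), 0, len(page[0]))
print(leaves(tree))
```

The root node covers the whole page, and each leaf box is given as (top, bottom, left, right) with exclusive bottom and right bounds.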

Top-Down

13

Bottom-Up Approach

The bottom-up approach is the opposite of the top-down approach.

It starts with characters, then forms lines, blocks, columns, and finally the whole page.

Bottom-Up

14

Preprocessing: Smearing

Smearing is widely used in document structure analysis algorithms.

It is used to identify non-white connected regions.

It provides rectangular regions for the non-white regions (text or pictures).

These rectangular regions can then be categorized as text or non-text using a variety of approaches.

15

Smearing

The run-length smearing algorithm (RLSA) works on binary images where white pixels are represented by 0’s and black pixels by 1’s.

The algorithm transforms a binary sequence x into y according to the following rules:

(i) 0’s in x are changed to 1’s in y if the number of adjacent 0’s is less than or equal to a predefined threshold C.

(ii) 1’s in x are unchanged in y.
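The two rules can be sketched in a few lines. This is a minimal illustration that applies the rules literally, including to runs of 0's that touch the start or end of the sequence; the function names `rlsa_row` and `rlsa_horizontal` are hypothetical.

```python
def rlsa_row(x, c):
    """Smear one binary sequence x: runs of 0's no longer than c become 1's."""
    y = list(x)  # 1's are copied unchanged
    i, n = 0, len(x)
    while i < n:
        if x[i] == 0:
            j = i
            while j < n and x[j] == 0:  # find the end of this run of 0's
                j += 1
            if j - i <= c:              # short run: fill it with 1's
                y[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return y


def rlsa_horizontal(img, c):
    """Apply the smearing rule to every row of a binary image."""
    return [rlsa_row(row, c) for row in img]


x = [int(ch) for ch in "000100000101000010000000011000"]
y = rlsa_row(x, 4)
print("".join(str(v) for v in y))
```

With C = 4, the short gaps between black pixels are closed while the runs of five and eight 0's survive, so nearby characters merge into blocks and wider gaps remain as separators.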

16

Smearing

17

Related Work

18

1. Separation of Text and Non-text in Document Layout Analysis using a Recursive Filter

Step 1: Binarize the document image using Sauvola algorithm.

Step 2: Connected component analysis and whitespace extraction.

Step 3: Perform the heuristic filter.

Step 4: Perform the recursive filter.

Step 5: Reshape regions and post-processing.
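As an illustration of the connected component analysis in Step 2, the following minimal sketch labels the black regions of an already binarized image and returns their bounding boxes. It assumes 8-connectivity and a list-of-rows image with 1 = black; it is a generic breadth-first labeling and not the implementation used in the paper, and the name `connected_components` is hypothetical.

```python
from collections import deque


def connected_components(img):
    """Return bounding boxes (top, left, bottom, right) of 8-connected black regions."""
    height, width = len(img), len(img[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r0 in range(height):
        for c0 in range(width):
            if img[r0][c0] != 1 or seen[r0][c0]:
                continue
            # Breadth-first search from an unvisited black pixel.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            top, left, bottom, right = r0, c0, r0, c0
            while queue:
                r, c = queue.popleft()
                top, bottom = min(top, r), max(bottom, r)
                left, right = min(left, c), max(right, c)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < height and 0 <= nc < width
                                and img[nr][nc] == 1 and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes


binary = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
print(connected_components(binary))
```

The box coordinates are inclusive. Simple properties of these boxes (height, width, area) are the kind of input a later text/non-text filter could work on.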

19

Results

Example of the proposed method on a Korean document: (a, d) input binary document, (b, e) text document, (c, f) non-text document.

22

3. Layout Analysis for Arabic Historical Document Images Using Machine Learning

This paper introduces an approach that segments text appearing in page margins (a.k.a. side-note text) from manuscripts with complex layout formats.

Arabic historical document image with complex layout formatting due to side-notes text.

23

Layout Analysis for Arabic Historical Document Images Using Machine Learning

This method segments side-note textual regions. The paper uses a machine learning approach that depends on simple features extracted from the document using connected component analysis.

Learning and classification rely on the AutoMLP framework. The framework adapts itself:

It trains a small number of networks in parallel with different learning rates and different numbers of hidden layers.

The whole process is repeated a few times, and finally the best network is selected as an optimally trained MLP classifier.

24

System Architecture

Historical Document → Preprocessing (binarization, noise removal) → Connected Component → Feature Extraction (normalized height, foreground area, relative distance, orientation) → AutoMLP Classifier

25

Results

(a) and (b) depict the segmentation of two samples before post-processing. (c) and (d) represent the final segmentation, respectively.

26

4. Layout Analysis of Arabic Script Documents

This paper presents a layout analysis system for extracting text lines in reading order from scanned Arabic script document images written in different languages (Arabic, Urdu, Persian, etc.) and different styles (Naskh, Nastaliq, etc.).

The presented system is based on a suitable combination of different well established techniques for analyzing Latin script documents that have proven to be robust against different types of document image degradations.

27

4. Layout Analysis of Arabic Script Documents

Steps can be summarized as follows:

28

Text and Non-text Segmentation

The paper uses Bloomberg's method (the multiresolution morphology-based method).

29

Text Line Detection

The paper uses a ridge-based text line finding algorithm for warped, camera-captured Latin script document images.

The method consists of two standard image processing techniques: (i) oriented anisotropic Gaussian filter bank smoothing and (ii) ridge detection.

30

Text Line Reading Order Determination

A reading order determination method tries to find the order of text lines with respect to their corresponding reading flow.

The reading order can be determined by applying ordering criteria over the positions of the text lines.

The ordering criteria are stated as follows:

(i) “Text-line ‘a’ comes before text-line ‘b’ if their ranges of x-coordinates overlap and if text-line ‘a’ is above text-line ‘b’ on the page”.

(ii) “Text-line ‘a’ comes before text-line ‘b’ if ‘a’ is entirely to the right of ‘b’ and if there does not exist a text-line ‘c’ whose y-coordinates are between ‘a’ and ‘b’ and whose range of x-coordinates overlaps both ‘a’ and ‘b’”.
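The two ordering criteria define a partial order over text lines, which can then be turned into a reading sequence with a topological sort. The sketch below is one possible reading of the criteria, not the paper's code: it assumes axis-aligned line boxes with y growing downward, interprets "between" using the top coordinate of each line, and uses hypothetical names (`Line`, `before`, `reading_order`).

```python
from typing import NamedTuple


class Line(NamedTuple):
    name: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge


def x_overlap(a, b):
    """True if the x-ranges of two lines overlap."""
    return a.x0 < b.x1 and b.x0 < a.x1


def before(a, b, lines):
    """True if line a must be read before line b under the two criteria."""
    # Criterion 1: x-ranges overlap and a is above b.
    if x_overlap(a, b) and a.y0 < b.y0:
        return True
    # Criterion 2: a is entirely to the right of b, and no line c lies
    # vertically between them while overlapping both in x.
    if a.x0 >= b.x1:
        low, high = min(a.y0, b.y0), max(a.y0, b.y0)
        for c in lines:
            if c is a or c is b:
                continue
            if low < c.y0 < high and x_overlap(c, a) and x_overlap(c, b):
                return False
        return True
    return False


def reading_order(lines):
    """Topologically sort the lines by the pairwise 'before' relation."""
    n = len(lines)
    successors = {i: [] for i in range(n)}
    indegree = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and before(lines[i], lines[j], lines):
                successors[i].append(j)
                indegree[j] += 1
    ready = [i for i in range(n) if indegree[i] == 0]
    order = []
    while ready:
        i = ready.pop(0)
        order.append(i)
        for j in successors[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                ready.append(j)
    # Any lines left in a cycle are appended in their input order.
    order += [i for i in range(n) if i not in order]
    return [lines[i] for i in order]


# Two-column page with a full-width title; the right column is read first.
page_lines = [
    Line("left-2", 0, 40, 45, 50),
    Line("right-1", 55, 20, 100, 30),
    Line("title", 0, 0, 100, 10),
    Line("left-1", 0, 20, 45, 30),
    Line("right-2", 55, 40, 100, 50),
]
print([line.name for line in reading_order(page_lines)])
```

On this toy page the title comes first, followed by the right column from top to bottom and then the left column, which matches right-to-left reading of a two-column Arabic layout.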

31

Results

(a) and (d): document after removing non-text blocks; (b) and (e): document after text line detection; (c) and (f): reading order determination.

32

Proposed Work

Document

Preprocessing: binarization, noise removal, smearing, etc.

Layout Analysis

Text Line Detection: using a deep learning technique (CNN)

Font Style Determination: using deep learning

Reading Order Determination

33

Deep Learning

Deep learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text.

Deep and wide: “deep” means many hidden layers and “wide” means many inputs/hidden nodes.

Deep learning has proved successful in many fields, including computer vision, speech recognition/translation, and even game playing.

Many large technology companies have focused on developing AI solutions for real-life problems using deep learning:

Facebook: the DeepFace method and the image captioning system released two days ago.

Google: Google Brain, AlphaGo, and WildCat robots.

Nvidia: autonomous self-driving cars.

34

Conv Nets

A convolutional neural network (ConvNet) is a type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex.

https://en.wikipedia.org/wiki/Feedforward_neural_network

https://en.wikipedia.org/wiki/Artificial_neural_network

https://en.wikipedia.org/wiki/Artificial_neuron

https://en.wikipedia.org/wiki/Visual_cortex
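The local connectivity of a ConvNet can be illustrated with the core operation of one convolutional layer. This is a minimal sketch in plain Python, assuming a single-channel image, a single hand-picked kernel, no padding, and stride 1; in a trained network the kernel weights are learned, and the names `conv2d` and `relu` are illustrative.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Element-wise rectified linear activation."""
    return [[max(0, v) for v in row] for row in feature_map]


# A 4x4 image with a vertical edge, and a 2x2 kernel that responds
# to a dark-to-bright transition from left to right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = relu(conv2d(image, kernel))
print(feature_map)
```

Each output value depends only on a small window of the input, and the same kernel is reused at every position; the feature map is non-zero only where the window straddles the edge.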

35

Conv Nets

Google brain

36

Proposed Work

We aim to develop an algorithm capable of identifying the textual regions of Arabic documents as well as the font styles of the characters’ shapes.

The first step is preprocessing, which requires noise removal and skew correction for the scanned document.

The next step is identifying and labeling textual regions and separating them from non-textual ones; a CNN can be used for this.

After the text regions have been identified successfully, the font style of the Arabic characters will be identified.

Advanced step: determining the reading order of the document.

37

Proposed Plan

The timeline is divided into 5 phases that reflect the work progress within the 2 years of the research period. Each phase is considered a milestone at which a report with information about the work progress should be submitted.

Phase I: Survey of the state of the art for document image analysis.

Phase II: Survey of related work for document image analysis.

Phase III: Developing and implementing the proposed algorithm for document image analysis.

Phase IV: Experimental studies and performance evaluation of the algorithm.

Phase V: Evaluation of the proposed model and comparative study with other implementations.

38

References

[1]. Document Structure Analysis Algorithms: A Literature Survey, Song Mao, Azriel Rosenfeld, and Tapas Kanungo.

[2]. Separation of Text and Non-text in Document Layout Analysis using a Recursive Filter, Tuan-Anh Tran, In-Seop Na, and Soo-Hyung Kim.

[3]. Layout Analysis for Arabic Historical Document Images Using Machine Learning, Syed Saqib Bukhari and Thomas M. Breuel, Technical University of Kaiserslautern, Germany.

[4]. Performance Comparison of Six Algorithms for Page Segmentation.

[5]. A Fast Algorithm for Bottom-Up Document Layout Analysis, Anikó Simon, Jean-Christophe Pret, and A. Peter Johnson, IEEE.

[6]. Hybrid Page Segmentation with Efficient Whitespace Rectangles Extraction and Grouping, Kai Chen, Fei Yin, and Cheng-Lin Liu.

[7]. Layout Analysis of Arabic Script Documents, Syed Saqib Bukhari, Faisal Shafait, and Thomas M. Breuel.