DEVELOPING AN ALGORITHM FOR ARABIC DOCUMENT IMAGE UNDERSTANDING
M. Sc. Proposal Seminar
Presented by: Ibrahim Amer, TA at Computer Science Dept.
Supervised by: Prof. Dr. Mostafa Gadal-Haqq, Dr. Salma Hamdy

1st Seminar Version 3



Page 2: 1st Seminar Version 3

Agenda

Introduction
Objectives
Problem Statement
Related Work
Proposed Work
Time Plan
References

Page 3: 1st Seminar Version 3

Introduction

Research on Arabic OCR started three decades ago, yet no reliable, fully automated Arabic document analysis system is commercially available.

Many open issues remain to be resolved; layout analysis and font size and style determination are among them.

Page 4: 1st Seminar Version 3


Introduction

A hierarchy of document processing

Page 5: 1st Seminar Version 3


Introduction

Layout Analysis

Page 6: 1st Seminar Version 3


Introduction

OCR

Page 7: 1st Seminar Version 3

Objective

Develop an algorithm for document image analysis of Arabic documents.

Algorithm functionality:

Layout analysis: extraction of text lines and their separation from images, halftones, or graphs.

Font style (multi-font) and size determination.

Page 8: 1st Seminar Version 3

Problem Statement

Given an image of an Arabic document that uses different font styles and sizes and also contains images and symbols, the algorithm should produce a computer-editable version of the document.

Arabic DIA System

Page 9: 1st Seminar Version 3

Motivation

A lot of work has been done on this research topic for other languages.

There are very few research works, of limited scope, for Arabic documents.

This is due to the cursive nature of Arabic script.

Page 10: 1st Seminar Version 3

Document layout analysis

Document layout analysis methods fall into three classes: top-down, bottom-up, and hybrid.

Page 11: 1st Seminar Version 3

Top-Down Approach

The recursive X-Y cut is a tree-based, top-down page segmentation technique that decomposes a document image recursively into a set of rectangular blocks.

The root of the tree represents the entire document page.

Top-Down
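The recursive decomposition described above can be sketched as follows. This is a minimal illustration, not the implementation used in the surveyed work: the function name `xy_cut`, the `min_gap` parameter, and the rule of splitting at the widest whitespace gap are all assumptions made for the example. The page is a list of rows of 0/1 values, with 1 meaning a black pixel.

```python
def xy_cut(img, min_gap=2):
    """Return leaf blocks as (top, bottom, left, right), bottom/right exclusive.

    Hypothetical sketch: each region is trimmed to its ink, then split at its
    widest empty row band or column band until no gap of min_gap remains.
    """

    def trim(t, b, l, r):
        # Shrink the region to the bounding box of its black pixels.
        rows = [y for y in range(t, b) if any(img[y][l:r])]
        if not rows:
            return None
        cols = [x for x in range(l, r)
                if any(img[y][x] for y in range(rows[0], rows[-1] + 1))]
        return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1

    def widest_gap(profile):
        # Longest run of zeros in a projection profile: (length, start).
        best_len, best_start, run_start = 0, None, None
        for i, v in enumerate(profile):
            if v == 0:
                if run_start is None:
                    run_start = i
                if i - run_start + 1 > best_len:
                    best_len, best_start = i - run_start + 1, run_start
            else:
                run_start = None
        return best_len, best_start

    def split(t, b, l, r, out):
        box = trim(t, b, l, r)
        if box is None:
            return
        t, b, l, r = box
        row_profile = [sum(img[y][l:r]) for y in range(t, b)]
        col_profile = [sum(img[y][x] for y in range(t, b)) for x in range(l, r)]
        h_len, h_start = widest_gap(row_profile)
        v_len, v_start = widest_gap(col_profile)
        if max(h_len, v_len) < min_gap:
            out.append((t, b, l, r))  # leaf of the X-Y tree
            return
        if h_len >= v_len:            # horizontal cut (split top / bottom)
            split(t, t + h_start, l, r, out)
            split(t + h_start + h_len, b, l, r, out)
        else:                         # vertical cut (split left / right)
            split(t, b, l, l + v_start, out)
            split(t, b, l + v_start + v_len, r, out)

    blocks = []
    split(0, len(img), 0, len(img[0]), blocks)  # root = the entire page
    return blocks
```

The first call to `split` corresponds to the root of the tree (the whole page); every recursive call creates a child node, and the returned list contains only the leaves.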

Page 12: 1st Seminar Version 3

Bottom-Up Approach

The bottom-up approach is the opposite of the top-down approach.

It starts with characters, then forms lines, blocks, columns, and finally the whole page.

Bottom-Up

Page 13: 1st Seminar Version 3

Preprocessing: Smearing

Smearing is used widely in many document structure analysis algorithms.

It is used to identify non-white connected regions.

It provides rectangular regions for the non-white regions (text or pictures).

These rectangular regions can be categorized as text or non-text using many approaches.
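The slides do not say how the rectangular regions are obtained from the smeared image. One common option, shown here only as an assumed sketch, is connected-component labeling followed by taking the bounding box of each component. The function name `component_boxes` and the choice of 8-connectivity are illustrative assumptions.

```python
from collections import deque


def component_boxes(img):
    """Bounding boxes (top, bottom, left, right), inclusive, of the
    8-connected black (1) regions of a binary image given as rows of 0/1."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if img[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Breadth-first flood fill from an unvisited black pixel.
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            top, bottom, left, right = y0, y0, x0, x0
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, bottom, left, right))
    return boxes
```

Each returned box could then be passed to a text/non-text classifier based on, for example, its size or pixel density.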

Page 14: 1st Seminar Version 3

Smearing

The run-length smearing algorithm (RLSA) works on binary images where white pixels are represented by 0’s and black pixels by 1’s.

The algorithm transforms a binary sequence x into y according to the following rules:

1. 0’s in x are changed to 1’s in y if the number of adjacent 0’s is less than or equal to a predefined threshold C.

2. 1’s in x are unchanged in y.
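The two rules can be written down directly. The sketch below is an illustration under the stated rules only; the function names `rlsa_row` and `rlsa_horizontal` are made up for this example, and applying the rule row by row is just one possible way to use it on a 2-D image.

```python
def rlsa_row(x, c):
    """Apply the RLSA rules to one binary sequence x with threshold c."""
    y = list(x)              # rule 2: 1's are copied unchanged
    n = len(x)
    i = 0
    while i < n:
        if x[i] == 0:
            j = i
            while j < n and x[j] == 0:   # find the end of this run of 0's
                j += 1
            if j - i <= c:               # rule 1: short runs are filled
                y[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return y


def rlsa_horizontal(img, c):
    """Smear every row of a binary image (rows of 0/1) with threshold c."""
    return [rlsa_row(row, c) for row in img]
```

With a threshold of C = 4, a run of four white pixels between two black pixels is merged into one black stretch, while a run of five is left white. This is how nearby characters get linked into word or line blobs.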

Page 15: 1st Seminar Version 3


Smearing

Page 16: 1st Seminar Version 3

Related Work

Page 17: 1st Seminar Version 3

1. Separation of Text and Non-text in Document Layout Analysis using a Recursive Filter

Step 1: Binarize the document image using the Sauvola algorithm.

Step 2: Perform connected component analysis and whitespace extraction.

Step 3: Apply the heuristic filter.

Step 4: Apply the recursive filter.

Step 5: Reshape regions and post-process.
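Step 1 can be illustrated with a small sketch of Sauvola thresholding, which computes a local threshold T = m * (1 + k * (s / R - 1)) from the mean m and standard deviation s of a window around each pixel. The window size and the values k = 0.5 and R = 128 are commonly used defaults for 8-bit images and are assumptions here, not settings taken from the paper. The function name is also hypothetical, and the naive window loop favors clarity over speed.

```python
import math


def sauvola_binarize(gray, window=15, k=0.5, r=128.0):
    """Binarize a grayscale image (rows of 0..255 values).

    Returns rows of 0/1, with 1 for pixels darker than the local
    Sauvola threshold (treated as ink)."""
    h, w = len(gray), len(gray[0])
    half = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the window around (y, x), clipped at the image border.
            vals = [gray[j][i]
                    for j in range(max(0, y - half), min(h, y + half + 1))
                    for i in range(max(0, x - half), min(w, x + half + 1))]
            mean = sum(vals) / len(vals)
            std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
            threshold = mean * (1 + k * (std / r - 1))
            out[y][x] = 1 if gray[y][x] < threshold else 0
    return out
```

The output uses the same convention as the smearing slides (1 for black, 0 for white), so it could feed directly into a connected-component step.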

Page 18: 1st Seminar Version 3

Results

Example of the proposed method on a Korean document: (a, d) input binary document, (b, e) text document, (c, f) non-text document.

Page 19: 1st Seminar Version 3

3. Layout Analysis for Arabic Historical Document Images Using Machine Learning

This paper introduces an approach that segments text appearing in page margins (a.k.a. side-note text) from manuscripts with complex layout formats.

Arabic historical document image with complex layout formatting due to side-note text.

Page 20: 1st Seminar Version 3

Layout Analysis for Arabic Historical Document Images Using Machine Learning

The method segments side-note text regions.

A machine learning approach is used, based on simple features extracted from the document using connected component analysis.

Learning and classification rely on the AutoMLP framework. The framework is self-tuning: it trains a small number of networks in parallel with different learning rates and different numbers of hidden units.

The whole process is repeated a few times, and finally the best network is selected as an optimally trained MLP classifier.

Page 21: 1st Seminar Version 3

System Architecture

Historical Document -> Preprocessing (binarization, noise removal) -> Connected Components -> Feature Extraction (normalized height, foreground area, relative distance, orientation) -> AutoMLP Classifier

Page 22: 1st Seminar Version 3


Results

(a) and (b) depict the segmentation of two samples before post-processing. (c) and (d) represent the final segmentation, respectively.

Page 23: 1st Seminar Version 3


4. Layout Analysis of Arabic Script Documents

This paper presents a layout analysis system for extracting text lines in reading order from scanned Arabic script document images written in different languages (Arabic, Urdu, Persian, etc.) and different styles (Naskh, Nastaliq, etc.).

The presented system is based on a suitable combination of different well established techniques for analyzing Latin script documents that have proven to be robust against different types of document image degradations.

Page 24: 1st Seminar Version 3

4. Layout Analysis of Arabic Script Documents

Steps can be summarized as follows:

Page 25: 1st Seminar Version 3

Text and Non-text Segmentation

The paper uses the Bloomberg method (the multiresolution morphology-based method).

Page 26: 1st Seminar Version 3

Text Line Detection

The paper uses a ridge-based text line finding algorithm designed for warped, camera-captured Latin script document images.

The method consists of two standard image processing techniques: (i) smoothing with an oriented anisotropic Gaussian filter bank and (ii) ridge detection.

Page 27: 1st Seminar Version 3

Text Line Reading Order Determination

A reading order determination method tries to find the order of text lines with respect to their corresponding reading flow.

The reading order can be determined by applying some ordering criteria over the positioning of the text lines.

The ordering criteria are stated as follows:

“Text-line ‘a’ comes before text-line ‘b’ if their ranges of x-coordinates overlap and if text-line ‘a’ is above text-line ‘b’ on the page.”

“Text-line ‘a’ comes before text-line ‘b’ if ‘a’ is entirely to the right of ‘b’ and if there does not exist a text-line ‘c’ whose y-coordinates are between ‘a’ and ‘b’ and whose range of x-coordinates overlaps both ‘a’ and ‘b’.”
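The two criteria define a partial order, which can be turned into a full reading order with a topological sort. The sketch below is one possible reading of the criteria, not the paper's code. It assumes that text lines are given as boxes (left, top, right, bottom) with y growing downward, that "above" and "between" are judged by vertical box centers, and that ties are broken by input position. The function name `reading_order` is hypothetical.

```python
def reading_order(lines):
    """Return indices of text-line boxes in right-to-left reading order."""
    n = len(lines)

    def x_overlap(a, b):
        return a[0] < b[2] and b[0] < a[2]

    def center_y(a):
        return (a[1] + a[3]) / 2

    def before(i, j):
        a, b = lines[i], lines[j]
        if x_overlap(a, b):
            # Criterion 1: overlapping x-ranges, upper line first.
            return center_y(a) < center_y(b)
        if a[0] >= b[2]:
            # Criterion 2: a is entirely to the right of b, unless some
            # line c lies vertically between them and spans both.
            lo, hi = sorted((center_y(a), center_y(b)))
            for k, c in enumerate(lines):
                if k in (i, j):
                    continue
                if (lo <= center_y(c) <= hi
                        and x_overlap(c, a) and x_overlap(c, b)):
                    return False
            return True
        return False

    # Build the precedence graph and sort it (Kahn's algorithm).
    succ = [[j for j in range(n) if j != i and before(i, j)] for i in range(n)]
    indeg = [0] * n
    for i in range(n):
        for j in succ[i]:
            indeg[j] += 1
    ready = [i for i in range(n) if indeg[i] == 0]
    order = []
    while ready:
        i = ready.pop(0)
        order.append(i)
        for j in succ[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                ready.append(j)
    if len(order) != n:
        raise ValueError("ordering criteria produced a cycle")
    return order
```

For a two-column page with a full-width line between the upper and lower halves, this yields: upper right, upper left, the full-width line, lower right, lower left.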

Page 28: 1st Seminar Version 3

Results

(a) and (d) show the documents after removing non-text blocks, (b) and (e) after text line detection, and (c) and (f) after reading order determination.

Page 29: 1st Seminar Version 3

Proposed Work

Document -> Preprocessing (binarization, noise removal, smearing, etc.) -> Layout Analysis -> Font Style Determination (using deep learning)

Layout Analysis: Text Line Detection (using a deep learning technique, CNN) and Reading Order Determination

Page 30: 1st Seminar Version 3

Deep Learning

Deep learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text.

Deep and wide: “deep” means many hidden layers and “wide” means many input/hidden nodes.

Deep learning has proved successful in many fields, including computer vision, speech recognition/translation, and even game playing.

Many large technology companies have focused on developing AI solutions for real-life problems using deep learning:

Facebook: the DeepFace method and an image captioning system released two days before this seminar.

Google: Google Brain, AlphaGo, and wild cat robots.

NVIDIA: self-driving cars.

Page 31: 1st Seminar Version 3

Conv Nets

A convolutional neural network (ConvNet) is a type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex.
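The local connectivity pattern can be illustrated with the core operation of a convolutional layer: each output value depends only on a small patch of the input, and the same kernel weights are reused at every position. The sketch below is a toy illustration with hypothetical function names, not part of the proposed system. As in most deep learning libraries, the "convolution" is computed without flipping the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def relu(feature_map):
    """Element-wise rectified linear activation."""
    return [[max(0, v) for v in row] for row in feature_map]
```

For example, a 2x2 kernel with -1 in its left column and +1 in its right column responds only where the image changes from white to black along a row, which is the kind of low-level feature an early ConvNet layer may learn.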

Page 32: 1st Seminar Version 3


Conv Nets

Google brain

Page 33: 1st Seminar Version 3

Proposed Work

We aim to develop an algorithm capable of identifying the textual regions of Arabic documents as well as the font styles of the characters’ shapes.

The first step is preprocessing, which requires noise removal and skew correction for the scanned document.

The next step is identifying and labeling textual regions and separating them from non-textual ones; a CNN can be used for this.

After the text regions are identified successfully, we will identify the font style of the Arabic characters.

Advanced step: determining the reading order of the document.

Page 34: 1st Seminar Version 3

Proposed Plan

The timeline is divided into five phases that reflect the work progress within the 2 years of the research period. Each phase is considered a milestone at which a report on the work progress should be submitted.

Phase I: Survey of the state of the art for document image analysis.

Phase II: Survey of related work for document image analysis.

Phase III: Developing and implementing the proposed algorithm for document image analysis.

Phase IV: Experimental studies and performance evaluation of the algorithm.

Phase V: Evaluation of the proposed model and comparative study with other implementations.

Page 35: 1st Seminar Version 3

References

[1]. Document Structure Analysis Algorithms: A Literature Survey, Song Mao, Azriel Rosenfeld, and Tapas Kanungo.

[2]. Separation of Text and Non-text in Document Layout Analysis using a Recursive Filter, Tuan-Anh Tran, In-Seop Na, Soo-Hyung Kim.

[3]. Layout Analysis for Arabic Historical Document Images Using Machine Learning, Syed Saqib Bukhari, Thomas M. Breuel, Technical University of Kaiserslautern, Germany.

[4]. Performance Comparison of Six Algorithms for Page Segmentation.

[5]. A Fast Algorithm for Bottom-Up Document Layout Analysis, IEEE, Anikó Simon, Jean-Christophe Pret, and A. Peter Johnson.

[6]. Hybrid Page Segmentation with Efficient Whitespace Rectangles Extraction and Grouping, Kai Chen, Fei Yin, Cheng-Lin Liu.

[7]. Layout Analysis of Arabic Script Documents, Syed Saqib Bukhari, Faisal Shafait, and Thomas M. Breuel.