1
DEVELOPING AN ALGORITHM FOR ARABIC DOCUMENT IMAGE UNDERSTANDING
M. Sc. Proposal Seminar
Presented by: Ibrahim Amer
TA at Computer Science Dept.
Supervised by: Prof. Dr. Mostafa Gadal-Haqq and Dr. Salma Hamdy
2
Agenda
Introduction
Objectives
Problem Statement
Related Work
Proposed Work
Time Plan
References
3
Introduction
Research on Arabic OCR started three decades ago, yet no reliable, fully automated Arabic document analysis system is commercially available.
Many open issues remain to be resolved; layout analysis and the handling of font size and style are among them.
4
Introduction
A hierarchy of document processing
5
Introduction
Layout Analysis
6
Introduction
OCR
7
Objective
Develop an algorithm for document image analysis of Arabic documents.
Algorithm functionality:
Layout analysis: extraction of text lines and their separation from images, halftones, or graphs.
Font (multi-font) style and size determination.
8
Problem Statement
Given an image of an Arabic document that contains different font styles and sizes as well as images and symbols, the algorithm should produce a computer-editable version of the document.
Arabic DIA System
9
Motivation
A lot of work has been done on this research topic for other languages.
There are very few research works, and only limited ones, for Arabic documents.
This is due to the cursive nature of Arabic script.
10
Document layout analysis
Document layout analysis methods fall into three classes: top-down, bottom-up, and hybrid.
11
Top-Down Approach
The recursive X-Y cut is a top-down page segmentation technique that decomposes a document image recursively into a set of rectangular blocks.
The X-Y cut segmentation algorithm is a tree-based top-down algorithm.
The root of the tree represents the entire document page.
Top-Down
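The recursive X-Y cut idea can be sketched in a few lines of Python. This is a minimal illustration, not the exact algorithm from the literature: it assumes a binary image stored as a list of rows (1 = ink, 0 = background), cuts at the widest empty band of rows first and then of columns, and returns only the leaf blocks of the tree. The names `xy_cut`, `_trim`, `_widest_gap` and the `min_gap` parameter are hypothetical.

```python
def _trim(profile):
    """Return (lo, hi) so that profile[lo:hi] has no leading or trailing zeros."""
    nz = [i for i, v in enumerate(profile) if v]
    return (nz[0], nz[-1] + 1) if nz else (0, 0)


def _widest_gap(profile):
    """Return (start, end) of the longest interior run of zeros, or None."""
    best, start = None, None
    for i, v in enumerate(profile):
        if v == 0:
            if start is None:
                start = i
        elif start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def xy_cut(img, top=0, left=0, bottom=None, right=None, min_gap=2):
    """Return leaf blocks as (top, left, bottom, right), end-exclusive."""
    bottom = len(img) if bottom is None else bottom
    right = len(img[0]) if right is None else right
    # Projection profile over rows, then shrink the block to its ink.
    rows = [sum(img[r][left:right]) for r in range(top, bottom)]
    lo, hi = _trim(rows)
    if lo == hi:
        return []  # block contains no ink
    rows, top, bottom = rows[lo:hi], top + lo, top + hi
    cols = [sum(img[r][c] for r in range(top, bottom)) for c in range(left, right)]
    lo, hi = _trim(cols)
    cols, left, right = cols[lo:hi], left + lo, left + hi
    # Try a horizontal cut (empty band of rows), then a vertical one.
    gap = _widest_gap(rows)
    if gap and gap[1] - gap[0] >= min_gap:
        return (xy_cut(img, top, left, top + gap[0], right, min_gap)
                + xy_cut(img, top + gap[1], left, bottom, right, min_gap))
    gap = _widest_gap(cols)
    if gap and gap[1] - gap[0] >= min_gap:
        return (xy_cut(img, top, left, bottom, left + gap[0], min_gap)
                + xy_cut(img, top, left + gap[1], bottom, right, min_gap))
    return [(top, left, bottom, right)]  # leaf: no gap wide enough to cut


page = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
]
blocks = xy_cut(page, min_gap=2)
```

On this toy page the first cut separates the two-column band from the full-width band below it, and a second cut separates the two columns.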
13
Bottom-Up Approach
The bottom-up approach is the opposite of the top-down approach.
It starts with characters, then forms lines, blocks, columns, and finally the whole page.
Bottom-Up
14
Preprocessing: Smearing
Smearing is widely used in document structure analysis algorithms.
It is used to identify non-white connected regions.
It provides rectangular regions for the non-white regions (text or pictures).
These rectangular regions can be categorized as text or non-text using many approaches.
15
Smearing
The run-length smearing algorithm (RLSA) works on binary images where white pixels are represented by 0’s and black pixels by 1’s.
The algorithm transforms a binary sequence x into y according to the following rules:
1. 0’s in x are changed to 1’s in y if the number of adjacent 0’s is less than or equal to a predefined threshold C.
2. 1’s in x are unchanged in y.
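The two rules above translate directly into code. The sketch below is an illustration under assumptions: images are lists of rows of 0/1 values, the function names `rlsa_1d` and `rlsa_2d` are hypothetical, and the 2D version combines a horizontal and a vertical pass with a logical AND, which is a common way of applying RLSA to a page but is not stated on the slide.

```python
def rlsa_1d(x, c):
    """Smear one binary sequence: runs of 0s of length <= c become 1s."""
    y = list(x)
    i, n = 0, len(x)
    while i < n:
        if x[i] == 0:
            j = i
            while j < n and x[j] == 0:
                j += 1              # j is one past the end of the run of 0s
            if j - i <= c:
                y[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1                  # 1s are copied unchanged
    return y


def rlsa_2d(img, c_h, c_v):
    """Smear rows with threshold c_h and columns with c_v, then AND the results."""
    horiz = [rlsa_1d(row, c_h) for row in img]
    smeared_cols = [rlsa_1d(list(col), c_v) for col in zip(*img)]
    vert = [list(row) for row in zip(*smeared_cols)]
    return [[a & b for a, b in zip(hr, vr)] for hr, vr in zip(horiz, vert)]


smeared = rlsa_1d([1, 0, 0, 1, 0, 0, 0, 1], 2)
```

With C = 2 the run of two 0’s is filled while the run of three 0’s is left as a gap, so nearby characters merge into one block but wider white space still separates blocks.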
16
Smearing
17
Related Work
18
1. Separation of Text and Non-text in Document Layout Analysis using a Recursive Filter
Step 1: Binarize the document image using the Sauvola algorithm.
Step 2: Connected component analysis and whitespace extraction.
Step 3: Apply the heuristic filter.
Step 4: Apply the recursive filter.
Step 5: Reshape regions and post-processing.
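Step 1 can be illustrated with a small sketch of Sauvola thresholding, in which each pixel is compared with a local threshold T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of the gray values in a window around the pixel. The code below is a naive, unoptimized version for small images; the function name `sauvola_binarize`, the list-of-rows image format, and the default values of `window`, `k`, and `R` are assumptions chosen for illustration, not the settings used in the paper.

```python
from math import sqrt


def sauvola_binarize(gray, window=15, k=0.5, R=128.0):
    """Binarize a grayscale image (rows of 0..255 values); 1 = ink, 0 = background."""
    h, w = len(gray), len(gray[0])
    half = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Gray values in the window, clipped at the image border.
            vals = [gray[j][i]
                    for j in range(max(0, y - half), min(h, y + half + 1))
                    for i in range(max(0, x - half), min(w, x + half + 1))]
            mean = sum(vals) / len(vals)
            std = sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
            threshold = mean * (1 + k * (std / R - 1))
            out[y][x] = 1 if gray[y][x] < threshold else 0
    return out


# A light page (200) with one dark vertical stroke (20) in the middle column.
gray_page = [[200, 200, 20, 200, 200] for _ in range(5)]
binary_page = sauvola_binarize(gray_page, window=3)
```

A practical implementation would compute the local mean and standard deviation with integral images rather than re-scanning every window.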
19
Results
Example of the proposed method on a Korean document: (a, d) input binary document, (b, e) text document, (c, f) non-text document.
22
3. Layout Analysis for Arabic Historical Document Images Using Machine Learning
This paper introduces an approach that segments text appearing in page margins (a.k.a. side-note text) from manuscripts with a complex layout format.
Arabic historical document image with complex layout formatting due to side-notes text.
23
Layout Analysis for Arabic Historical Document Images Using Machine Learning
The method tries to segment side-note text regions.
A machine learning approach is used, relying on simple features extracted from the document using connected component analysis.
Learning and classification rely on the AutoMLP framework. The framework adapts itself.
It trains a small number of networks in parallel with different learning rates and different numbers of hidden layers.
The whole process is repeated a few times, and finally the best network is selected as the optimally trained MLP classifier.
24
System Architecture
Historical Document
Preprocessing: binarization, noise removal
Connected Component
Feature Extraction: normalized height, foreground area, relative distance, orientation
AutoMLP Classifier
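The connected component and feature extraction stages can be sketched as follows. This is only an illustration under assumptions: the image is a list of rows with 1 = ink, components are 8-connected, and the function names are hypothetical. The slide does not define the features, so "normalized height" is assumed here to mean the component height divided by the mean component height, and "foreground area" the number of ink pixels; relative distance and orientation are left out.

```python
from collections import deque


def connected_components(img):
    """Label 8-connected ink regions; return a box and pixel count per component."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right, area = r, c, r, c, 0
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Box is (top, left, bottom, right), end-exclusive.
            comps.append({"box": (top, left, bottom + 1, right + 1), "area": area})
    return comps


def simple_features(comps):
    """Assumed feature definitions: height / mean height, and ink pixel count."""
    heights = [c["box"][2] - c["box"][0] for c in comps]
    mean_h = sum(heights) / len(heights)
    return [{"norm_height": hgt / mean_h, "area": c["area"]}
            for hgt, c in zip(heights, comps)]


img = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
comps = connected_components(img)
feats = simple_features(comps)
```

Feature vectors of this kind would then be passed to the classifier, one vector per component.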
25
Results
(a) and (b) depict the segmentation of two samples before post-processing. (c) and (d) represent the final segmentation, respectively.
26
4. Layout Analysis of Arabic Script Documents
This paper presents a layout analysis system for extracting text lines in reading order from scanned Arabic script document images written in different languages (Arabic, Urdu, Persian, etc.) and different styles (Naskh, Nastaliq, etc.).
The presented system is based on a suitable combination of different well established techniques for analyzing Latin script documents that have proven to be robust against different types of document image degradations.
27
4. Layout Analysis of Arabic Script Documents
Steps can be summarized as follows:
28
Text and Non-text Segmentation
The paper uses the Bloomberg method (the multiresolution morphology-based method).
29
Text Line Detection
The authors use a ridge-based text line finding algorithm designed for warped, Latin script camera-captured document images.
The method consists of two standard image processing techniques: (i) oriented anisotropic Gaussian filter bank smoothing and (ii) ridge detection.
30
Text Line Reading Order Determination
A reading order determination method tries to find the order of text lines with respect to their corresponding reading flow.
The reading order can be determined by applying some ordering criteria over the positioning of the text lines.
The ordering criteria are stated as follows:
“Text-line ‘a’ comes before text-line ‘b’ if their ranges of x-coordinates overlap and if text-line ‘a’ is above text-line ‘b’ on the page.”
“Text-line ‘a’ comes before text-line ‘b’ if ‘a’ is entirely to the right of ‘b’ and if there does not exist a text-line ‘c’ whose y-coordinates are between ‘a’ and ‘b’ and whose range of x-coordinates overlaps both ‘a’ and ‘b’.”
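The two ordering criteria define a partial order over text lines, which can be turned into a full reading order with a topological sort. The sketch below is one possible reading of the criteria, not the paper's implementation: text lines are assumed to be boxes (x0, y0, x1, y1) with y growing downward, "between" is interpreted as lying inside the vertical span covered by ‘a’ and ‘b’, and the function names are hypothetical.

```python
def x_overlap(a, b):
    """True if the x-ranges of boxes a and b overlap."""
    return a[0] < b[2] and b[0] < a[2]


def comes_before(i, j, lines):
    """Apply the two ordering criteria to lines[i] and lines[j]."""
    a, b = lines[i], lines[j]
    # Criterion 1: x-ranges overlap and a is above b.
    if x_overlap(a, b) and a[1] < b[1]:
        return True
    # Criterion 2: a is entirely to the right of b (right-to-left script)
    # and no line c lies vertically between them while overlapping both.
    if a[0] >= b[2]:
        lo, hi = min(a[1], b[1]), max(a[3], b[3])
        for k, c in enumerate(lines):
            if k in (i, j):
                continue
            if lo <= c[1] and c[3] <= hi and x_overlap(c, a) and x_overlap(c, b):
                return False
        return True
    return False


def reading_order(lines):
    """Return line indices in reading order (Kahn's topological sort)."""
    n = len(lines)
    succ = [[] for _ in range(n)]
    indeg = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and comes_before(i, j, lines):
                succ[i].append(j)
                indeg[j] += 1
    ready = [i for i in range(n) if indeg[i] == 0]
    order = []
    while ready:
        i = ready.pop(0)
        order.append(i)
        for j in succ[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                ready.append(j)
    return order


# A full-width heading above two columns, given in shuffled order.
lines = [
    (0, 40, 45, 50),    # 0: left column, second line
    (55, 20, 100, 30),  # 1: right column, first line
    (0, 0, 100, 10),    # 2: heading
    (0, 20, 45, 30),    # 3: left column, first line
    (55, 40, 100, 50),  # 4: right column, second line
]
order = reading_order(lines)
```

For this layout the result is the heading, then the right column from top to bottom, then the left column, which matches the right-to-left reading flow of Arabic script.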
31
Results
(a) and (d): document after removing non-text blocks; (b) and (e): document after text line detection; (c) and (f): reading order determination.
32
Proposed Work
Document
Preprocessing: binarization, noise removal, smearing, etc.
Layout Analysis
Text Line Detection: using a deep learning technique (CNN)
Font Style Determination: using deep learning
Reading Order Determination
33
Deep Learning
Deep learning is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text.
Deep and wide: “deep” means many hidden layers and “wide” means many inputs/hidden nodes.
Deep learning has proven successful in many fields, including computer vision, speech recognition/translation, and even game playing.
Many large computer companies have focused on developing AI solutions for real-life problems using deep learning, for example:
Facebook: the DeepFace method and the image captioning system released two days before this seminar.
Google: Google Brain, AlphaGo, and wild cats robots.
Nvidia: autonomous self-driving cars.
34
Conv Nets
A convolutional network is a type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex.
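The local connectivity pattern can be made concrete with the basic operation of a convolutional layer: each output value depends only on a small window of the input, and the same kernel weights are reused at every position. The sketch below is a plain-Python illustration with a hand-picked kernel; the function name `conv2d_valid` is hypothetical, and in a trained network the kernel values would be learned rather than fixed.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


# A kernel that responds to a dark-to-bright vertical edge.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge_kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, edge_kernel)
```

The response is non-zero only at the column where the edge lies, which is the kind of local feature that early layers of a convolutional network tend to pick up.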
35
Conv Nets
Google brain
36
Proposed Work
We aim to develop an algorithm capable of identifying textual regions of Arabic documents as well as the font styles of character shapes.
The first step is preprocessing, which requires noise removal and skew correction for the scanned document.
The next step is identifying and labeling textual regions and separating them from non-textual ones; a CNN can be used for this.
After text regions are identified successfully, we will identify the font style of the Arabic characters.
Advanced step: determining the reading order of the document.
37
Proposed Plan
The timeline is divided into five phases that reflect the work progress within the 2 years of the research period. Each phase is considered a milestone, at which a report on the work progress should be submitted.
Phase I: Survey of the state of the art for document image analysis.
Phase II: Survey of related work for document image analysis.
Phase III: Developing and implementing the proposed algorithm for document image analysis.
Phase IV: Experimental studies and performance evaluation of the algorithm.
Phase V: Evaluation of the proposed model and comparative study with other implementations.
38
References
[1] Document Structure Analysis Algorithms: A Literature Survey, Song Mao, Azriel Rosenfeld, and Tapas Kanungo.
[2] Separation of Text and Non-text in Document Layout Analysis using a Recursive Filter, Tuan-Anh Tran, In-Seop Na, Soo-Hyung Kim.
[3] Layout Analysis for Arabic Historical Document Images Using Machine Learning, Syed Saqib Bukhari, Thomas M. Breuel, Technical University of Kaiserslautern, Germany.
[4] Performance Comparison of Six Algorithms for Page Segmentation.
[5] A Fast Algorithm for Bottom-Up Document Layout Analysis, IEEE, Anikó Simon, Jean-Christophe Pret, and A. Peter Johnson.
[6] Separation of Text and Non-text in Document Layout Analysis using a Recursive Filter, Tuan-Anh Tran, In-Seop Na, Soo-Hyung Kim.
[7] Hybrid Page Segmentation with Efficient Whitespace Rectangles Extraction and Grouping, Kai Chen, Fei Yin, Cheng-Lin Liu.
[8] Layout Analysis of Arabic Script Documents, Syed Saqib Bukhari, Faisal Shafait, and Thomas M. Breuel.