Text Extraction using Document Structure Features and Support Vector Machines
Konstantinos Zagoris, Nikos Papamarkos
Image Processing and Multimedia Laboratory, Department of Electrical & Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
Email: [email protected]
http://ipml.ee.duth.gr/~papamark/




In order to successfully locate and retrieve document images such as technical articles and newspapers, a text localization technique must be employed. The proposed method detects and extracts homogeneous text areas in document images, regardless of font type and size, by using connected components analysis to detect blocks of foreground objects. Next, a descriptor consisting of a set of structural features is extracted from the merged blocks and used as input to a trained Support Vector Machine (SVM). Finally, the output of the SVM classifies each block as text or not.


1. (Title slide, as above.)

2. Nowadays there is an abundance of document images, such as technical articles, business letters, faxes and newspapers, without any indexing information. In order to exploit them successfully in systems such as OCR, a text localization technique must be employed.

3. Existing approaches fall into two families: bottom-up techniques and top-down techniques.

4. The proposed technique is a continuation of the work "PLA using RLSA and a neural network" by C. Strouthopoulos, N. Papamarkos and C. Chamzas. In the proposed work the feature set is adaptive, the feature reduction technique is simpler, and the classifier is an SVM.

5. Outline of the method:
- Apply preprocessing techniques (binarization, noise reduction)
- Locate, merge and extract blocks
- Extract the features from the blocks
- Find the blocks which contain text using Support Vector Machines
- Locate or extract the text blocks and present them to the user

6. (Figure slide) The original document; after the preprocessing step; the connected components; the expanded connected components; the final blocks. (A connected-components sketch follows this transcript.)

7. The features are a set of suitable Document Structure Elements (DSEs) that the blocks contain. A DSE is any 3x3 binary block, so there are $2^9 = 512$ DSEs in total. With the nine pixels of the window ordered $b_0, b_8, b_7, b_6, b_5, b_4, b_3, b_2, b_1$ (the pixel order of the DSEs, shown in the slide's figure), the index of a DSE $L_j$ is
$$ j = \sum_{i=0}^{8} b_i \, 2^i $$
(for example, the DSE $L_{142}$). (A sketch of this computation follows this transcript.)

8. The initial descriptor of a block is the histogram of the DSEs that the block contains. The length of the initial descriptor is 510: the L0 and L511 DSEs are removed because they correspond to pure background and pure document objects, respectively. A feature reduction algorithm is then applied to reduce the number of features; the selected features are the DSEs that most reliably separate the text blocks from the others. We call this feature reduction algorithm Feature Standard Deviation Analysis of Structure Elements (FSDASE). (A descriptor sketch follows this transcript.)

9. FSDASE (a sketch follows this transcript):
- For each DSE Ln, find the standard deviation SDXT(Ln) over the text blocks.
- For each DSE Ln, find the standard deviation SDXP(Ln) over the non-text blocks.
- Normalize them.
- Define O(Ln) = |SDXT(Ln) − SDXP(Ln)|.
- Finally, take the 32 DSEs that correspond to the first 32 maximum values of O(Ln).

10. The goal of FSDASE is to find those DSEs that have maximum standard deviation over the text blocks and minimum standard deviation over the non-text blocks, and vice versa. A training dataset is required; this does not cause a problem, because such a dataset is already required for training the SVM. The final block descriptor is therefore a vector with 32 elements, corresponding to the frequencies of the 32 selected DSEs in the block.

11. The descriptor has the ability to adapt to the demands of each set of document images: a noisy document yields a different set of DSEs than a clean document. If more computational power is available, the descriptor size can easily be increased above 32. This descriptor is used to train the Support Vector Machine.

12. Support Vector Machines are based on statistical learning theory. They need training data, and they separate the space in which the training data reside into two classes. The training data must be linearly separable.

13. If the training data are not linearly separable (as in our case), they are mapped from the input space to a feature space using the kernel method. Our experiments showed the Radial Basis Function, exp(−γ‖x − x′‖²), to be the most robust kernel. The parameters of the SVM are selected by a cross-validation procedure using a grid search. The output of the SVM classifies each block as text or not. (A training sketch follows this transcript.)
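A minimal sketch of the block-location step (slides 5-6), assuming a binarized page where foreground pixels are 1. The library choice (scipy) and the fixed dilation count used to "expand" components so neighbouring characters merge are illustrative assumptions, not the authors' merging procedure.

```python
from scipy import ndimage

def extract_blocks(binary_page, expand=3):
    """Detect connected components, expand them so nearby components merge,
    and return the bounding boxes of the final blocks."""
    # Dilate the foreground so the letters of a word or line join into one block.
    expanded = ndimage.binary_dilation(binary_page, iterations=expand)
    # Label the expanded connected components.
    labels, _count = ndimage.label(expanded)
    # One (row_slice, col_slice) bounding box per final block.
    return [box for box in ndimage.find_objects(labels) if box is not None]
```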
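A minimal sketch of the DSE index computation from slide 7: every 3x3 binary window maps to an index j = Σ b_i·2^i, giving 2^9 = 512 possible DSEs. The exact pixel-to-bit ordering b0..b8 is given by the paper's figure; a row-major order is assumed here for illustration.

```python
import numpy as np

def dse_index(window3x3):
    """Return the DSE index j of a 3x3 binary window: j = sum b_i * 2^i."""
    bits = np.asarray(window3x3, dtype=int).ravel()  # b0..b8, assumed row-major
    return int(bits @ (2 ** np.arange(9)))

# e.g. dse_index(np.zeros((3, 3))) == 0   (L0, pure background)
#      dse_index(np.ones((3, 3)))  == 511 (L511, pure foreground)
```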
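A minimal sketch of the initial 510-element block descriptor from slide 8: the histogram of DSE indices over every 3x3 window of a binary block, with L0 (pure background) and L511 (pure foreground) dropped. It reuses `dse_index` from the sketch above; the sliding-window scan is an assumed reading of "the DSEs that the block contains".

```python
import numpy as np

def dse_histogram(block):
    """Initial descriptor: histogram of DSEs L1..L510 found in a binary block."""
    block = np.asarray(block, dtype=int)
    h, w = block.shape
    counts = np.zeros(512)
    for r in range(h - 2):
        for c in range(w - 2):
            counts[dse_index(block[r:r + 3, c:c + 3])] += 1  # dse_index: sketch above
    return counts[1:511]  # drop L0 and L511
```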
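A minimal sketch of FSDASE feature selection (slides 9-10), assuming `text_descr` and `nontext_descr` are arrays of shape (n_blocks, 510) holding the initial DSE histograms of the training blocks. The normalization step is interpreted here as scaling each standard-deviation vector by its maximum; the slides do not specify the exact normalization.

```python
import numpy as np

def fsdase(text_descr, nontext_descr, n_features=32):
    """Return the indices of the DSEs whose (normalized) standard deviations
    differ most between text and non-text blocks."""
    sd_text = text_descr.std(axis=0)        # SDXT(Ln) for each DSE Ln
    sd_nontext = nontext_descr.std(axis=0)  # SDXP(Ln) for each DSE Ln
    sd_text = sd_text / sd_text.max()       # normalize (assumes nonzero counts)
    sd_nontext = sd_nontext / sd_nontext.max()
    o = np.abs(sd_text - sd_nontext)        # O(Ln) = |SDXT(Ln) - SDXP(Ln)|
    return np.sort(np.argsort(o)[-n_features:])  # the 32 maximum values of O(Ln)
```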
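A minimal sketch of the classification step (slides 12-13) using scikit-learn, an assumed implementation choice: an RBF-kernel SVM whose C and gamma parameters are chosen by cross-validated grid search, as the slides describe. The grid ranges and fold count are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_text_classifier(descriptors, labels):
    """descriptors: (n_blocks, 32) FSDASE features; labels: 1 = text, 0 = non-text."""
    param_grid = {
        "C": 2.0 ** np.arange(-5, 16, 2),      # assumed grid-search ranges
        "gamma": 2.0 ** np.arange(-15, 4, 2),
    }
    # RBF kernel: K(x, x') = exp(-gamma * ||x - x'||^2)
    grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    grid.fit(descriptors, labels)
    return grid.best_estimator_  # classifies each block as text or not
```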
14. Experiments: the Document Image Database from the University of Oulu is employed. In our experiments we used the set of 48 article documents; these document images contained a mixture of text and pictures. From this database, five images were selected, and their extracted blocks were used to determine the proper DSEs and to serve as training samples for the SVM. The overall results are:

Document Images | Blocks | Success Rate
48              | 25958  | 98.453%

15.-19. (Figure slides) Original image and the output of the proposed method, for five example documents.

20. Conclusions: A bottom-up text localization technique is proposed that detects and extracts homogeneous text from document images. A connected-component analysis technique detects the objects of the document, and a flexible descriptor based on structural elements is extracted. The descriptor adapts to the demands of each set of document images; for example, a noisy document yields a different set of DSEs than a clean document, and if more computational power is available the descriptor size can easily be increased above 32. A trained SVM classifies the objects as text or non-text. The experimental results are very promising.

21. THANK YOU!