Digitizing and Retrieving Printed Arabic Documents
Kareem Darwish, Senior Scientist, Qatar Computing Research Institute

Dr. Kareem Darwish's presentation at QITCOM 2011


DESCRIPTION

QITCOM 2011, May 24 | Day 1 | INNOVATE
Session 2: Digitizing Arabic Content - Lead the Way
Speaker: Dr. Kareem Darwish, Arabic Language Technology Senior Scientist, Qatar Computing Research Institute, Qatar Foundation
Topic: E-Learning: The Future of Arabic Digital Content
For more information visit www.qitcom.com.qa


Page 1: Dr. Kareem Darwish's presentation at QITCOM 2011

Digitizing and Retrieving Printed Arabic Documents

Kareem Darwish

Senior Scientist

Qatar Computing Research Institute

Page 2: Dr. Kareem Darwish's presentation at QITCOM 2011

Overview

Some Magic

Search results

Scanning

Page 5: Dr. Kareem Darwish's presentation at QITCOM 2011

Scanning

http://www.kirtas.com

Page 6: Dr. Kareem Darwish's presentation at QITCOM 2011

Scanning

http://www.kirtas.com

Page 7: Dr. Kareem Darwish's presentation at QITCOM 2011

Result of Scanning

Courtesy of the Library of Alexandria

http://www.colophon.com

Page 8: Dr. Kareem Darwish's presentation at QITCOM 2011

Magic: Optical Character Recognition

Courtesy of the Library of Alexandria

والنيران المراقبة ناحيتى من . الساحلى السهل على

تسلكها التى االقتراب وطرقناحية من عربية قوات أى

طرق دى ر تنح الشرتى : التالية الثالثة أهمها خمسة

ا- وهو ألول ا الطريق ا - هـ بغداد من " 2233ألقصر

إلى. النحراف ا أو ، المفرتىقبل دمشق 3أ3الرطبة

األردن،إلى العرا من الرابى ا محاور

ألردد وا سوريابغداد- " 2 من االثانى الطريق

- - - ا دمشق بالميرا كمال أبوألردد.

األطول - 3 وهو الثالسث الطريق - - - الزور دير الموصل داد بغ من

- دمشق- حملروألردن " 68ا

OCR output (Sakhr)

Page 9: Dr. Kareem Darwish's presentation at QITCOM 2011

Arabic OCR is Hard

• Letters change shape depending on position in word, with dots distinguishing them from each other
  – تـ ، ـتـ ، ـت
  – قـ ، ـقـ ، ـق ، ق

• Diacritics are optional
  – ق ، قَ ، قُ ، قِ ، قْ ، قّ

• Some letter combinations have special shapes (ligatures):
  – ل + ا = لا

• Letter elongations (Kashida) are often used
  – قبـــــــــــــــــــــــــــل ، قبل

• Letters are connected

Isolated End Middle Start

ت ـت ـتـ تـ

ي ـي ـيـ يـ

ق ـق ـقـ قـ
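The optional diacritics and kashida listed on this slide mean the same word can appear as several different character strings. A minimal sketch of one common way to conflate them before matching follows; the function name `normalize` is hypothetical, and the code points used (U+0640 for kashida, U+064B to U+0652 for the diacritics) are taken from the Unicode Arabic block rather than from the presentation.

```python
import re

TATWEEL = "\u0640"  # kashida / elongation character
# Tanween, fatha, damma, kasra, shadda and sukun occupy U+064B..U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")


def normalize(text: str) -> str:
    """Drop kashida and diacritics so spelling variants compare equal."""
    return DIACRITICS.sub("", text.replace(TATWEEL, ""))


# "qbl" written with three kashida characters versus the plain form.
elongated = "\u0642\u0628" + TATWEEL * 3 + "\u0644"
plain = "\u0642\u0628\u0644"
print(normalize(elongated) == plain)
```

This only handles the encoding side of the problem; it does not help an OCR engine decide whether a mark on the page is a dot, a diacritic, or noise.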

Page 10: Dr. Kareem Darwish's presentation at QITCOM 2011

Arabic OCR is Hard

Diacritics and dots are easily confused. If the manuscript is old, they can also be confused with speckle on the page.

Word error rate is typically greater than 20%!

Page 11: Dr. Kareem Darwish's presentation at QITCOM 2011

Arabic OCR is Hard

الخليقة تقاكظ سوق الجنة والنار وبها وتام فهى واألبا إر رالفجار رالكفارإلى المؤمنين

منشأ الخلق واألمر والثواب والعقاب ،وهى رغها رعن له الخليقة خطقت الذى اهدن

والحسابحقرقها السمؤال

Typical OCR output

Page 12: Dr. Kareem Darwish's presentation at QITCOM 2011

Arabic Morphology Challenges

• Arabic uses complex derivational morphology:
  – Root (ex. ktb)
  – Stem: root in a template (ex. mkAtbp)
  – Word: stem with optional determiner, preposition, coordinating conjunctions, plural suffix, etc. (ex. w+Al+mkAtbp+At → wAlmkAtbAt)
  – Estimated number of possible words: 60 billion

• Morphology dictates diacritics, which change meaning
  – Ex. Elm (Eelm, Ealam, Eolem: knowledge, flag, acknowledge)

• No specific writing standard is prevalent:
  – Ex. The trailing letters in Ely (Ali) and ElY (on) are often interchanged

Page 13: Dr. Kareem Darwish's presentation at QITCOM 2011

Arabic Morphology

• For regular Arabic search, morphological analysis is typically used:
  – Full morphological analysis:
    • Sebawai, Buckwalter, IBM Lee, AMIRA
  – Light stemming: remove common prefixes and suffixes
    • Al-Stem or Light-10

• For OCR output, they fail
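A rough sketch of what light stemming does, and of why it breaks on OCR output, is given below. It works on the same Latin transliteration used in the slides. The affix lists, the minimum stem length, and the name `light_stem` are illustrative assumptions; they are not the actual Al-Stem or Light-10 rules.

```python
# Illustrative affix lists in transliteration, longest first.
# These are assumptions for the example, NOT the Al-Stem or Light-10 lists.
PREFIXES = ["wAl", "bAl", "kAl", "fAl", "Al", "w"]
SUFFIXES = ["At", "wn", "yn", "hA", "p", "h", "y"]
MIN_STEM = 3  # never strip an affix if fewer than 3 letters would remain


def light_stem(word: str) -> str:
    """Strip at most one common prefix and one common suffix."""
    word = word.replace("Y", "y")  # conflate the interchangeable trailing letters
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word


print(light_stem("wAlmkAtbAt"))  # clean word from the slides
print(light_stem("mkAtbp"))      # its stem form
print(light_stem("rAlmkAtbAt"))  # hypothetical OCR error: leading w read as r
```

With these lists, the clean word and its stem form reduce to the same string, so a query for one matches the other. The hypothetical OCR variant no longer starts with a known prefix, so the prefix is left in place and the result does not match.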

Page 14: Dr. Kareem Darwish's presentation at QITCOM 2011

OCR Error Handling

• Error correction:
  – Word level techniques:
    • Dictionary lookup (Jurafsky & Martin, 2000)
      – Character level model uses confusion matrix
      – Typically font dependent
    • Character n-gram model:
      – Some character sequences are more common than others
      – Presence of a rare character sequence indicates position of error

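$$\arg\max_{\text{Word}_{\text{Org}}} P(\text{Word}_{\text{Org}} \mid \text{Word}_{\text{OCR}}) = \arg\max_{\text{Word}_{\text{Org}}} P(\text{Word}_{\text{OCR}} \mid \text{Word}_{\text{Org}}) \, P(\text{Word}_{\text{Org}})$$

Here P(WordOCR | WordOrg) is the char level model and P(WordOrg) is the word level model.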
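The word level correction on this slide can be sketched as a small noisy-channel scorer: each dictionary word is scored by a character confusion model times a word prior, and the best one is returned. All probabilities, the three-word dictionary, and the function names below are made-up assumptions for illustration. The character model is also simplified to one-to-one substitutions, whereas a real confusion matrix would cover insertions, deletions and merges.

```python
import math

# Hypothetical word level model: P(WordOrg) for a toy dictionary.
WORD_PROB = {"cement": 0.2, "recent": 0.5, "decent": 0.3}
# Hypothetical confusion matrix entries, keyed (original_char, ocr_char).
CONFUSION = {("m", "n"): 0.1, ("e", "o"): 0.05}
P_CORRECT = 0.8  # assumed probability that a character is read correctly
P_OTHER = 0.001  # assumed probability of any unlisted confusion


def char_model(ocr: str, orig: str) -> float:
    """log P(ocr | orig), allowing only one-to-one character substitutions."""
    if len(ocr) != len(orig):
        return float("-inf")
    logp = 0.0
    for ocr_ch, orig_ch in zip(ocr, orig):
        if ocr_ch == orig_ch:
            p = P_CORRECT
        else:
            p = CONFUSION.get((orig_ch, ocr_ch), P_OTHER)
        logp += math.log(p)
    return logp


def correct(ocr: str) -> str:
    """Return the dictionary word maximizing P(ocr | word) * P(word)."""
    return max(WORD_PROB, key=lambda w: char_model(ocr, w) + math.log(WORD_PROB[w]))


print(correct("cenent"))
```

Because the confusion probabilities depend on how a particular typeface is misread, a table like `CONFUSION` would need re-estimating per font, which is the font dependence the slide mentions.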

Page 15: Dr. Kareem Darwish's presentation at QITCOM 2011

OCR Error Handling

• Error correction:
  – Passage level/context sensitive techniques:
    • Using language modeling (bi or trigram LM):
    • Clustering words in passage:
      – assumes salient terms appear more than once
      – Ex. Kennedy; Kemedy; Kennody; etc.

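$$P(\text{Word}_{\text{Org}} \mid \text{Word}_{\text{OCR}}) \propto P(\text{Word}_{\text{OCR}} \mid \text{Word}_{\text{Org}}) \, P(\text{Word}_{\text{Org}})$$

With a bigram LM, the word level term P(WordOrg) is replaced by P(WordOrg | WordOrg-1).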
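The clustering idea on this slide can be sketched as follows: group the words of a passage that lie within a small edit distance of each other, then map each one to the most frequent member of its group. The distance threshold of 2 and the function names are assumptions chosen for the example, not values from the presentation.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def cluster_and_correct(words, max_dist=2):
    """Map each distinct word to the most frequent word within max_dist of it."""
    counts = Counter(words)
    mapping = {}
    for word in counts:
        near = [w for w in counts if edit_distance(word, w) <= max_dist]
        mapping[word] = max(near, key=lambda w: counts[w])
    return mapping


passage = ["Kennedy", "Kennedy", "Kennedy", "Kemedy", "Kennody", "Boston"]
print(cluster_and_correct(passage))
```

The approach relies on the slide's assumption that salient terms repeat: a word that occurs only once, and only in a corrupted form, has nothing to be corrected towards.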

Page 16: Dr. Kareem Darwish's presentation at QITCOM 2011

OCR Error Handling

• Multi-source fusion:
  – Uses language modeling to fuse the output of multiple OCR systems

• Query garbling:
  – Use a character level model to generate multiple degraded versions of a query
    • Ex.: cement => cement, cornent, cernont, etc.
  – Set degraded versions of a term as synonyms
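Query garbling can be sketched as a small generator that applies likely OCR confusions to the query term. The two confusions below (m read as rn, e read as o) are assumptions picked so that the slide's "cement" example is reproduced. A real system would take them, with probabilities, from a trained character level model.

```python
# Hypothetical confusions: original character -> strings OCR may produce instead.
CONFUSIONS = {"m": ["rn"], "e": ["o"]}


def garble(term: str, max_edits: int = 2):
    """Return the term plus degraded versions with up to max_edits confusions."""
    results = set()

    def expand(prefix: str, rest: str, edits: int) -> None:
        if not rest:
            results.add(prefix)
            return
        head, tail = rest[0], rest[1:]
        expand(prefix + head, tail, edits)  # keep the character as is
        if edits < max_edits:
            for sub in CONFUSIONS.get(head, []):
                expand(prefix + sub, tail, edits + 1)  # apply one confusion
    expand("", term, 0)
    return sorted(results)


variants = garble("cement")
print(variants)
```

The resulting variants would then be treated as synonyms of the original term, for example by OR-ing them in the query, so that documents containing the misrecognized forms are still retrieved.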

Page 17: Dr. Kareem Darwish's presentation at QITCOM 2011

Arabic OCR Text Retrieval

• Without error handling: use character n-grams (3- and 4-grams)

الخليقة تقاكظ سوق الجنة والنار وبها وتام فهى واألبا إر رالفجار رالكفارإلى المؤمنين

منشأ الخلق واألمر والثواب والعقاب ،وهى رغها رعن له الخليقة خطقت الذى اهدن

والحسابحقرقها السمؤال

OCR word: رالفجار → رال ، الف ، فجا ، جار

Query word: والفجار → وال ، الف ، فجا ، جار
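The n-gram example on this slide can be reproduced with a few lines of code. The sketch below generates all overlapping character trigrams for the OCR word and the query word and counts how many they share; the helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(word: str, n: int):
    """All overlapping character n-grams of a word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


ocr_word = "رالفجار"    # OCR output: first letter misrecognized
query_word = "والفجار"  # the word the user actually searches for

ocr_grams = set(char_ngrams(ocr_word, 3))
query_grams = set(char_ngrams(query_word, 3))
shared = ocr_grams & query_grams

print(len(shared), "of", len(query_grams), "query trigrams also occur in the OCR word")
```

Only the trigram containing the misrecognized first letter differs, so an index built on character n-grams still matches most of the query even though the whole words are not equal.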

Page 18: Dr. Kareem Darwish's presentation at QITCOM 2011

Presenting Results

• Presenting OCR output to users is not an option
• What would a ranked list of images look like?
  – How would we generate image snippets?
  – How do we highlight salient terms in these images?

Page 19: Dr. Kareem Darwish's presentation at QITCOM 2011

Presenting Results

• What is the unit of search?
  – Is it a book, a chapter, or a page?

Page 20: Dr. Kareem Darwish's presentation at QITCOM 2011

Concluding Remarks

• Scanning is a fairly mature technology
• Arabic OCR has quite a ways to go
• Quality of search is tied to the quality of OCR
• Presentation issues persist