Finding Entries in an On-line Arabic Dictionary

27 May 2010

27th Annual HCIL Symposium

Sarah C. Wayland, C. Anton Rytting, David Zajic, Timothy Buckwalter, Jason White, Corey Miller, Jeffrey Carnes, Nathanael Lynn, Paul Rodrigues, Michael Maxwell, Evelyn Browne

LANGUAGE RESEARCH IN SERVICE TO THE NATION

Arabic is not English

• Different sounds (e.g., voiceless uvular stop /q/, retroflex /l/, voiced velar fricative /gh/, glottal stop /’/)

• Different letters (مباريات)

• Different morphology (templatic vs. affixative)

• Written form doesn’t reflect spoken dialect

• Keyboard has different layout/letters

Many informal texts diverge from Modern Standard Arabic

Texts differ from classroom Arabic in orthography, morphology, and lexical content.

Orthographic differences are based on dialect pronunciations, typographical errors, and ... “style.”

Orthographic Differences

Dialect | SATTS (no vowels) | Native (no vowels)
MSA (Modern Standard Arabic) | KLB | كلب
Iraqi (with Persian character) | #CLB, J-LB | چلب
Iraqi (with MSA character) | JLB | جلب

Some dialects use non-standard characters

Phonetic Differences

Dialect | Letter | Native | Transliteration | Pronunciation
Educated Urban (MSA) | ق | قلب | qlb | /qalb/
Iraq | گ | گلب | glb | /gaLub/
Sudan | غ | غلب | ghlb | /ghaLib/
Cairo | أ | ألب | ’lb | /’alb/

Consonants sometimes vary across dialects
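One way to use this table during lookup is to generate MSA spelling candidates for a dialect spelling. The sketch below is only an illustration under stated assumptions: the function name `msa_candidates` is hypothetical, the variant set is limited to the three letters in the table above, and the slides do not say the dictionary works this way.

```python
# Letters from the table above that can stand in for MSA qaf in dialect spellings.
# Written as escapes so the logical letter order is unambiguous.
QAF = "\u0642"            # ق
QAF_VARIANTS = {
    "\u06AF",             # گ (Iraq)
    "\u063A",             # غ (Sudan)
    "\u0623",             # أ (Cairo)
}

def msa_candidates(word):
    """Return the word plus every spelling with one variant letter replaced by qaf."""
    candidates = {word}
    for i, ch in enumerate(word):
        if ch in QAF_VARIANTS:
            candidates.add(word[:i] + QAF + word[i + 1:])
    return candidates

LAM, BEH = "\u0644", "\u0628"     # ل and ب
iraqi_spelling = "\u06AF" + LAM + BEH   # گلب
msa_spelling = QAF + LAM + BEH          # قلب
print(msa_spelling in msa_candidates(iraqi_spelling))
```

Each candidate would still need to be checked against the dictionary, since the same letters also occur as ordinary consonants.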

Morphologically Complex

Transliteration | Native | Gloss
qalb* | قلب | “heart”
Al-qalb | القلب | “the-heart”
quluwb* | قلوب | “hearts”
Al-quluwb | القلوب | “the-hearts”
qalb-iy | قلبي | “my-heart”
quluwb-naA | قلوبنا | “our-hearts”
qalb-ak | قلبك | “your-heart (to a man)”
qalb-ik | قلبك | “your-heart (to a woman)”
qulayb | قليب | “little heart”

* (the only forms listed in the dictionary)
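The gap between the forms a reader meets and the forms the dictionary lists can be illustrated with a small affix-stripping lookup. This is a minimal sketch, not the presenters' method: it works on the hyphenated transliterations shown above, the names `LISTED` and `find_entry` are hypothetical, and the affix lists contain only the affixes that appear on this slide.

```python
# The two forms the slide marks as listed in the dictionary.
LISTED = {"qalb": "heart", "quluwb": "hearts"}

# Affixes taken from the examples above (hyphenated transliteration).
PREFIXES = ("Al-",)
SUFFIXES = ("-iy", "-naA", "-ak", "-ik")

def find_entry(form):
    """Strip one known prefix and one known suffix, then look the stem up."""
    stem = form
    for prefix in PREFIXES:
        if stem.startswith(prefix):
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix):
            stem = stem[:-len(suffix)]
            break
    return stem if stem in LISTED else None

print(find_entry("Al-quluwb"))   # quluwb
print(find_entry("qalb-ak"))     # qalb
print(find_entry("qulayb"))      # None
```

The diminutive qulayb is not found because it is formed by changing the word's internal pattern rather than by adding an affix, which is the templatic behavior that simple stripping cannot handle.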

The Arabic keyboard makes difficult-to-detect typos likely

Adjacent letters are often visually similar

Adjacent letters also often sound similar (with contrasts not found in English)

Adjacent letters also often sound similar (with contrasts subject to place-assimilation)

Adjacent letters also often sound similar (particularly so in some dialect pronunciations)
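To make the adjacency point concrete, the sketch below derives keyboard confusion pairs from one row of keys. The row used here is an assumption: it is the top letter row of the common Arabic (101) layout as best recalled, not something shown in the slides, and `adjacent_pairs` is a hypothetical helper name.

```python
# Assumed top letter row of the Arabic (101) keyboard, left to right:
# ض ص ث ق ف غ ع ه خ ح ج د
TOP_ROW = (
    "\u0636\u0635\u062B\u0642\u0641\u063A"
    "\u0639\u0647\u062E\u062D\u062C\u062F"
)

def adjacent_pairs(row):
    """Return the unordered pairs of keys that sit next to each other in a row."""
    return {frozenset(pair) for pair in zip(row, row[1:])}

pairs = adjacent_pairs(TOP_ROW)

# خ and ح are neighbors under this assumed layout and differ only by a dot.
print(frozenset(("\u062E", "\u062D")) in pairs)
# So are غ and ع.
print(frozenset(("\u063A", "\u0639")) in pairs)
```

Pairs like these are easy to mistype and hard to spot on rereading, because the wrong letter looks almost like the right one.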

Putting DYM…? together

• A query is checked by composing a single-string finite state automaton (FSA) with:
– weighted keyboard, visual, and sound-based FSTs
– a dictionary FSA (with weights for dialect variants)

• The n-best paths yielding unique strings are calculated

• The corresponding strings are displayed to the user

[Diagram: keyboard, visual, and sound-based FSTs; example strings: HARB, ?ARB, OARB, ....]
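The steps above can be approximated without a finite-state toolkit. The sketch below is a brute-force stand-in, not FST composition: it scores every dictionary entry directly, allows substitutions only (no insertions or deletions), and takes the cheapest channel for each substitution. All weights and confusion pairs are invented for illustration, and names such as `did_you_mean` are hypothetical.

```python
import heapq

def sub_cost(typed, intended, channels):
    """Cheapest cost of reading a typed symbol as the intended one, or None."""
    if typed == intended:
        return 0.0
    key = frozenset((typed, intended))
    costs = [channel[key] for channel in channels if key in channel]
    return min(costs) if costs else None

def score(query, entry, channels):
    """Total substitution cost of turning query into entry, or None if impossible."""
    if len(query) != len(entry):
        return None
    total = 0.0
    for typed, intended in zip(query, entry):
        cost = sub_cost(typed, intended, channels)
        if cost is None:
            return None
        total += cost
    return total

def did_you_mean(query, dictionary, channels, n=3):
    """Return the n lowest-cost dictionary entries reachable from the query."""
    scored = []
    for entry, entry_weight in dictionary.items():
        cost = score(query, entry, channels)
        if cost is not None:
            scored.append((cost + entry_weight, entry))
    return [entry for _, entry in heapq.nsmallest(n, scored)]

# Invented confusion weights, one table per channel (lower = more likely).
keyboard = {frozenset(("H", "O")): 1.0}
visual = {frozenset(("R", "Z")): 0.8}
sound = {frozenset(("H", "?")): 0.5}

# Invented dictionary; the weight is an extra cost, e.g. for a dialect variant.
dictionary = {"HARB": 0.0, "?ARB": 0.2, "OARB": 0.0, "HAZB": 0.0, "KLB": 0.0}

print(did_you_mean("HARB", dictionary, [keyboard, visual, sound]))
# ['HARB', '?ARB', 'HAZB']
```

Because dictionary keys are unique, the returned strings are unique as well. A finite-state implementation would reach the same kind of ranking by composing the automata and extracting shortest paths, which scales far better than scoring every entry.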

[Slides 19-27: images not reproduced; visible labels include “Show verbs”, “Show non-verbs”, and “Download Results”]

Arabic is not English!

• One user interface for all languages will not work

• We must customize the user interface to take into account the unique structure of each language

Sarah C. Wayland

swayland@casl.umd.edu

301-226-8938
