
Proceedings of the 15th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 54–65, Brussels, Belgium, October 31, 2018. ©2018 The Special Interest Group on Computational Morphology and Phonology

https://doi.org/10.18653/v1/P17


Complementary Strategies for Low Resourced Morphological Modeling

Alexander Erdmann and Nizar Habash
Computational Approaches to Modeling Language Lab
New York University Abu Dhabi
United Arab Emirates

{ae1541,nizar.habash}@nyu.edu

Abstract

Morphologically rich languages are challenging for natural language processing tasks due to data sparsity. This can be addressed either by introducing out-of-context morphological knowledge, or by developing machine learning architectures that specifically target data sparsity and/or morphological information. We find these approaches to complement each other in a morphological paradigm modeling task in Modern Standard Arabic, which, in addition to being morphologically complex, features ubiquitous ambiguity, exacerbating sparsity with noise. Given a small number of out-of-context rules describing closed class morphology, we combine them with word embeddings leveraging subword strings and noise reduction techniques. The combination outperforms both approaches individually by about 20% absolute. While morphological resources already exist for Modern Standard Arabic, our results inform how comparable resources might be constructed for non-standard dialects or any morphologically rich, low resourced language, given scarcity of time and funding.

1 Introduction

Morphologically rich languages pose many challenges for natural language processing tasks. This often takes the shape of data sparsity, as the increase in the number of possible inflections for any given core concept leads to a lower average word frequency of individual (i.e., unique) word types. Hence, models have fewer chances to learn about types based on their in-context behavior. One common, albeit time consuming response to this challenge is to introduce out-of-context morphological knowledge, hand crafting rules to relate forms inflected from the same lemma. The other common response is to adopt machine learning architectures specifically targeting data sparsity and/or morphological information. We find these two responses to be complementary in a paradigm modeling task for Modern Standard Arabic (MSA).

MSA is characterized by morphological richness and extreme orthographic ambiguity, compounding the issue of data sparsity with noise (Habash, 2010). Despite its challenges, MSA is relatively well resourced, with many solutions for morphological analysis and disambiguation leveraging large amounts of annotated data, hand crafted rules, and/or sophisticated neural architectures (Khoja and Garside, 1999; Habash and Rambow, 2006; Smrž, 2007; Graff et al., 2009; Pasha et al., 2014; Abdelali et al., 2016; Inoue et al., 2017; Zalmout and Habash, 2017). Such resources and techniques, however, are not available or not viable for the many under resourced and often mutually unintelligible dialects of Arabic (DA), which are similarly morphologically rich and highly ambiguous (Chiang et al., 2006; Erdmann et al., 2017). Many recent efforts seek to develop morphological resources for DA, but most are under developed or specific to one dialect (Habash et al., 2012; Eskander et al., 2013; Jarrar et al., 2014; Al-Shargi et al., 2016; Eskander et al., 2016a; Khalifa et al., 2016, 2017; Zribi et al., 2017; Zalmout et al., 2018; Khalifa et al., 2018).

This work does not aim to develop a full morphological analysis and disambiguation resource, but to inform how one might be most efficiently developed for any DA variety or similarly low resourced language, given scarcity of time and funding. For such a resource to be practical and easily extendable to new DA varieties, it must take as input the natural, highly ambiguous orthography. Thus, we do not rely on constructed phonological representations to clarify ambiguities, as is common practice when modeling morphology for its own sake (Cotterell et al., 2016, 2017). To inform how such a resource should be developed, we evaluate minimally rule based and unsupervised techniques for clustering words that belong to the same paradigm in MSA. We primarily use pre-existing MSA resources only for evaluation, constraining resource availability to emulate DA settings during training, as we lack the resources to evaluate our techniques in DA. Our best system combines a minimal set of rules describing closed class morphology with word embeddings that leverage subword strings and noise reduction strategies. The former, despite being cheaper and easier to produce than other rule-based systems, provides valuable out-of-context morphological knowledge, which the latter complements by modeling the in-context behavior of words and morphemes. Combining the techniques outperforms either individually by about 20% absolute.

2 Morphology and Ambiguity

Arabic morphology is structurally and functionally complex. Structurally, paradigms are relatively large. Component cells convey morpho-syntactic properties at a much finer granularity than English. Functionally, many morphological processes are non-concatenative, or templatic. Arabic roots are lists of de-lexicalized radicals, which must be mapped onto a template to derive a word. The derived word will then exhibit some predictable semantic and morpho-syntactic relationship to the root, based on its template. For example, the root ر د د r d d,1 having to do with responding, could take a singular nominal template where geminates are collapsed, becoming رد rd, 'response', or a so-called broken plural template, separating the geminate with a long vowel to become ردود rdwd, 'responses'. Arabic orthography complicates the issue further, as diacritics marking short vowels, gemination, and case endings are typically not written. In addition to causing frequent lexical ambiguity among forms that are pronounced differently, this also causes templatic processes to appear to be concatenative or completely disappear. For example, deriving 'to cool' برد brd (fully diacritized, بَرَّد bar∼ad) from 'coldness' برد brd (fully diacritized, بَرْد bar.d) involves doubling the second root letter and adding a short vowel before the third, yet these templatic changes usually disappear in the orthography.

1Arabic transliteration is presented in the Habash-Soudi-Buckwalter scheme (Habash et al., 2007).

Most templatic processes are derivational, deriving new core meanings with separate paradigms from a shared root. Inflectional processes generally concatenate affixes to a shared stem to realize different cells in the same paradigm. Broken plurals however, like ردود rdwd, are a notable exception, resulting from a templatic inflectional process. Approximately 55% of all plurals are broken (Alkuhlani and Habash, 2011).

Arabic is further characterized by frequent cliticization of prepositions, conjunctions, and object pronouns. Thus, a single syntactic word can take many cliticized forms, potentially becoming homonymous with inflections of unrelated lemmas or distinct cells in the same paradigm. The برد brd, 'response'–'coldness' ambiguity exemplifies this. The 'response' meaning interprets ب b as a cliticized preposition meaning 'with', while the 'coldness' meaning interprets ب b as the first root radical. To investigate how these morphological traits affect our ability to model paradigms, we define the following morphological structures.

Paradigm All words that share a certain lemma comprise a paradigm, e.g., in Figure 1, the paradigm of verbal lemma رَدّ rad∼, 'to respond', contains the four words connected to it by a solid line. Ambiguity within the paradigm is referred to as syncretism, and is very common in Arabic. For example, the present tense second person masculine singular form is syncretic with the third person feminine singular in verbs, as shown by ترد trd, 'you[masc.sing]/she respond(s)'. Additionally, orthography normalizes short vowel distinctions between past tense second person masculine, second person feminine, and first person forms (and sometimes third person feminine), thus causing رَدَدْتَ radadta, رَدَدْتِ radadti, and رَدَدْتُ radadtu, respectively, to be orthographically syncretic. Cliticized forms can also cause unique syncretisms, e.g., بردنا brdnA has two possible interpretations from the same lemma بَرَّد bar∼ad, 'to cool'. If the final suffix نا nA is interpreted as a past tense verbal exponent, it means 'we cooled', whereas if it is interpreted as a cliticized personal pronoun, it becomes 'he/it cooled us'.

Figure 1: A clan of two families with two paradigms each, connected by both derivational and coincidental ambiguities. Line dotting style is only used to visually distinguish paradigm membership.

Subparadigm At or below the paradigm level, subparadigms are comprised of all words that share the same lemma ambiguity. Lemma ambiguity refers to the set of all lemmas a word could have been derived from out of context. Hence, برد brd and بردنا brdnA form a subparadigm, being the only words in Figure 1 which can all be derived exclusively from lemmas رَدّ rad∼, 'response', بَرْد bard, 'coldness', and بَرَّد bar∼ad, 'to cool'.

Family At or above the paradigm level, a family is comprised of all paradigms which can be linked via derivational ambiguity, such that all paradigms are derived from the same root. Thus, all forms mapping to the two paradigms which in turn map to the root ب ر د b r d, relating to cold, constitute a single family. The subparadigm of برد brd and بردنا brdnA links the two component paradigms via derivational ambiguity.2

Clan At or above the family level, a clan is comprised of all families which can be linked by coincidental ambiguity. Thus, the subparadigm of برد brd and بردنا brdnA, whose derivational ambiguity joins the paradigms of the ب ر د b r d family, also connects that family to the unrelated ر د د r d d family via coincidental ambiguity. This is caused by the multiple possible analyses of ب b as either a cliticized preposition or a root letter.

3 Experiments

In this section, we describe the data, design, and models used in our experiments.

2The linguistic concept of derivational family differs from ours in that it does not require any ambiguous forms to be shared by derivationally related paradigms. However, identifying such derivational families automatically is non-trivial. Even if the shared root can be identified, it can be difficult to determine whether the root is mono or polysemous, e.g., ش ع ر š ς r could refer to hair, poetry, or feeling. Regardless, our definition of family better serves our investigation into the effects of ambiguity.

3.1 Data

To train word embedding models, we use a corpus of 500,000 Arabic sentences (13 million words) randomly selected from the corpus used in Almahairi et al. (2016). This makes our findings more generalizable to DA, as many dialects have similar amounts of available data (Erdmann et al., 2018). We clean our corpus via standard preprocessing3 and analyze each word out of context with SAMA (Graff et al., 2009) to get the set of possible fully diacritized lemmas from which it could be derived.4

To build an evaluation set, we sum the frequencies of all types within each paradigm and bucket paradigms based on frequency. We randomly select evaluation paradigms such that all 10 buckets contribute at least 10 paradigms each. For all selected paradigms, any paradigms from the same clan are also selected, allowing us to assume that the paradigms included in the evaluation set are independent of those that are not included. Paradigms with only a single type are discarded, as these are not interesting for analysis. Our resulting EVAL set contains 1,036 words from 91 paradigms and a great deal of ambiguity at all levels of abstraction (see Table 1). Because we prohibit paradigms from entering EVAL without the rest of their clan, EVAL also exhibits the desirable property of reflecting a generally realistic distribution of ambiguity: 36% of its vocabulary are lemma ambiguous as compared to 39% for the entire corpus.

3We remove the rarely used diacritics and Alif/Ya normalize (El Kholy and Habash, 2012).

4We exclude words from the embedding model and evaluation set if they either cannot be analyzed by SAMA, only receive proper noun analyses, or if they do not also occur in the larger Arabic Gigaword corpus (Parker et al., 2011). This controls for many idiosyncrasies.
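The sampling procedure can be summarized in the following sketch. This is our own illustrative reconstruction, not released code: paradigms, freq, and clan_of are hypothetical precomputed inputs, and the exact bucketing mechanics are assumptions.

    import random

    def build_eval(paradigms, freq, clan_of, n_buckets=10, per_bucket=10):
        # Sum type frequencies per paradigm, then bucket by frequency.
        totals = {p: sum(freq[w] for w in ws) for p, ws in paradigms.items()}
        ranked = sorted(paradigms, key=lambda p: totals[p])
        bucket_size = len(ranked) // n_buckets
        selected = set()
        for b in range(n_buckets):
            bucket = ranked[b * bucket_size:(b + 1) * bucket_size]
            for p in random.sample(bucket, min(per_bucket, len(bucket))):
                selected |= clan_of[p]  # pull in the paradigm's whole clan
        # Discard single-type paradigms, which are uninteresting to analyze.
        return {p for p in selected if len(paradigms[p]) > 1}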


                Count   Ambiguous   Non-derivationally Ambiguous
    Clan           49          18                              5
    Family         55          24                             11
    Paradigm       91          60                             14
    Subparadigm   116          48                              6
    Word        1,036         372                             85

Table 1: Statistics from the EVAL set. Morphological structures by level of abstraction. Ambiguous structures contain at least one lemma ambiguous form. Non-derivationally ambiguous structures contain at least one coincidentally lemma ambiguous form.

Figure 2: Best clustering strategies for two paradigms (dotted versus dashed ovals) given single or multi prototype vocabulary representations.

3.2 Approach and Evaluation Metric

We build single and multi prototype representations of the entire vocabulary, then examine how well they reflect the paradigms in EVAL. Each representation can be thought of as a tree where each word is a leaf at depth 0, i.e., W1, W2, and W3 in Figure 2. Descending down the tree, words are clustered with other words' branches at subsequent depths until the clustering algorithm finishes or the root is reached where all words in the vocabulary are clustered together. All trees use some model of word similarity to guide clustering. In multi prototype representations, a word's leaf prototype at depth 0 can be copied and grafted onto other words' branches at non-zero depths before those branches are clustered to its own. Such is the case of W2, which is copied as W′2 at depth 1 of W3's branch before W3's branch connects to W2's. This enables partially overlapping paradigms to be modeled, like those in Figure 2.

We evaluate the trees via average maximum F-score. For each word in EVAL, we descend from its leaf, at each depth calculating an F-score for the overlap between the words that have been clustered to the leaf's branch so far and the leaf word's known paradigm mates, i.e., the set of words sharing at least one lemma with the leaf. Thus, paradigms are soft clusters in our representation, in that, for each word in a paradigm, its set of proposed paradigm mates need not be consistent with any of its proposed paradigm mates' sets of proposed paradigm mates. We then take the best F-score for each leaf word in EVAL, regardless of the depth level at which it was achieved, and average these maximum F-scores. This reflects how cohesively paradigms are represented in the tree.5 Additionally, we report the average depth at which templatic and concatenatively related paradigm mates are added.
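Concretely, the metric can be computed as in the following sketch (our own illustrative Python; branches and gold_mates are hypothetical precomputed inputs, with the leaf itself excluded from both sets):

    def f_score(proposed, gold):
        # Harmonic mean of precision and recall between two word sets.
        tp = len(proposed & gold)
        if tp == 0:
            return 0.0
        p, r = tp / len(proposed), tp / len(gold)
        return 2 * p * r / (p + r)

    def average_max_f(eval_words, branches, gold_mates):
        # branches[w]: cumulative word set on w's branch at each depth;
        # gold_mates[w]: w's true paradigm mates (words sharing a lemma).
        best = [max(f_score(depth_set, gold_mates[w])
                    for depth_set in branches[w])
                for w in eval_words]
        return sum(best) / len(best)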

Because we evaluate via average maximum F-score, this metric represents the potential performance of any given model. Future work will address predicting the depth level where average maximum F-score is achieved for a given leaf word via rule-based and/or empirical techniques that have proven successful for related tasks (Narasimhan et al., 2015; Soricut and Och, 2015; Cao and Rei, 2016; Bergmanis and Goldwater, 2017; Sakakini et al., 2017).

3.3 Word Similarity Models

We use the following word similarity models for clustering words in single and multi prototype tree representations.

LEVENSHTEIN The LEVENSHTEIN baseline uses only orthographic edit distances to form a multi prototype tree. At each depth level i, the branch will include every word which has an edit distance of i when compared to the leaf. Transitivity does not hold in this model, as words x and y could be in each other's depth 1 branch, but the fact that z is in y's depth 1 branch does not imply its inclusion in x's depth 1 branch. If the edit distance between x and z is greater than 1, copies, or additional prototypes, must be made of x and y. Because morphology involves complicated processes that cannot be explained merely via orthographic similarity, we predict this model will perform poorly. Still, this baseline is useful to ensure that other models are learning something from words' in-context behavior or out-of-context morphological knowledge beyond what can be superficially induced from edit distances.

5To control for idiosyncratic paradigms, we calculated a macro F-score averaged over the average maximum F-scores of individual paradigms, though we do not report this as results were not significantly different.

DELEX We use a de-lexicalized (DELEX) morphological analyzer to predict morphological relatedness. The analyzer covers all MSA closed-class affixes and clitics and their allowed combinations in open class parts-of-speech (POS); however, there is no information about stems and lemmas in the model.6 The affixes and clitics and their compatibility rules were extracted from SAMA (Graff et al., 2009). They are relatively cheap to create for any DA or other languages. The independent, expensive component of SAMA is the information regarding stems and lemmas, which we used to form our evaluation set. We are inspired by Darwish (2002), who demonstrated the creation of an Arabic shallow analyzer in one day. Our approach can be easily extended to DA at least in a similar manner to Salloum and Habash (2014).

To determine if two MSA words are possibly in the same paradigm, we do the following: (1) we use the analyzer to identify all potential stems with corresponding POS for each word (these stems are simply the leftover string after removing any prefixal and suffixal strings which match a prefix-suffix combination deemed compatible by SAMA); (2) each stem is deterministically converted into an orthographic root as per Eskander et al. (2013) by removing Hamzas (the set of letters representing the glottal stop phoneme, i.e., ء ', أ Â, إ Ǎ, آ Ā, ؤ ŵ, ئ ŷ), long vowels (ا A, ي y, و w, ى ý), and reducing geminate consonants (e.g., ردد rdd → رد rd); (3) two words are determined to be possibly from the same paradigm if there exists a possible orthographic root–POS analysis shared by both words.
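A minimal sketch of steps (2) and (3), operating on Habash-Soudi-Buckwalter transliterations for readability (the letter inventories and helper names are our assumptions, not the released code):

    import re

    HAMZAS = set("'ÂǍĀŵŷ")     # glottal-stop letters (assumed inventory)
    LONG_VOWELS = set("Aywý")  # Alif, Ya, Waw, Alif-Maqsura

    def orthographic_root(stem):
        # Step (2): drop Hamzas and long vowels, then collapse geminates,
        # e.g., rdwd -> rdd -> rd.
        reduced = "".join(c for c in stem if c not in HAMZAS | LONG_VOWELS)
        return re.sub(r"(.)\1+", r"\1", reduced)

    def possibly_same_paradigm(analyses1, analyses2):
        # Step (3): two words may share a paradigm if any orthographic
        # root-POS analysis overlaps; each analysis is a (stem, POS) pair.
        roots1 = {(orthographic_root(s), pos) for s, pos in analyses1}
        roots2 = {(orthographic_root(s), pos) for s, pos in analyses2}
        return bool(roots1 & roots2)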

DELEX builds a multi prototype tree with a maximum depth of 1. For each leaf word, it uses the above algorithm to identify all words in the vocabulary which can possibly share a paradigm with the leaf word, and grafts them into the branch. Hence, a word can belong to more than one hypothesized paradigm. Because DELEX has access to valuable morphological knowledge, we predict it will be a competitive baseline. Furthermore, it should produce nearly perfect recall, only missing rare exceptional forms, e.g., broken plurals that introduce new consonants such as برامج brAmj, 'programs', the plural of برنامج brnAmj, 'program'. We expect its precision to be weak because it lacks lexical or stem-pattern information, leading to rampant clustering of derivationally related and unrelated forms. For example, a word like جائزة jAŷz~, 'prize' (true root ج و ز j w z) receives the orthographic root ج ز j z (long vowel, hamza letter, and suffix are dropped), which clusters it with unrelated forms such as جزء jz', 'part' (true root ج ز ء j z '), and جز jz, 'shearing' (true root ج ز ز j z z).

6The system includes 15 basic prefixes/proclitics (ا A, ال Al, ب b, ف f, في fy, ك k, ل l, لا lA, ما mA, ن n, س s, ت t, و w, ي y, and φ) in 84 unique combinations; and 30 suffixes/enclitics (ا A, ان An, ات At, ه h, ها hA, هم hm, هما hmA, هن hn, ك k, كم km, كما kmA, كن kn, ن n, نا nA, ني ny, ة ~, ت t, تا tA, تان tAn, تم tm, تما tmA, تن tn, تي ty, تين tyn, و w, وا wA, ون wn, ي y, ين yn, and φ) in 193 unique combinations.

Word Embedding Models (W2V, FT, and FT+) We use different word embedding models to build single prototype representations of the vocabulary via binary hierarchical clustering (Müllner et al., 2013). In order to analyze the effects of data sparsity, we do not impose a minimum word frequency count, but learn vectors for the entire vocabulary. At depth 0, we consider each leaf word to be its own branch. Descending down the tree, we iteratively join the closest two branches based on Ward distance (Ward Jr, 1963). Joined branches are represented by the centroid of their component words' vectors (though, as in other models, we do not include the leaf word as a match when calculating average maximum F-score). We continue iterating until only a single root remains containing the entire vocabulary.

These trees are single prototype because the input embeddings only provide one vector for each word, regardless of whether or not it is ambiguous in any way. While this is a limitation for these models,7 existing multi prototype word embeddings generally model sense ambiguity, which is easier to capture (though harder to evaluate) given the unsupervised settings in which embeddings are typically trained (Reisinger and Mooney, 2010; Huang et al., 2012; Chen et al., 2014; Bartunov et al., 2016). Adapting multi prototype embeddings to model lemma ambiguities is non-trivial, especially without lots of supervision. We leave this for future work.

7A single prototype oracle that always correctly maps non-lemma ambiguous words to their paradigm and maps lemma ambiguous words only to their largest possible paradigm scores 97% (92% specifically on lemma ambiguous types). This represents the best possible performance for single prototype models.
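The tree construction itself can be sketched with standard agglomerative clustering, e.g., via SciPy (a toy illustration under our assumptions, not the exact pipeline):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    vocab = ["brd", "brdnA", "rd", "rdwd"]   # toy transliterated vocabulary
    vecs = np.random.rand(len(vocab), 200)   # stand-in for trained vectors

    # Ward-linkage agglomerative clustering; each row of Z records one
    # merge (branch_i, branch_j, distance, new branch size), and the merge
    # order defines the depth at which words join a leaf's branch.
    Z = linkage(vecs, method="ward")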

Because trees built from word embeddings are all constructed via the same binary clustering algorithm, the depths at which templatic and concatenatively inflected paradigm mates are joined in Table 2 are comparable vertically across W2V, FT, and FT+ as well as horizontally. However, the multi prototype trees are shorter and fatter, such that the templatic and concatenative average join depths are only comparable horizontally with each other, i.e., within the same model.

W2V The Gensim implementation of WORD2VEC (Mikolov et al., 2013a; Rehurek and Sojka, 2010) uses the SkipGram algorithm with 200 dimensions and a context window of 5 tokens on either side of the target word. As this does not have access to any subword information and is specifically designed for semantics, not morphology, we predict that it will not perform well in our evaluation.

FT We train a FASTTEXT (Bojanowski et al., 2016) implementation with the same parameters as W2V, except a word's vector is the sum of its SkipGram vector and that of all its component character n-grams between length 2 and 6. Since short vowels are not written, many Arabic affixes are only one character. With FASTTEXT bookending words with start/end symbols in its internal representation, outermost single-letter affixes are functionally two characters. By inducing knowledge of such affixes, these character n-gram parameters outperform the language agnostic range of 3 to 6 proposed by Bojanowski et al. (2016).
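For concreteness, the two configurations roughly correspond to the following Gensim calls (a sketch; parameter names follow Gensim 4.x, and corpus is an assumed iterable of tokenized sentences):

    from gensim.models import FastText, Word2Vec

    # W2V: SkipGram, 200 dimensions, window of 5, full vocabulary kept.
    w2v = Word2Vec(corpus, vector_size=200, window=5, sg=1, min_count=1)

    # FT: same settings plus character n-grams of length 2-6, which can
    # capture the short, often single-letter Arabic affixes.
    ft = FastText(corpus, vector_size=200, window=5, sg=1, min_count=1,
                  min_n=2, max_n=6)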

With the ability to model how subword strings behave in context, FT should outperform both LEVENSHTEIN and W2V, though without access to scholar seeded knowledge of morphological structures, it is difficult to predict how FT will compare with DELEX. Errors may arise from clustering words based on affixes indicative of syntactic behavior instead of the stem, which indicates paradigm membership. Also, if the word is infrequent and contains no semantically distinct subword string with higher frequency, the embeddings will be noisy. Frequency and noise also interact with the hubbiness, or crowdedness of the embedding region, as rural regions will require less precision in the vectors to cluster well, whereas there is little room for noise in crowded urban regions where many similar but morphologically unrelated words could interfere.

FT+ We build another FT model by concatenating the vectors learned from two variant FT models, one with the normal window size of 5 and one with a narrow window size of 1. Both are trained on a preprocessed corpus where phrases have been probabilistically identified in potentially unique distributions over multiple copies of each sentence, as described in Erdmann et al. (2018).8 This technique attempts to better model syntactic cues, which are better encoded with narrow context windows (Pennington et al., 2014; Trask et al., 2015; Goldberg, 2016; Tu et al., 2017), while avoiding treating non-compositional phrases as compositional, and also learning from multiple, potentially complementary phrase-chunkings of every sentence. By combining these sources of information, FT+ is designed to learn more meaningful vectors without requiring additional data. We predict it will uniformly outperform FT by reducing noise in the handling of sparse forms like infrequent inflections, a hallmark of morphologically rich languages.
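The combination step itself is simple concatenation (a sketch; ft_wide and ft_narrow are the two assumed trained models from above):

    import numpy as np

    def ft_plus_vector(word, ft_wide, ft_narrow):
        # FT+: concatenate the wide-window (5) and narrow-window (1)
        # FastText vectors for the word, doubling the dimensionality.
        return np.concatenate([ft_wide.wv[word], ft_narrow.wv[word]])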

FT+&DELEX We make unique copies for each leaf word's branch extending all the way to the root in the single prototype FT+ tree. Then, for each leaf word, at every depth of its branch copy, we use DELEX to prune any words which could not share an orthographic root with the leaf word. Pruning is local to that branch copy, and does not affect the branch copies of paradigm mates which had originally been proposed by FT+ before making branch copies. This makes FT+&DELEX a multi-prototype model. After pruning, the F-score is recalculated for each depth of each leaf word's branch and a new average maximum F-score is reported. Because FT+ encodes information regarding the in-context behavior of words, it is quite complementary to the out-of-context morphological knowledge supplied by DELEX. We predict this model will outperform all others.
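Reusing the possibly_same_paradigm helper sketched in the DELEX section, the pruning step might look like this (names and data layout are hypothetical):

    def prune_branch(leaf_analyses, branch_by_depth):
        # branch_by_depth: per depth, a list of (word, analyses) pairs on
        # the leaf's branch copy. Keep only candidates whose orthographic
        # root-POS analyses can overlap with the leaf's.
        return [[w for w, analyses in depth_members
                 if possibly_same_paradigm(leaf_analyses, analyses)]
                for depth_members in branch_by_depth]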

8For control, we compared every possible combination of narrow and wide window sizes (1 or 5), dimension sizes (200 or 400), and techniques for phrase identification (none, deterministic (Mikolov et al., 2013b), and probabilistic (Erdmann et al., 2018)), but none approached the performance achieved with the parameters used in FT+.


    Word Similarity   Multi       Averaged Scores                 Join Depth
    Model             Prototype   Max F-Score  Precision  Recall  Concat  Temp
    LEVENSHTEIN       X           22.0         35.5       23.4    3.5     4.1
    DELEX             X           52.9         41.6       99.3    1.0     1.0
    W2V                            2.1          6.7       28.1    17.1    17.4
    FT                            39.2         66.0       44.2    13.7    16.8
    FT+                           50.2         71.8       52.9    13.3    16.4
    FT+&DELEX         X           71.5         74.0       81.3    13.3    16.4

Table 2: Scores for clustering words with their paradigm mates in tree representations built from different models of word similarity. Scores are calculated as described in Section 3.2, with precision and recall extracted from the depth that maximizes F and then averaged over all words in EVAL. Join depths refer to the average depth at which templatic or concatenatively related paradigm mates are added to the branch.

4 Results and Discussion

The results in Table 2 provide strong evidence in support of our hypotheses. The only model performing worse than the LEVENSHTEIN edit distance baseline is W2V, which only understands the in-context, semantic behavior of words. By being able to learn morphological knowledge from in-context behavior of subword strings, FT greatly improves over both W2V and LEVENSHTEIN, demonstrating that it learns far more than can be inferred from out-of-context subword strings, i.e., edit distance, or in-context distributional semantic knowledge without any morphology, i.e., W2V. As predicted, FT+ improves uniformly over FT in all categories, presumably by reducing noise in the vectors of infrequent inflections. Interestingly, with no access to subword information, W2V performs equally poorly on both templatic and concatenatively related paradigm mates, whereas FT and FT+ greatly improve on concatenative mates, but not templatic ones. This is likely because FT and FT+ can identify patterns in subword strings, but not in non-adjacent characters.

DELEX's strong baseline performance demonstrates that simple, out-of-context, de-lexicalized knowledge of morphology is sufficient to outperform the best word embedding model that only learns from words' in-context behaviors. However, given the complementarity between DELEX's knowledge and the information FT+ can learn, it is not surprising that the combination of these techniques, FT+&DELEX, far outperforms either system individually.

Specific Examples We discuss a number of examples that illustrate the variety in the behavior and complementarity of rule-based DELEX, embedding-based FT+, and the combined FT+&DELEX models. For each example, we specify the strength of the maximum F-score for the three models as such:9 strength(DELEX)+strength(FT+)→strength(FT+&DELEX), e.g., LOW+MID→HI denotes poor DELEX and mediocre FT+ performance on a word, yielding high performance in the combined model.

9The strength designation HI is used for F-scores above 75%, LOW for scores below 25%, and MID for the rest.

• جائزة jAŷz~, 'prize' (LOW+HI→HI)
This word has high orthographic root ambiguity since its second morphological root radical is a Hamza. This results in matching words with unrelated true roots like جزء jz', 'part' and جز jz, 'shearing' under DELEX. It also has high root fertility, in that different paradigms can come from the same true root, like جائز jAŷz, 'permissible', further challenging DELEX. FT+ does relatively better, capturing the word's other inflections, even the broken plural جوائز jwAŷz, as their in-context behavior is similar to جائزة jAŷz~. Interesting recall errors by FT+ include semantically and orthographically similar فائزة fAŷz~, 'winner[fem.sing]'. The combination yields a perfect F-score.

• يهرعون yhrςwn, 'they rush' (HI+LOW→HI)
This word has an unambiguous orthographic root with no root fertility, resulting in a perfect F-score for DELEX. However, FT+ misses several inflections such as نهرع nhrς, 'we rush', and وهرعت whrςt, 'and I/you/she rushed'. FT+ also makes many semantically and/or syntactically similar precision errors: يسرعون ysrςwn, 'they hurry', يصارعون ySArςwn, 'they wrestle', and يقرعون yqrςwn, 'they ring (a bell)'. The combination leads to a perfect F-score.

• ديناميكي dynAmyky, 'dynamic' (HI+HI→HI)
This word has an unambiguous orthographic root based on a foreign borrowing and relatively unique semantics and subword strings. Thus, it achieves a perfect F-score in all three models.

• انتشروا AntšrwA, 'they spread out' (MID+MID→HI)
This word has high orthographic root ambiguity (and, incidentally, fertility) due to the presence of ن n and ت t, which could belong to a root, template, or prefix. This leads to a 63% F-score under DELEX with many precision errors: انتشاره AntšArh, 'his spreading out', ونتشاور wntšAwr, 'we discuss', and نتشارك ntšArk, 'we collaborate'. FT+ scores only 47%, proposing semantically related but morphologically unrelated or only derivationally related forms: e.g., منتشر mntšr, 'spread out' (adjective), and تمركزوا tmrkzwA, 'they centralized' (antonym). This semantic knowledge, however, complements DELEX's knowledge, such that the combination is almost perfect (98%).

• كفء kf', 'efficient' (LOW+LOW→LOW)
While 17% of words are LOW in DELEX and 28% in FT+, only 4% are LOW in FT+&DELEX. This word exemplifies that 4%, occupying the gap between DELEX's knowledge and FT+'s. It has an extremely ambiguous orthographic root due to the true root containing a Hamza and the first letter being interpretable as a proclitic or root radical. Thus, DELEX achieves 2% F. FT+ is only slightly better (5%). It is likely that this word's low frequency is the main contributor to its noisy embedding, as it only appears once in our corpus. The combination F-score is thus only 11%.

5 Related Work

This work builds on several others addressing word embeddings and computational morphology.

Word Embeddings Word embeddings are trained by predicting either a target word given its context (Continuous Bag of Words) or elements of the context given a target (SkipGram) in unannotated corpora (Mikolov et al., 2013a), with the learned vectors modeling how words relate to each other. Embeddings have been adapted to incorporate word order (Trask et al., 2015) or subword information (Bojanowski et al., 2016) to motivate the learned vectors to specifically capture syntactic, morphological, or other similarities.

Word embeddings are generally single prototype models, in that they learn one vector for each word, which can be problematic for ambiguous forms (Reisinger and Mooney, 2010; Huang et al., 2012; Chen et al., 2014). Bartunov et al. (2016) propose a multi prototype model that learns distinct vectors for distinct meanings of types based on variation in the contexts within which they appear. Gyllensten and Sahlgren (2015) argue that single prototype embeddings actually can model ambiguity because the defining characteristics of a word's different meanings typically manifest in different dimensions of the highly dimensional vector space. They find ambiguous words' relative nearest neighbors in a relative neighborhood graph often correlate with distinct meanings. Such works, however, deal with sense ambiguity, or abstract semantic distinctions between different usages of a word with potentially the same morpho-syntactic properties and core meaning. Evaluation usually requires linking to large semantic databases which, for Arabic, are still underdeveloped (Black et al., 2006; Badaro et al., 2014; Elrazzaz et al., 2017).

Computational Morphology This field of study includes rule-based, machine learning, and hybrid approaches to modeling morphology. The traditional approach is to hand write rules to identify the morphological properties of words (Beesley, 1998; Khoja and Garside, 1999; Habash and Rambow, 2006; Smrž, 2007; Graff et al., 2009; Habash, 2010). These can be used for out-of-context analysis, which SAMA (Graff et al., 2009) performs for MSA, or they can be combined with machine learning approaches that leverage information from the context in which a word appears. MADAMIRA (Pasha et al., 2014), for example, is trained on an annotated corpus to disambiguate SAMA's analyses based on the surrounding sentence.

Other systems use machine learning without rules. They can train on annotated data, like Faruqui et al. (2016) who learn morpho-syntactic lexica from a small seed, or they can learn without supervision, like Luo et al. (2017) who induce "morphological forests" of derivationally related words by predicting suffixes and prefixes based on the vocabulary alone. Some approaches seek to be language independent. MORFESSOR (Creutz and Lagus, 2005), for instance, segments words based on unannotated text. However, it deterministically produces context-irrelevant segmentations, causing error propagation in languages like Arabic, characterized by high lexical ambiguity (Saleh and Habash, 2009; Pasha et al., 2014). A few systems have incorporated word embeddings to perform segmentation (Narasimhan et al., 2015; Soricut and Och, 2015; Cao and Rei, 2016), with some attempting to model and analyze relations between underlying morphemes as well (Bergmanis and Goldwater, 2017; Sakakini et al., 2017), though none of these distinguish between inflectional and derivational morphology. Eskander et al. (2016b) propose another segmentation system using Adaptor Grammars for six typologically distinct languages. Snyder and Barzilay (2010) actually use multiple languages simultaneously, finding the parallels between them useful for disambiguation in morphological and syntactic tasks.

Our work is closely related to Avraham and Goldberg (2017), who train embeddings on a Hebrew corpus with disambiguated morpho-syntactic information appended to each token. Similarly, Cotterell and Schütze (2015) "guide" German word embeddings with morphological annotation, and Gieske (2017) uses morphological information encoded in word embeddings to inflect German verbs. For Arabic, Rasooli et al. (2014) induce paradigmatic knowledge from raw text to produce unseen inflections, and Eskander et al. (2013) identify orthographic roots and use them to extract features for paradigm completion given annotated data. While we adopt the concept of approximating the linguistic root with an orthographic root, we do not use annotated data where the stem has already been determined as in Eskander et al. (2013). Thus, we generate all possible orthographic roots for a given word instead of just one, as discussed in Section 3.3.

Sakakini et al. (2017) provide an alternative unsupervised technique for extracting roots in Semitic languages; however, we chose to adopt the orthographic root concept instead for several reasons. Firstly, despite performing comparably with other empirical techniques, Sakakini et al. (2017)'s root extractor is not extremely accurate. While our implementation generates potentially multiple orthographic roots with imperfect precision, the near perfect recall is useful for pruning without propagating error. A major reason why we find DELEX and FT+ to complement one another is the independence of the orthographic root extraction rules and the distributional statistics leveraged by word embeddings. Sakakini et al. (2017)'s root extractor, however, depends on embeddings to identify roots. Furthermore, their root extractor cannot be used to generate multi prototype models as it only produces one root per word. Finally, despite orthographic roots' dependence on hand written rules, we show that these rules are very few, such that adapting Sakakini et al. (2017)'s root extractor to a new language or dialect would not necessarily require any less effort than writing new rules.

6 Conclusion and Future Work

In this work, we demonstrated that out-of-context, rule-based knowledge of morphological structure, even in minimal supply, greatly complements what word embeddings can learn about morphology from words' in-context behaviors. We discussed how Arabic's morphological richness and many forms of ambiguity interact with different word similarity models' ability to represent morphological structure in a paradigm clustering task. Our work quantifies the value of leveraging subword information when learning embeddings and the further value of noise reduction techniques targeting the sparsity caused by complex morphology. Our best performing model uses out-of-context rules to prune unlikely paradigm mates suggested by our best embedding model, achieving an F-score of 71.5% averaged over our evaluation vocabulary. Our results inform how one would most cost effectively construct morphological resources for DA or similarly under resourced, morphologically complex languages.

Our future work will target templatic morphological processes which still challenge our best model, requiring knowledge of patterns realized over non-adjacent characters. We will also address errors due to ambiguity, either by adapting multi prototype embedding models to capture morphological ambiguity, including knowledge of paradigm structure in our de-lexicalized rules, or by using disambiguated lemma frequencies to model ambiguity probabilistically. In applying this work to DA, we will additionally need to address the issue of noisy, unstandardized spelling. We will also investigate different knowledge transfer techniques to leverage the many resources available for MSA.


References

Ahmed Abdelali, Kareem Darwish, Nadir Durrani, and Hamdy Mubarak. 2016. Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 11–16.

Faisal Al-Shargi, Aidan Kaplan, Ramy Eskander, Nizar Habash, and Owen Rambow. 2016. Morphologically annotated corpora and morphological analyzers for Moroccan and Sanaani Yemeni Arabic. In 10th Language Resources and Evaluation Conference (LREC 2016).

Sarah Alkuhlani and Nizar Habash. 2011. A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL'11), Portland, Oregon, USA.

Amjad Almahairi, Kyunghyun Cho, Nizar Habash, and Aaron C. Courville. 2016. First result on Arabic neural machine translation. CoRR, abs/1606.02680.

Oded Avraham and Yoav Goldberg. 2017. The interplay of semantics and morphology in word embeddings. arXiv preprint arXiv:1704.01938.

Gilbert Badaro, Ramy Baly, Hazem Hajj, Nizar Habash, and Wassim El-Hajj. 2014. A large scale Arabic sentiment lexicon for Arabic opinion mining. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 165–173.

Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry Vetrov. 2016. Breaking sticks and ambiguities with adaptive skip-gram. In Artificial Intelligence and Statistics, pages 130–138.

Kenneth Beesley. 1998. Arabic morphology using only finite-state operations. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, pages 50–7, Montreal.

Toms Bergmanis and Sharon Goldwater. 2017. From segmentation to analyses: A probabilistic model for unsupervised morphology induction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, volume 1, pages 337–346.

William Black, Sabri Elkateb, Horacio Rodriguez, Musa Alkhalifa, Piek Vossen, Adam Pease, and Christiane Fellbaum. 2006. Introducing the Arabic wordnet project. In Proceedings of the third international WordNet conference, pages 295–300. Citeseer.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Kris Cao and Marek Rei. 2016. A joint model for word embedding and word morphology. arXiv preprint arXiv:1606.02601.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1025–1035.

David Chiang, Mona Diab, Nizar Habash, Owen Rambow, and Safiullah Shareef. 2006. Parsing Arabic Dialects. In Proceedings of EACL, Trento, Italy. EACL.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. CoRR, abs/1706.09031.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 2016 Meeting of SIGMORPHON, Berlin, Germany. Association for Computational Linguistics.

Ryan Cotterell and Hinrich Schütze. 2015. Morphological word-embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1287–1292.

Mathias Creutz and Krista Lagus. 2005. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Helsinki University of Technology.

Kareem Darwish. 2002. Building a shallow Arabic morphological analyzer in one day. In Computational Approaches to Semitic Languages, an ACL'02 Workshop, pages 47–54, Philadelphia, PA.

Ahmed El Kholy and Nizar Habash. 2012. Orthographic and morphological processing for English–Arabic statistical machine translation. Machine Translation, 26(1-2):25–45.

Mohammed Elrazzaz, Shady Elbassuoni, Khaled Shaban, and Chadi Helwe. 2017. Methodical evaluation of Arabic word embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 454–458, Vancouver, Canada.

Alexander Erdmann, Nizar Habash, Dima Taji, and Houda Bouamor. 2017. Low resourced machine translation via morpho-syntactic modeling: The case of Dialectal Arabic. In Proceedings of MT Summit 2017, Nagoya, Japan.

Alexander Erdmann, Nasser Zalmout, and Nizar Habash. 2018. Addressing noise in multidialectal word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 558–565.

Ramy Eskander, Nizar Habash, and Owen Rambow. 2013. Automatic extraction of morphological lexicons from morphologically annotated corpora. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1032–1043.

Ramy Eskander, Nizar Habash, Owen Rambow, and Arfath Pasha. 2016a. Creating resources for Dialectal Arabic from a single annotation: A case study on Egyptian and Levantine. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3455–3465, Osaka, Japan.

Ramy Eskander, Owen Rambow, and Tianchun Yang. 2016b. Extending the use of adaptor grammars for unsupervised morphological segmentation of unseen languages. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 900–910.

Manaal Faruqui, Ryan McDonald, and Radu Soricut. 2016. Morpho-syntactic lexicon generation using graph-based semi-supervised learning. Transactions of the Association for Computational Linguistics, 4:1–16.

Sharon Gieske. 2017. Inflecting verbs with word embeddings: A systematic investigation of morphological information captured by German verb embeddings. Master's thesis, University of Amsterdam.

Yoav Goldberg. 2016. A primer on neural network models for natural language processing. J. Artif. Intell. Res. (JAIR), 57:345–420.

David Graff, Mohamed Maamouri, Basma Bouziri, Sondos Krouna, Seth Kulick, and Tim Buckwalter. 2009. Standard Arabic morphological analyzer (SAMA) version 3.1. Linguistic Data Consortium LDC2009E73.

Amaru Cuba Gyllensten and Magnus Sahlgren. 2015. Navigating the semantic horizon using relative neighborhood graphs. arXiv preprint arXiv:1501.02670.

Nizar Habash, Ramy Eskander, and Abdelati Hawwari. 2012. A morphological analyzer for Egyptian Arabic. In Proceedings of the twelfth meeting of the special interest group on computational morphology and phonology, pages 1–9. Association for Computational Linguistics.

Nizar Habash and Owen Rambow. 2006. MAGEAD: a morphological analyzer and generator for the Arabic dialects. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 681–688. Association for Computational Linguistics.

Nizar Habash, Abdelhadi Soudi, and Tim Buckwalter. 2007. On Arabic Transliteration. In A. van den Bosch and A. Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods. Springer.

Nizar Y Habash. 2010. Introduction to Arabic natural language processing, volume 3. Morgan & Claypool Publishers.

Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 873–882. Association for Computational Linguistics.

Go Inoue, Hiroyuki Shindo, and Yuji Matsumoto. 2017. Joint prediction of morphosyntactic categories for fine-grained Arabic part-of-speech tagging exploiting tag dictionary information. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 421–431, Vancouver, Canada. Association for Computational Linguistics.

Mustafa Jarrar, Nizar Habash, Diyam Akra, and Nasser Zalmout. 2014. Building a corpus for Palestinian Arabic: a preliminary study. In Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pages 18–27.

Salam Khalifa, Nizar Habash, Fadhl Eryani, Ossama Obeid, Dana Abdulrahim, and Meera Al Kaabi. 2018. A Morphologically Annotated Corpus of Emirati Arabic. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.

Salam Khalifa, Sara Hassan, and Nizar Habash. 2017. A morphological analyzer for Gulf Arabic verbs. WANLP 2017 (co-located with EACL 2017), page 35.

Salam Khalifa, Nasser Zalmout, and Nizar Habash. 2016. Yamama: Yet another multi-dialect Arabic morphological analyzer. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pages 223–227.

Shereen Khoja and Roger Garside. 1999. Stemming Arabic text. Lancaster, UK, Computing Department, Lancaster University.

Jiaming Luo, Karthik Narasimhan, and Regina Barzilay. 2017. Unsupervised learning of morphological forests. arXiv preprint arXiv:1702.07015.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

Daniel Müllner et al. 2013. fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, 53(9):1–18.

Karthik Narasimhan, Regina Barzilay, and Tommi Jaakkola. 2015. An unsupervised method for uncovering morphological chains. arXiv preprint arXiv:1503.02335.

Robert Parker, David Graff, Ke Chen, Junbo Kong, and Kazuaki Maeda. 2011. Arabic Gigaword Fifth Edition. LDC catalog number No. LDC2011T11, ISBN 1-58563-595-2.

Arfath Pasha, Mohamed Al-Badrashiny, Ahmed El Kholy, Ramy Eskander, Mona Diab, Nizar Habash, Manoj Pooleery, Owen Rambow, and Ryan Roth. 2014. MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic. In Proceedings of LREC, Reykjavik, Iceland.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.

Mohammad Sadegh Rasooli, Thomas Lippincott, Nizar Habash, and Owen Rambow. 2014. Unsupervised morphology-based vocabulary expansion. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1349–1359.

Radim Rehurek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Joseph Reisinger and Raymond J Mooney. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 109–117. Association for Computational Linguistics.

Tarek Sakakini, Suma Bhat, and Pramod Viswanath. 2017. Fixing the infix: Unsupervised discovery of root-and-pattern morphology. arXiv preprint arXiv:1702.02211.

Ibrahim M. Saleh and Nizar Habash. 2009. Automatic extraction of lemma-based bilingual dictionaries for morphologically rich languages. In Proceedings of MT Summit, Ottawa, Canada.

Wael Salloum and Nizar Habash. 2014. ADAM: Analyzer for Dialectal Arabic Morphology. Journal of King Saud University-Computer and Information Sciences, 26(4):372–378.

Otakar Smrž. 2007. Functional Arabic Morphology. Formal System and Implementation. Ph.D. thesis, Charles University, Prague.

Benjamin Snyder and Regina Barzilay. 2010. Climbing the tower of Babel: Unsupervised multilingual learning. In Proceedings of the International Conference on Machine Learning (ICML-10), Haifa, Israel.

Radu Soricut and Franz Och. 2015. Unsupervised morphology induction using word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1627–1637.

Andrew Trask, David Gilmore, and Matthew Russell. 2015. Modeling order in neural word embeddings at scale. arXiv preprint arXiv:1506.02338.

Lifu Tu, Kevin Gimpel, and Karen Livescu. 2017. Learning to embed words in context for syntactic tasks. arXiv preprint arXiv:1706.02807.

Joe H Ward Jr. 1963. Hierarchical grouping to optimize an objective function. Journal of the American statistical association, 58(301):236–244.

Nasser Zalmout, Alexander Erdmann, and Nizar Habash. 2018. Noise-robust morphological disambiguation for dialectal Arabic. In Proceedings of the 16th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL18), New Orleans, Louisiana, USA.

Nasser Zalmout and Nizar Habash. 2017. Don't throw those morphological analyzers away just yet: Neural morphological disambiguation for Arabic. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 715–724.

Inès Zribi, Mariem Ellouze, Lamia Hadrich Belguith, and Philippe Blache. 2017. Morphological disambiguation of Tunisian dialect. Journal of King Saud University-Computer and Information Sciences, 29(2):147–155.