21
KLPT – Kurdish Language Processing Toolkit Sina Ahmadi Insight Centre for Data Analytics National University of Ireland Galway https://sinaahmadi.github.io The 2nd Workshop for Natural Language Processing Open Source Software (NLPOSS) EMNLP 2020 November 2020 Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 1 / 21

KLPT – Kurdish Language Processing Toolkit

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: KLPT – Kurdish Language Processing Toolkit

KLPT – Kurdish Language Processing Toolkit

Sina Ahmadi

Insight Centre for Data AnalyticsNational University of Ireland Galway

https://sinaahmadi.github.io

The 2nd Workshop for Natural Language Processing Open Source Software (NLP­OSS)EMNLP 2020

November 2020

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 1 / 21

Page 2: KLPT – Kurdish Language Processing Toolkit

Table of Contents

1 Introduction

2 Kurdish LanguageGeneral descriptionCurrent state of Kurdish language processing (KLP)What is wrong?

3 Kurdish Language Processing Toolkit (KLPT)KLPT StructurePreprocessTransliterateStemTokenize

4 Conclusion

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 2 / 21

Page 3: KLPT – Kurdish Language Processing Toolkit

Introduction7,117 languages are spoken in the world1a big proportion of these languages are endangered, minority or less­resourcedrecent focus on applying language­independent approaches to various tasks innatural language processing (NLP) and computational linguistics using artificialintelligencelanguage­specific tools are still essential to process a language in a viable way

1Source: https://www.ethnologue.com/guides/how-many-languages

* Image source: https://ruder.io/unsupervised-cross-lingual-learning

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 3 / 21

Page 4: KLPT – Kurdish Language Processing Toolkit

Table of Contents

1 Introduction

2 Kurdish LanguageGeneral descriptionCurrent state of Kurdish language processing (KLP)What is wrong?

3 Kurdish Language Processing Toolkit (KLPT)KLPT StructurePreprocessTransliterateStemTokenize

4 Conclusion

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 4 / 21

Page 5: KLPT – Kurdish Language Processing Toolkit

Kurdish Language

an Indo­European languagespoken by 20­30 million speakersspoken in many dialects and subdialects (dialects or languages?)written in many scripts, among which the Latin­based and Arabic­based ones arestill widely in use

*

INDO­EUROPEAN

WESTERN IRANIAN

NORTH

Median

Parthian

Balochi Kurdish (kur)

Laki (lki) Southern (sdh) Central (ckb)(Sorani)

Northern (kmr)(Kurmanji)

Zaza­Gorani

Gorani (hac)Zazaki (zza)Shabaki (sdb)

Source: https://www.britannica.com/topic/Kurd

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 5 / 21

Page 6: KLPT – Kurdish Language Processing Toolkit

using more than one script for a language,not only scatters readers but also createsfurther challenges in text processingwritten in various orthographies followingdifferent conventions

▶ di sala 2020’an | 2020­an | 2020an de“in the year 2020”

▶ hêviya, hêvîya or hêvî ya “hope of”?▶ ٠١٢٣٤٥٦٧٨٩, ۰۱۲۳۴۵۶۷۸۹ or

0123456789?although Kurdish orthographies arephonemic, there is not always aone­to­one relation between graphemes,particularly due to:

▶ double­usage characters: ی for î/y and وfor u/w

▶ variations in some orthographies such asl, ll or ł for [ì]

▶ vowel i has no equivalent in theArabic­based orthography

������� وەر�� ����ات، ��ود�� زا������ ��وو��� ����و ��ر���و ��دار، ��وەڕ، د�� و ���� ���� ��������

����ووژ ���و�� ��رە�� ������ ������ �� ��ول ��ژ��رو�� دەوران ����

Laki (Arabic script)

�� ڕا����ا ��م ��رە����ا�� ��ر �� �������ی ����������رد���ن و �������� ڕا��دوون

Sorani (Arabic script)

Ji ber barîna berfê li bajarê Wan û navçeyaTetwan a Bedlîsê dîmenên ciwan derketinholê.

Kurmanji (Latin script)

Southern Kurdish (Arabic script)

وه زاره �� �� و���� و��رو��ر�� ������ ل �� ر��� ��رد�����ل دۆر ������ دا�� �� �� ر�� ب �� ��� ��� �� ������ رۆ������ ك

ده ر��

Kurmanji (Arabic script)

Bergirî lem bwareda her le yekemîn rojekanîdamezrandinî komarî Turkyawe hate gořê.

Sorani (Latin script)

[ckb-ar][kmr-latn]

[lki-ar]

[ckb-latn]

[sdh-ar]

[kmr-ar]

Kurdish

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 6 / 21

Page 7: KLPT – Kurdish Language Processing Toolkit

Current state of Kurdish language processing (KLP)

the earliest works in the field of KLP date back to 2009thus far, a total number of 53 publications are published in a field directly relatedto KLP (as of August 2020)two open­source volunteer­based projects:

▶ Kurdish Language Processing Project (KLPP2) in 2012▶ Kurdish Basic Language Resource Kit (Kurdish­BLARK3) in 2014

a few number of non­scientific contributions

Open­sourceDoes the paper provide the discussed resource or tool under an open­source license?

ApplicabilityDoes the paper, implicitly or explicitly, propose an approach or methodology that canbe applied to solve the same problem in the other dialects of Kurdish?

2http://klpp.github.io/3https://kurdishblark.github.io/

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 7 / 21

Page 8: KLPT – Kurdish Language Processing Toolkit

Current state of KLP

2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020

0

2

4

6

8

10

12

Num

bero

fpub

licat

ions

total applicable open-source

Figure 2: Number of scientific publications directly related to Kurdish language processing per year

putational linguistics, we reviewed the scientificpublications that directly address an issue in thosefields. A total number of 53 publications are col-lected from the widely-used academic databasesand search engines such as Google Scholar5, andthen classified based on their discussed sub-fieldswhich are illustrated in Figure 1. The Kurdishdialects are not evenly discussed in the previousstudies, with Sorani making up a predominant pro-portion of almost 90%. Although a smaller pro-portion represents the Kurmanji dialect, no publi-cation is found with respect to processing of theSouthern Kurdish or Laki dialects. Regarding theresearch focus of the previous works, a range ofNLP sub-fields has been addressed, particularly intext mining, morphological and syntactic analysisand, creation of lexical resources. We exception-ally included optical character recognition as it isof importance for converting printed material toelectronic forms (Ahmadi et al., 2019). The fulllist of the surveyed papers can be found in Ap-pendix A.2.

More importantly, we analyze previous publica-tions from the following two perspectives:

• Open-source: Does the paper provide the dis-cussed resource or tool under an open-sourcelicense? To this end, we verified the con-tent of the papers and also, checked the Web,particularly major distributed version controlsystems such as GitHub6, GitLab7 and Bit-Bucket8.

5https://scholar.google.com

6https://github.com

7https://gitlab.com

8https://bitbucket.org

• Applicability: Does the paper, implicitly orexplicitly, propose an approach or method-ology that can be applied to solve the sameproblem in the other dialects of Kurdish? Forthe choice of the word, we were inspiredby (Årdal et al., 2011) where the possibil-ity of applying common practices of softwaredevelopment for drug discovery are investi-gated. For instance, (Ahmadi et al., 2019)is deemed an applicable contribution wherelexicographical resources can be created forother dialects. On the other hand, (Ahmadi,2019) is not applicable to other dialects dueto its ad-hoc solution for transliterating So-rani texts according to its phonological andphonetic rules.

Figure 2 provides the number of previous pub-lications in the Kurdish language processing fieldper year, and specifies their open-source status andtheir applicability. Although most of these pub-lications are applicable to other dialects, only 18out of 53 of them provide their resources or toolsunder an open-source license. Among the open-source ones, 11 are outcomes of volunteeringprojects, KLPP and Kurdish-BLARK. Given thesmall number of non-scientific contributions, wedid not include them in this survey. A few notableexamples of such contributions are Kurdînûs9, Ve-jin Dictionaries10 and VejinBooks11 which mostlyfocus on Sorani Kurdish and script conversiontasks.

9https://github.com/aso-mehmudi/

kurdinus

10https://lex.vejinbooks.com

11https://books.vejin.net

9.25 %

Dialectology

22.22 %

Information retrieval and Text mining

18.51 %

Lexical resources

3.70 %Machine Translation

12.96 %

Speech recognition18.8 %

Morphological and syntactic analysis

1.85 %

Other

11.11 %

Optical character recognition

3.70 % Sign language recognition

(a) per sub-field in NLP

Sorani

Kurmanji

Sorani and Kurmanji

Sorani, Kurmanji and Gorani

78.18%

7.27%

10.90%

3.63%

(b) per dialect

Figure 1: Proportion of publications related to Kurdish language processing

verbs, Kurdish has a few number of around 300single-word verbs (Walther and Sagot, 2010), e.g.kirdin/kirin “to do”, which are inflected based onperson (1,2,3, SG, PL), tense (past, present, future),aspect (indefinite, perfect, progressive, imperfec-tive) and mood (indicative, subjunctive, condi-tional). Unlike Kurmanji, Sorani Kurdish doesnot have future tense and uses adverbs for thispurpose. However, Kurdish extensively takes useof compound constructions for creating new verbforms, particularly with (Noun + Verb), (Adjective+ Verb) and (Preposition + Verb) forms (Traida,2007). For instance, siław ‘hi (n)’, pîroz ‘holy’(adj) and heł (verbal particle denoting ‘up’) withthe single-word verb kirdin can respectively formcompound verbs siław kirdin “to greet”, pîrozkirdin “to congratulate” and heł kirdin “to turnon”. The stringing characteristic of the Arabic-based script of Kurdish further adds to this mor-phological complexity in such a way that severalword forms may be concatenated together (Ah-madi, 2020b).

Regarding syntax, Kurdish has a sub-ject–object–verb word order and is a null-subject(or pro-drop) language. The presence of gram-matical markers for nominative and obliquecases varies within dialects and subdialects. Forinstance, in the Sorani subdialects of Sulay-maniyah and Erbil, respectively categorized asSouthern Sorani and Northern Sorani by (Matras,2017), the oblique case is marked differently.Another particularity of the Kurdish languageis its morphosyntactic alignment in the pasttense of transitive verbs. In such tenses, anergative–absolutive alignment occurs where thesubject of intransitive verbs behaves like thepatient of the transitive verb in the past (Haig,

1998; Karimi, 2014). Unlike Kurmanji whichuses oblique cases for this purpose, Sorani onlyuses different pronominal markers to specify erga-tivity, therefore it is called split-ergative (Esmailiand Salavati, 2013). Except the past tenses, anominative-accusative alignment is observed inother tenses.

Not being equally documented and used, Kur-dish dialects have different levels of linguisticresourcefulness. In comparison to Sorani andKurmanji which are widely used by the mediaand press, Southern Kurdish and Laki are under-documented and lack basic language resourcessuch as electronic dictionaries and corpora (Fat-tah, 2000; Ahmadi et al., 2019; Ahmadi, 2020c).

3 Current State of Kurdish LanguageProcessing

The earliest works in the field of Kurdish languageprocessing date back to 2009. Our literature re-view indicates that some of these contributionsfail to provide open-source solutions. Despite fi-nancial and scientific constraints in Kurdish lan-guage processing, the Kurdish Language Process-ing Project (KLPP) (Esmaili et al., 2013) in 20123

and Kurdish Basic Language Resource Kit (Kur-dish BLARK) (Hassani, 2018) in 20144 have suc-ceeded to promote an open-source vision basedon research volunteering within the Kurdish sci-entific communities. However, the outcomes ofthese projects are mostly released in an unorga-nized manner for individual tasks.

In order to understand the current state of theKurdish language in the realm of NLP and com-

3http://klpp.github.io

4https://kurdishblark.github.io

Figure: Number of scientific publications directly related to KLP per year and field

most of these publications are applicableonly 18 provide their resources or tools under an open­source licenseamong the open­source ones, 11 are outcomes KLPP and Kurdish­BLARKSorani makes up a predominant proportion of almost 90% of publicationsno publication addresses the processing of Southern Kurdish or LakiKurdish still lacks basic language processing tools such as part­of­speech tagger,stemmer, lemmatizer and so on

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 8 / 21

Page 9: KLPT – Kurdish Language Processing Toolkit

Current state of KLP: What is wrong?

Many projects overlap significantly, yetnone of them provide a solution under anyopen­source license

▶ Stemming is addressed at least five times[Jaff, 2014, Salavati and Ahmadi, 2018,Mustafa and Rashid, 2018,Saeed et al., 2018, Hawezi et al., 2019]

*Some are hardly integrable or inter­operable

▶ A large­scale morphological lexicon and a part­of­speech tagger for Kurdish withinthe Alexina framework [Walther and Sagot, 2010, Walther et al., 2010]

Released in an unorganized manner for individual tasks▶ Example: a transliteration tool for Kurdish [Ahmadi, 2019a]

Further progress is hindered in the fieldKurdish is still a less­resourced language

* Image source: https://www.aic.cuhk.edu.hk/web8/Reinventingthewheel.htm

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 9 / 21

Page 10: KLPT – Kurdish Language Processing Toolkit

Table of Contents

1 Introduction

2 Kurdish LanguageGeneral descriptionCurrent state of Kurdish language processing (KLP)What is wrong?

3 Kurdish Language Processing Toolkit (KLPT)KLPT StructurePreprocessTransliterateStemTokenize

4 Conclusion

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 10 / 21

Page 11: KLPT – Kurdish Language Processing Toolkit

Kurdish Language Processing Toolkit (KLPT)

a basic but extendable language processing toolkitan effort to standardize Kurdish language with all its dialects and scriptsimplemented in Pythoninspired by the functionality of relevant NLP toolkits, e.g. NLTK and spaCyno external NLP library is used in this toolkitcomposed of core modules for Sorani andKurmanji for the following tasks:

▶ text preprocessing▶ stemming▶ lemmatization▶ spelling error detection and correction▶ transliteration▶ morphological analyzer and generator▶ tokenization

it is open­source!→ https://github.com/sinaahmadi/klpt

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 11 / 21

Page 12: KLPT – Kurdish Language Processing Toolkit

KLPT: Structure

no hard­codingeasily extendable for otherdialects and taskseach package corresponds to a setof related taskscomposed of four core NLPpackages as follows:

▶ preprocess▶ transliterate

([Ahmadi, 2019a])▶ stem ([Ahmadi, 2020e])▶ tokenize ([Ahmadi, 2020b,

Ahmadi, 2020c])

4 KLPT Architecture

KLPT is implemented in Python and is composedof four core modules with specific tasks. Althoughwe were inspired by the functionality of relevantNLP toolkits, particularly NLTK and spaCy, noexternal library is used in this toolkit. Regardingthe toolkit design, we followed the rules of sci-entific software development suggested by (Prlicand Procter, 2012) along with common practicesin Python programming language. Figure 3 pro-vides the structure of the toolkit. In order to fa-cilitate the integration of variations specific to di-alects and scripts and more importantly, to avoidhard-coding, required files are provided in thedata folder. For instance, the data requiredfor the preprocess module is imported frompreprocess.json. In addition, third-partyprograms can be provided in bin. test anddocs respectively contain test cases and projectdocumentation. Regarding the latter, we useSphinx documentation generator12.

It is worth noting that each module within theklpt package has been previously studied andevaluated separately. Our goal is to introduce thefunctionality of the modules within the toolkit inthis section.

4.1 PreprocessMany keyboard layouts are specifically designedfor Kurdish where different character encoding areassigned to visually-similar graphemes. In addi-tion to the usage of non-Kurdish keyboards, suchas Arabic, Turkish and Persian keyboards, suchdiversity creates abnormality across texts in Kur-dish writing. For instance, the grapheme � (î/y),can be represented as � (U+064A), � (U+0649),�� (U+FEF2), � (U+FEF1) and � (U+06CC),among which only the latter should be used in theArabic-based script of Kurdish. Moreover, variouswriting conventions are used for each dialect andscript. For instance, in Kurmanji, when dates areaffixed with a morpheme, the suffix may be sepa-rated by ’, - or without any marker as in 2020’an,2020-an and 2020an.

To remedy such issues in an automatic andstructured manner, the preprocess moduleprovides two main functions: normalize()

for normalizing encoding abnormalities by unify-ing characters in such a way that only one spe-cific encoding is used for each grapheme and,

12https://www.sphinx-doc.org

standardize() which applies orthographicconventions to the text. For example, when hêvî‘hope’ is suffixed with the vowel a (Izafa, mean-ing ‘of’), a semi-vowel y appears between the twovowels and is usually written as hêviya or hêvîya‘hope of’. As the latter form is considered less am-biguous, this function converts the first form ac-cordingly. Although defining a universal orthogra-phy for Kurdish is out of scope of our project, webelieve that writing conventions and orthographiesshould be addressed to some extent. Therefore, inthis initial version, we follow the writing conven-tions proposed by (Aydogan, 2012) for Kurmanjiand (Hashemi, 2016) for Sorani.

In addition to these two functions,unify_numeral() is provided to convertnumerals, namely in Farsi (����������),Eastern Arabic (����������) and Western Ara-bic (0123456789). Although we set the latteras default for all scripts, users will have the

klptbin

data

ckb-morphemes.json

ckb_Hunspell.aff

ckb_Hunspell.dic

default-options.json

kmr-morphemes.json

kmr_Hunspell.aff

kmr_Hunspell.dic

lexicon_ckb_arab.json

lexicon_ckb_latn.json

lexicon_kmr_latn.json

preprocess.json

stopwords_kmr.txt

stopwords_ckb.txt

tokenize.json

transliterate.json

docs

klpt

__init__.py

configuration.py

preprocess.py

stem.py

tokenize.py

transliterate.py

test

setup.py

requirements.txt

Figure 3: Structure of KLPTSina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 12 / 21

Page 13: KLPT – Kurdish Language Processing Toolkit

KLPT Packages: Preprocess

Goal: Handle diversities in scripts and orthographies in an automatic and formalizedway

1 normalize(): normalize text by unifying character encodings▶ Example: the grapheme ی (U+06CC, î/y), may be represented as

ي (U+064A), ى (U+0649), ي (U+FEF2) or ي (U+FEF1)2 standardize(): standardize scripts and orthographies by using writing

conventions based on dialects and scripts3 unify_numeral(): convert Farsi, Eastern and Western Arabic numerals

Example>>> from klpt.preprocess import Preprocess >>> preprocessor = Preprocess("Sorani", "Arabic", numeral="Latin") >>> preprocessor.normalize("لە ســـا,ەکانی ١٩٥٠دا") لە سا,ەکانی 1950دا>>> preprocessor.standardize("راستە لەو وو1تەدا") ڕاستە لەو و1تەدا

>>> from klpt.transliterator import Transliterate >>> transliterator = Transliterate("Kurmanji", "Latin", target_script="Arabic") >>> transliterator.transliterate("rojhilata navîn") 'رۆژھلاتا ناڤین'

>>> from klpt.stem import Stem >>> stemmer = Stem("Sorani", "Arabic") >>> stemmer.check_spelling("سوتاندبووت") False >>> stemmer.correct_spelling("سوتاندبووت") ('سووتاندبووت', 'سووتاندت', 'سووتاندن', 'سووتاند')>>> stemmer.stem("سووتاندبووت") (,'سووت')>>> stemmer.analyze("دیتبامن") {'pos': 'verb', 'is': 'past_intransitive', 'stem': 'دی', ‘verb_stem': 'دیت', 'terminal_suffix': 'بامن'}

>>> from klpt.tokenize import Tokenize # Tokenize module >>> tokenizer = Tokenize("Kurmanji", "Latin") >>> tokenizer.word_tokenize("endamên encûmena wezîrên") ['▁endam▁ên', '▁encûmen▁a', '▁wezîr▁ên']

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 13 / 21

Page 14: KLPT – Kurdish Language Processing Toolkit

KLPT Packages: Transliterate

transliterating the Arabic­based and Latin­based scripts of Kurdish to oneanother, e.g. برا → bira ‘brother’based on the rule­based approach of [Ahmadi, 2019a] which

▶ detects double usage characters▶ predicts the presence of the missing i, a.k.a Bizroke▶ finds the syllabic pattern of a given word based on Kurdish phonetics

beneficial to many NLP tasks such as named­entity recognition

Example

>>> from klpt.preprocess import Preprocess >>> preprocessor = Preprocess("Sorani", "Arabic", numeral="Latin") >>> preprocessor.normalize("لە ســـا,ەکانی ١٩٥٠دا") لە سا,ەکانی 1950دا>>> preprocessor.standardize("راستە لەو وو1تەدا") ڕاستە لەو و1تەدا

>>> from klpt.transliterator import Transliterate >>> transliterator = Transliterate("Kurmanji", "Latin", target_script="Arabic") >>> transliterator.transliterate("rojhilata navîn") 'رۆژھلاتا ناڤین'

>>> from klpt.stem import Stem >>> stemmer = Stem("Sorani", "Arabic") >>> stemmer.check_spelling("سوتاندبووت") False >>> stemmer.correct_spelling("سوتاندبووت") ('سووتاندبووت', 'سووتاندت', 'سووتاندن', 'سووتاند')>>> stemmer.stem("سووتاندبووت") (,'سووت')>>> stemmer.analyze("دیتبامن") {'pos': 'verb', 'is': 'past_intransitive', 'stem': 'دی', ‘verb_stem': 'دیت', 'terminal_suffix': 'بامن'}

>>> from klpt.tokenize import Tokenize # Tokenize module >>> tokenizer = Tokenize("Kurmanji", "Latin") >>> tokenizer.word_tokenize("endamên encûmena wezîrên") ['▁endam▁ên', '▁encûmen▁a', '▁wezîr▁ên']

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 14 / 21

Page 15: KLPT – Kurdish Language Processing Toolkit

KLPT Packages: Stem

an annotated lexicon + morphological rules using Hunspell4 for:▶ spelling error detection and correction→ also usable in text editors such as

LibreOffice▶ morphological analyzer and generator▶ stemmer

a rule­based lemmatization systembased on [Ahmadi, 2020c, Ahmadi, 2020e]

Example

>>> from klpt.preprocess import Preprocess >>> preprocessor = Preprocess("Sorani", "Arabic", numeral="Latin") >>> preprocessor.normalize("لە ســـا,ەکانی ١٩٥٠دا") لە سا,ەکانی 1950دا>>> preprocessor.standardize("راستە لەو وو1تەدا") ڕاستە لەو و1تەدا

>>> from klpt.transliterator import Transliterate >>> transliterator = Transliterate("Kurmanji", "Latin", target_script="Arabic") >>> transliterator.transliterate("rojhilata navîn") 'رۆژھلاتا ناڤین'

>>> from klpt.stem import Stem >>> stemmer = Stem("Sorani", "Arabic") >>> stemmer.check_spelling("سوتاندبووت") False >>> stemmer.correct_spelling("سوتاندبووت") ('سووتاندبووت', 'سووتاندت', 'سووتاندن', 'سووتاند')>>> stemmer.stem("سووتاندبووت") (,'سووت')>>> stemmer.analyze("دیتبامن") {'pos': 'verb', 'is': 'past_intransitive', 'stem': 'دی', ‘verb_stem': 'دیت', 'terminal_suffix': 'بامن'}

>>> from klpt.tokenize import Tokenize # Tokenize module >>> tokenizer = Tokenize("Kurmanji", "Latin") >>> tokenizer.word_tokenize("endamên encûmena wezîrên") ['▁endam▁ên', '▁encûmen▁a', '▁wezîr▁ên']

4http://hunspell.github.io

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 15 / 21

Page 16: KLPT – Kurdish Language Processing Toolkit

KLPT Packages: Tokenize

detect word and sentence boundaries→ a non trivial task:▶ orthographic inconsistencies, e.g. how compounds words are separated?▶ excessive concatenation, e.g. لەوێشدایە (lewêşdaye) “(it) is also there” is written as a

word but is composed of five tokens le, wê, ş, da, ye

split a text into sentences or tokensidentify compound forms such as kar­û­bar (word­and­load) “affaires”based on the [Ahmadi, 2020b]’s approach using a morphological analyzer and alexicon

Example

>>> from klpt.preprocess import Preprocess >>> preprocessor = Preprocess("Sorani", "Arabic", numeral="Latin") >>> preprocessor.normalize("لە ســـا,ەکانی ١٩٥٠دا") لە سا,ەکانی 1950دا>>> preprocessor.standardize("راستە لەو وو1تەدا") ڕاستە لەو و1تەدا

>>> from klpt.transliterator import Transliterate >>> transliterator = Transliterate("Kurmanji", "Latin", target_script="Arabic") >>> transliterator.transliterate("rojhilata navîn") 'رۆژھلاتا ناڤین'

>>> from klpt.stem import Stem >>> stemmer = Stem("Sorani", "Arabic") >>> stemmer.check_spelling("سوتاندبووت") False >>> stemmer.correct_spelling("سوتاندبووت") ('سووتاندبووت', 'سووتاندت', 'سووتاندن', 'سووتاند')>>> stemmer.stem("سووتاندبووت") (,'سووت')>>> stemmer.analyze("دیتبامن") {'pos': 'verb', 'is': 'past_intransitive', 'stem': 'دی', ‘verb_stem': 'دیت', 'terminal_suffix': 'بامن'}

>>> from klpt.tokenize import Tokenize # Tokenize module >>> tokenizer = Tokenize("Kurmanji", "Latin") >>> tokenizer.word_tokenize("endamên encûmena wezîrên") ['▁endam▁ên', '▁encûmen▁a', '▁wezîr▁ên']

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 16 / 21

Page 17: KLPT – Kurdish Language Processing Toolkit

Table of Contents

1 Introduction

2 Kurdish LanguageGeneral descriptionCurrent state of Kurdish language processing (KLP)What is wrong?

3 Kurdish Language Processing Toolkit (KLPT)KLPT StructurePreprocessTransliterateStemTokenize

4 Conclusion

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 17 / 21

Page 18: KLPT – Kurdish Language Processing Toolkit

Conclusion

Lessons learned:▶ release your project under an open source license→ essential to ensure gradual

but efficient progress in resource and technology development for a less­resourcedlanguage

▶ community­driven initiatives: bring together users, developers, researchers,language activists and policy makers

⋆ Vejîn Books (https://books.vejin.net/en)⋆ Vejîn Dictionaries (https://lex.vejin.net/en)

▶ raise awareness by promoting good practices in content creation on the Web,particularly collaboratively­curated resources such as Wiktionary5 and Wikipedia6

▶ every single user is a contributor tooFuture directions:

▶ promote the usage of KLPT in the Kurdish communities▶ create a community of developers and linguists for KLP▶ extend the current version of KLPT to include further advanced tasks

5https://en.wiktionary.org6https://www.wikipedia.org/

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 18 / 21

Page 19: KLPT – Kurdish Language Processing Toolkit

Join KLPT

*https://github.com/sinaahmadi/klpt

* Image credit: Milad Ghaderpanah

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 19 / 21

Page 20: KLPT – Kurdish Language Processing Toolkit

References

Sardar Jaf, Allan Ramsay (2014)

Stemmer and a POS tagger for Sorani Kurdish.

6th International Conference on Corpus Linguistics ­ Spain.

Shahin Salavati and Sina Ahmadi (2018)

Building a Lemmatizer and a Spell­checker for Sorani Kurdish.

arXiv preprint arXiv:1809.10763.

Mustafa, Arazo M., and Tarik A. Rashid. (2018)

Kurdish stemmer pre­processing steps for improving information

retrieval

Journal of Information Science, 44.1: 15­27.

Saeed, A. M., Rashid, T. A., Mustafa, A. M., Agha, R. A. A. R.,

Shamsaldin, A. S., & Al­Salihi, N. K. (2018)

An evaluation of Reber stemmer with longest match stemmer

technique in Kurdish Sorani text classification

Iran Journal of Computer Science, 1(2), 99­107.

Hawezi, R. S., Azeez, M. Y., & Qadir, A. A. (2019)

Spell checking algorithm for agglutinative languages Central

Kurdish as an example

International Engineering Conference (IEC)(pp. 142­146). IEEE.

Sina Ahmadi (2019)

A Rule­based Kurdish Text Transliteration System

Asian and Low­Resource Language Information Processing

(TALLIP) 18(2):18:1–18:8.

Sina Ahmadi (2020)

A Tokenization System for the Kurdish Language

Proceedings of the Seventh Workshop on NLP for Similar

Languages, Varieties and Dialects (VarDial 2020).

Sina Ahmadi (2020)

A Lemmatization System for Sorani Kurdish

Under review.

Sina Ahmadi (2020)

Building a Corpus for the Zaza–Gorani Language Family

Proceedings of the Seventh Workshop on NLP for Similar

Languages, Varieties and Dialects (VarDial 2020).

Sina Ahmadi (2020)

Hunspell for Sorani Kurdish Spell­checking and Morphological

Analysis.

Under review.

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 20 / 21

Page 21: KLPT – Kurdish Language Processing Toolkit

Walther, G., & Sagot, B. (2010)

Developing a large­scale lexicon for a less­resourced language:

General methodology and preliminary experiments on Sorani

Kurdish.

7th SaLTMiL Workshop on Creation and use of basic lexical

resources for less­resourced languages (LREC 2010 Workshop).

Géraldine Walther, Benoît Sagot, and Karën Fort. (2010)

Fast development of basic NLP tools: Towards a lexicon and a POS

tagger for Kurmanji Kurdish

International conference on lexis and grammar.

Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 21 / 21