Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
KLPT – Kurdish Language Processing Toolkit
Sina Ahmadi
Insight Centre for Data AnalyticsNational University of Ireland Galway
https://sinaahmadi.github.io
The 2nd Workshop for Natural Language Processing Open Source Software (NLPOSS)EMNLP 2020
November 2020
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 1 / 21
Table of Contents
1 Introduction
2 Kurdish LanguageGeneral descriptionCurrent state of Kurdish language processing (KLP)What is wrong?
3 Kurdish Language Processing Toolkit (KLPT)KLPT StructurePreprocessTransliterateStemTokenize
4 Conclusion
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 2 / 21
Introduction7,117 languages are spoken in the world1a big proportion of these languages are endangered, minority or lessresourcedrecent focus on applying languageindependent approaches to various tasks innatural language processing (NLP) and computational linguistics using artificialintelligencelanguagespecific tools are still essential to process a language in a viable way
1Source: https://www.ethnologue.com/guides/how-many-languages
* Image source: https://ruder.io/unsupervised-cross-lingual-learning
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 3 / 21
Table of Contents
1 Introduction
2 Kurdish LanguageGeneral descriptionCurrent state of Kurdish language processing (KLP)What is wrong?
3 Kurdish Language Processing Toolkit (KLPT)KLPT StructurePreprocessTransliterateStemTokenize
4 Conclusion
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 4 / 21
Kurdish Language
an IndoEuropean languagespoken by 2030 million speakersspoken in many dialects and subdialects (dialects or languages?)written in many scripts, among which the Latinbased and Arabicbased ones arestill widely in use
*
INDOEUROPEAN
WESTERN IRANIAN
NORTH
Median
Parthian
Balochi Kurdish (kur)
Laki (lki) Southern (sdh) Central (ckb)(Sorani)
Northern (kmr)(Kurmanji)
ZazaGorani
Gorani (hac)Zazaki (zza)Shabaki (sdb)
Source: https://www.britannica.com/topic/Kurd
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 5 / 21
using more than one script for a language,not only scatters readers but also createsfurther challenges in text processingwritten in various orthographies followingdifferent conventions
▶ di sala 2020’an | 2020an | 2020an de“in the year 2020”
▶ hêviya, hêvîya or hêvî ya “hope of”?▶ ٠١٢٣٤٥٦٧٨٩, ۰۱۲۳۴۵۶۷۸۹ or
0123456789?although Kurdish orthographies arephonemic, there is not always aonetoone relation between graphemes,particularly due to:
▶ doubleusage characters: ی for î/y and وfor u/w
▶ variations in some orthographies such asl, ll or ł for [ì]
▶ vowel i has no equivalent in theArabicbased orthography
������� وەر�� ����ات، ��ود�� زا������ ��وو��� ����و ��ر���و ��دار، ��وەڕ، د�� و ���� ���� ��������
����ووژ ���و�� ��رە�� ������ ������ �� ��ول ��ژ��رو�� دەوران ����
Laki (Arabic script)
�� ڕا����ا ��م ��رە����ا�� ��ر �� �������ی ����������رد���ن و �������� ڕا��دوون
Sorani (Arabic script)
Ji ber barîna berfê li bajarê Wan û navçeyaTetwan a Bedlîsê dîmenên ciwan derketinholê.
Kurmanji (Latin script)
Southern Kurdish (Arabic script)
وه زاره �� �� و���� و��رو��ر�� ������ ل �� ر��� ��رد�����ل دۆر ������ دا�� �� �� ر�� ب �� ��� ��� �� ������ رۆ������ ك
ده ر��
Kurmanji (Arabic script)
Bergirî lem bwareda her le yekemîn rojekanîdamezrandinî komarî Turkyawe hate gořê.
Sorani (Latin script)
[ckb-ar][kmr-latn]
[lki-ar]
[ckb-latn]
[sdh-ar]
[kmr-ar]
Kurdish
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 6 / 21
Current state of Kurdish language processing (KLP)
the earliest works in the field of KLP date back to 2009thus far, a total number of 53 publications are published in a field directly relatedto KLP (as of August 2020)two opensource volunteerbased projects:
▶ Kurdish Language Processing Project (KLPP2) in 2012▶ Kurdish Basic Language Resource Kit (KurdishBLARK3) in 2014
a few number of nonscientific contributions
OpensourceDoes the paper provide the discussed resource or tool under an opensource license?
ApplicabilityDoes the paper, implicitly or explicitly, propose an approach or methodology that canbe applied to solve the same problem in the other dialects of Kurdish?
2http://klpp.github.io/3https://kurdishblark.github.io/
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 7 / 21
Current state of KLP
2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
0
2
4
6
8
10
12
Num
bero
fpub
licat
ions
total applicable open-source
Figure 2: Number of scientific publications directly related to Kurdish language processing per year
putational linguistics, we reviewed the scientificpublications that directly address an issue in thosefields. A total number of 53 publications are col-lected from the widely-used academic databasesand search engines such as Google Scholar5, andthen classified based on their discussed sub-fieldswhich are illustrated in Figure 1. The Kurdishdialects are not evenly discussed in the previousstudies, with Sorani making up a predominant pro-portion of almost 90%. Although a smaller pro-portion represents the Kurmanji dialect, no publi-cation is found with respect to processing of theSouthern Kurdish or Laki dialects. Regarding theresearch focus of the previous works, a range ofNLP sub-fields has been addressed, particularly intext mining, morphological and syntactic analysisand, creation of lexical resources. We exception-ally included optical character recognition as it isof importance for converting printed material toelectronic forms (Ahmadi et al., 2019). The fulllist of the surveyed papers can be found in Ap-pendix A.2.
More importantly, we analyze previous publica-tions from the following two perspectives:
• Open-source: Does the paper provide the dis-cussed resource or tool under an open-sourcelicense? To this end, we verified the con-tent of the papers and also, checked the Web,particularly major distributed version controlsystems such as GitHub6, GitLab7 and Bit-Bucket8.
5https://scholar.google.com
6https://github.com
7https://gitlab.com
8https://bitbucket.org
• Applicability: Does the paper, implicitly orexplicitly, propose an approach or method-ology that can be applied to solve the sameproblem in the other dialects of Kurdish? Forthe choice of the word, we were inspiredby (Årdal et al., 2011) where the possibil-ity of applying common practices of softwaredevelopment for drug discovery are investi-gated. For instance, (Ahmadi et al., 2019)is deemed an applicable contribution wherelexicographical resources can be created forother dialects. On the other hand, (Ahmadi,2019) is not applicable to other dialects dueto its ad-hoc solution for transliterating So-rani texts according to its phonological andphonetic rules.
Figure 2 provides the number of previous pub-lications in the Kurdish language processing fieldper year, and specifies their open-source status andtheir applicability. Although most of these pub-lications are applicable to other dialects, only 18out of 53 of them provide their resources or toolsunder an open-source license. Among the open-source ones, 11 are outcomes of volunteeringprojects, KLPP and Kurdish-BLARK. Given thesmall number of non-scientific contributions, wedid not include them in this survey. A few notableexamples of such contributions are Kurdînûs9, Ve-jin Dictionaries10 and VejinBooks11 which mostlyfocus on Sorani Kurdish and script conversiontasks.
9https://github.com/aso-mehmudi/
kurdinus
10https://lex.vejinbooks.com
11https://books.vejin.net
9.25 %
Dialectology
22.22 %
Information retrieval and Text mining
18.51 %
Lexical resources
3.70 %Machine Translation
12.96 %
Speech recognition18.8 %
Morphological and syntactic analysis
1.85 %
Other
11.11 %
Optical character recognition
3.70 % Sign language recognition
(a) per sub-field in NLP
Sorani
Kurmanji
Sorani and Kurmanji
Sorani, Kurmanji and Gorani
78.18%
7.27%
10.90%
3.63%
(b) per dialect
Figure 1: Proportion of publications related to Kurdish language processing
verbs, Kurdish has a few number of around 300single-word verbs (Walther and Sagot, 2010), e.g.kirdin/kirin “to do”, which are inflected based onperson (1,2,3, SG, PL), tense (past, present, future),aspect (indefinite, perfect, progressive, imperfec-tive) and mood (indicative, subjunctive, condi-tional). Unlike Kurmanji, Sorani Kurdish doesnot have future tense and uses adverbs for thispurpose. However, Kurdish extensively takes useof compound constructions for creating new verbforms, particularly with (Noun + Verb), (Adjective+ Verb) and (Preposition + Verb) forms (Traida,2007). For instance, siław ‘hi (n)’, pîroz ‘holy’(adj) and heł (verbal particle denoting ‘up’) withthe single-word verb kirdin can respectively formcompound verbs siław kirdin “to greet”, pîrozkirdin “to congratulate” and heł kirdin “to turnon”. The stringing characteristic of the Arabic-based script of Kurdish further adds to this mor-phological complexity in such a way that severalword forms may be concatenated together (Ah-madi, 2020b).
Regarding syntax, Kurdish has a sub-ject–object–verb word order and is a null-subject(or pro-drop) language. The presence of gram-matical markers for nominative and obliquecases varies within dialects and subdialects. Forinstance, in the Sorani subdialects of Sulay-maniyah and Erbil, respectively categorized asSouthern Sorani and Northern Sorani by (Matras,2017), the oblique case is marked differently.Another particularity of the Kurdish languageis its morphosyntactic alignment in the pasttense of transitive verbs. In such tenses, anergative–absolutive alignment occurs where thesubject of intransitive verbs behaves like thepatient of the transitive verb in the past (Haig,
1998; Karimi, 2014). Unlike Kurmanji whichuses oblique cases for this purpose, Sorani onlyuses different pronominal markers to specify erga-tivity, therefore it is called split-ergative (Esmailiand Salavati, 2013). Except the past tenses, anominative-accusative alignment is observed inother tenses.
Not being equally documented and used, Kur-dish dialects have different levels of linguisticresourcefulness. In comparison to Sorani andKurmanji which are widely used by the mediaand press, Southern Kurdish and Laki are under-documented and lack basic language resourcessuch as electronic dictionaries and corpora (Fat-tah, 2000; Ahmadi et al., 2019; Ahmadi, 2020c).
3 Current State of Kurdish LanguageProcessing
The earliest works in the field of Kurdish languageprocessing date back to 2009. Our literature re-view indicates that some of these contributionsfail to provide open-source solutions. Despite fi-nancial and scientific constraints in Kurdish lan-guage processing, the Kurdish Language Process-ing Project (KLPP) (Esmaili et al., 2013) in 20123
and Kurdish Basic Language Resource Kit (Kur-dish BLARK) (Hassani, 2018) in 20144 have suc-ceeded to promote an open-source vision basedon research volunteering within the Kurdish sci-entific communities. However, the outcomes ofthese projects are mostly released in an unorga-nized manner for individual tasks.
In order to understand the current state of theKurdish language in the realm of NLP and com-
3http://klpp.github.io
4https://kurdishblark.github.io
Figure: Number of scientific publications directly related to KLP per year and field
most of these publications are applicableonly 18 provide their resources or tools under an opensource licenseamong the opensource ones, 11 are outcomes KLPP and KurdishBLARKSorani makes up a predominant proportion of almost 90% of publicationsno publication addresses the processing of Southern Kurdish or LakiKurdish still lacks basic language processing tools such as partofspeech tagger,stemmer, lemmatizer and so on
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 8 / 21
Current state of KLP: What is wrong?
Many projects overlap significantly, yetnone of them provide a solution under anyopensource license
▶ Stemming is addressed at least five times[Jaff, 2014, Salavati and Ahmadi, 2018,Mustafa and Rashid, 2018,Saeed et al., 2018, Hawezi et al., 2019]
*Some are hardly integrable or interoperable
▶ A largescale morphological lexicon and a partofspeech tagger for Kurdish withinthe Alexina framework [Walther and Sagot, 2010, Walther et al., 2010]
Released in an unorganized manner for individual tasks▶ Example: a transliteration tool for Kurdish [Ahmadi, 2019a]
Further progress is hindered in the fieldKurdish is still a lessresourced language
* Image source: https://www.aic.cuhk.edu.hk/web8/Reinventingthewheel.htm
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 9 / 21
Table of Contents
1 Introduction
2 Kurdish LanguageGeneral descriptionCurrent state of Kurdish language processing (KLP)What is wrong?
3 Kurdish Language Processing Toolkit (KLPT)KLPT StructurePreprocessTransliterateStemTokenize
4 Conclusion
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 10 / 21
Kurdish Language Processing Toolkit (KLPT)
a basic but extendable language processing toolkitan effort to standardize Kurdish language with all its dialects and scriptsimplemented in Pythoninspired by the functionality of relevant NLP toolkits, e.g. NLTK and spaCyno external NLP library is used in this toolkitcomposed of core modules for Sorani andKurmanji for the following tasks:
▶ text preprocessing▶ stemming▶ lemmatization▶ spelling error detection and correction▶ transliteration▶ morphological analyzer and generator▶ tokenization
it is opensource!→ https://github.com/sinaahmadi/klpt
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 11 / 21
KLPT: Structure
no hardcodingeasily extendable for otherdialects and taskseach package corresponds to a setof related taskscomposed of four core NLPpackages as follows:
▶ preprocess▶ transliterate
([Ahmadi, 2019a])▶ stem ([Ahmadi, 2020e])▶ tokenize ([Ahmadi, 2020b,
Ahmadi, 2020c])
4 KLPT Architecture
KLPT is implemented in Python and is composedof four core modules with specific tasks. Althoughwe were inspired by the functionality of relevantNLP toolkits, particularly NLTK and spaCy, noexternal library is used in this toolkit. Regardingthe toolkit design, we followed the rules of sci-entific software development suggested by (Prlicand Procter, 2012) along with common practicesin Python programming language. Figure 3 pro-vides the structure of the toolkit. In order to fa-cilitate the integration of variations specific to di-alects and scripts and more importantly, to avoidhard-coding, required files are provided in thedata folder. For instance, the data requiredfor the preprocess module is imported frompreprocess.json. In addition, third-partyprograms can be provided in bin. test anddocs respectively contain test cases and projectdocumentation. Regarding the latter, we useSphinx documentation generator12.
It is worth noting that each module within theklpt package has been previously studied andevaluated separately. Our goal is to introduce thefunctionality of the modules within the toolkit inthis section.
4.1 PreprocessMany keyboard layouts are specifically designedfor Kurdish where different character encoding areassigned to visually-similar graphemes. In addi-tion to the usage of non-Kurdish keyboards, suchas Arabic, Turkish and Persian keyboards, suchdiversity creates abnormality across texts in Kur-dish writing. For instance, the grapheme � (î/y),can be represented as � (U+064A), � (U+0649),�� (U+FEF2), � (U+FEF1) and � (U+06CC),among which only the latter should be used in theArabic-based script of Kurdish. Moreover, variouswriting conventions are used for each dialect andscript. For instance, in Kurmanji, when dates areaffixed with a morpheme, the suffix may be sepa-rated by ’, - or without any marker as in 2020’an,2020-an and 2020an.
To remedy such issues in an automatic andstructured manner, the preprocess moduleprovides two main functions: normalize()
for normalizing encoding abnormalities by unify-ing characters in such a way that only one spe-cific encoding is used for each grapheme and,
12https://www.sphinx-doc.org
standardize() which applies orthographicconventions to the text. For example, when hêvî‘hope’ is suffixed with the vowel a (Izafa, mean-ing ‘of’), a semi-vowel y appears between the twovowels and is usually written as hêviya or hêvîya‘hope of’. As the latter form is considered less am-biguous, this function converts the first form ac-cordingly. Although defining a universal orthogra-phy for Kurdish is out of scope of our project, webelieve that writing conventions and orthographiesshould be addressed to some extent. Therefore, inthis initial version, we follow the writing conven-tions proposed by (Aydogan, 2012) for Kurmanjiand (Hashemi, 2016) for Sorani.
In addition to these two functions,unify_numeral() is provided to convertnumerals, namely in Farsi (����������),Eastern Arabic (����������) and Western Ara-bic (0123456789). Although we set the latteras default for all scripts, users will have the
klptbin
data
ckb-morphemes.json
ckb_Hunspell.aff
ckb_Hunspell.dic
default-options.json
kmr-morphemes.json
kmr_Hunspell.aff
kmr_Hunspell.dic
lexicon_ckb_arab.json
lexicon_ckb_latn.json
lexicon_kmr_latn.json
preprocess.json
stopwords_kmr.txt
stopwords_ckb.txt
tokenize.json
transliterate.json
docs
klpt
__init__.py
configuration.py
preprocess.py
stem.py
tokenize.py
transliterate.py
test
setup.py
requirements.txt
Figure 3: Structure of KLPTSina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 12 / 21
KLPT Packages: Preprocess
Goal: Handle diversities in scripts and orthographies in an automatic and formalizedway
1 normalize(): normalize text by unifying character encodings▶ Example: the grapheme ی (U+06CC, î/y), may be represented as
ي (U+064A), ى (U+0649), ي (U+FEF2) or ي (U+FEF1)2 standardize(): standardize scripts and orthographies by using writing
conventions based on dialects and scripts3 unify_numeral(): convert Farsi, Eastern and Western Arabic numerals
Example>>> from klpt.preprocess import Preprocess >>> preprocessor = Preprocess("Sorani", "Arabic", numeral="Latin") >>> preprocessor.normalize("لە ســـا,ەکانی ١٩٥٠دا") لە سا,ەکانی 1950دا>>> preprocessor.standardize("راستە لەو وو1تەدا") ڕاستە لەو و1تەدا
>>> from klpt.transliterator import Transliterate >>> transliterator = Transliterate("Kurmanji", "Latin", target_script="Arabic") >>> transliterator.transliterate("rojhilata navîn") 'رۆژھلاتا ناڤین'
>>> from klpt.stem import Stem >>> stemmer = Stem("Sorani", "Arabic") >>> stemmer.check_spelling("سوتاندبووت") False >>> stemmer.correct_spelling("سوتاندبووت") ('سووتاندبووت', 'سووتاندت', 'سووتاندن', 'سووتاند')>>> stemmer.stem("سووتاندبووت") (,'سووت')>>> stemmer.analyze("دیتبامن") {'pos': 'verb', 'is': 'past_intransitive', 'stem': 'دی', ‘verb_stem': 'دیت', 'terminal_suffix': 'بامن'}
>>> from klpt.tokenize import Tokenize # Tokenize module >>> tokenizer = Tokenize("Kurmanji", "Latin") >>> tokenizer.word_tokenize("endamên encûmena wezîrên") ['▁endam▁ên', '▁encûmen▁a', '▁wezîr▁ên']
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 13 / 21
KLPT Packages: Transliterate
transliterating the Arabicbased and Latinbased scripts of Kurdish to oneanother, e.g. برا → bira ‘brother’based on the rulebased approach of [Ahmadi, 2019a] which
▶ detects double usage characters▶ predicts the presence of the missing i, a.k.a Bizroke▶ finds the syllabic pattern of a given word based on Kurdish phonetics
beneficial to many NLP tasks such as namedentity recognition
Example
>>> from klpt.preprocess import Preprocess >>> preprocessor = Preprocess("Sorani", "Arabic", numeral="Latin") >>> preprocessor.normalize("لە ســـا,ەکانی ١٩٥٠دا") لە سا,ەکانی 1950دا>>> preprocessor.standardize("راستە لەو وو1تەدا") ڕاستە لەو و1تەدا
>>> from klpt.transliterator import Transliterate >>> transliterator = Transliterate("Kurmanji", "Latin", target_script="Arabic") >>> transliterator.transliterate("rojhilata navîn") 'رۆژھلاتا ناڤین'
>>> from klpt.stem import Stem >>> stemmer = Stem("Sorani", "Arabic") >>> stemmer.check_spelling("سوتاندبووت") False >>> stemmer.correct_spelling("سوتاندبووت") ('سووتاندبووت', 'سووتاندت', 'سووتاندن', 'سووتاند')>>> stemmer.stem("سووتاندبووت") (,'سووت')>>> stemmer.analyze("دیتبامن") {'pos': 'verb', 'is': 'past_intransitive', 'stem': 'دی', ‘verb_stem': 'دیت', 'terminal_suffix': 'بامن'}
>>> from klpt.tokenize import Tokenize # Tokenize module >>> tokenizer = Tokenize("Kurmanji", "Latin") >>> tokenizer.word_tokenize("endamên encûmena wezîrên") ['▁endam▁ên', '▁encûmen▁a', '▁wezîr▁ên']
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 14 / 21
KLPT Packages: Stem
an annotated lexicon + morphological rules using Hunspell4 for:▶ spelling error detection and correction→ also usable in text editors such as
LibreOffice▶ morphological analyzer and generator▶ stemmer
a rulebased lemmatization systembased on [Ahmadi, 2020c, Ahmadi, 2020e]
Example
>>> from klpt.preprocess import Preprocess >>> preprocessor = Preprocess("Sorani", "Arabic", numeral="Latin") >>> preprocessor.normalize("لە ســـا,ەکانی ١٩٥٠دا") لە سا,ەکانی 1950دا>>> preprocessor.standardize("راستە لەو وو1تەدا") ڕاستە لەو و1تەدا
>>> from klpt.transliterator import Transliterate >>> transliterator = Transliterate("Kurmanji", "Latin", target_script="Arabic") >>> transliterator.transliterate("rojhilata navîn") 'رۆژھلاتا ناڤین'
>>> from klpt.stem import Stem >>> stemmer = Stem("Sorani", "Arabic") >>> stemmer.check_spelling("سوتاندبووت") False >>> stemmer.correct_spelling("سوتاندبووت") ('سووتاندبووت', 'سووتاندت', 'سووتاندن', 'سووتاند')>>> stemmer.stem("سووتاندبووت") (,'سووت')>>> stemmer.analyze("دیتبامن") {'pos': 'verb', 'is': 'past_intransitive', 'stem': 'دی', ‘verb_stem': 'دیت', 'terminal_suffix': 'بامن'}
>>> from klpt.tokenize import Tokenize # Tokenize module >>> tokenizer = Tokenize("Kurmanji", "Latin") >>> tokenizer.word_tokenize("endamên encûmena wezîrên") ['▁endam▁ên', '▁encûmen▁a', '▁wezîr▁ên']
4http://hunspell.github.io
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 15 / 21
KLPT Packages: Tokenize
detect word and sentence boundaries→ a non trivial task:▶ orthographic inconsistencies, e.g. how compounds words are separated?▶ excessive concatenation, e.g. لەوێشدایە (lewêşdaye) “(it) is also there” is written as a
word but is composed of five tokens le, wê, ş, da, ye
split a text into sentences or tokensidentify compound forms such as karûbar (wordandload) “affaires”based on the [Ahmadi, 2020b]’s approach using a morphological analyzer and alexicon
Example
>>> from klpt.preprocess import Preprocess >>> preprocessor = Preprocess("Sorani", "Arabic", numeral="Latin") >>> preprocessor.normalize("لە ســـا,ەکانی ١٩٥٠دا") لە سا,ەکانی 1950دا>>> preprocessor.standardize("راستە لەو وو1تەدا") ڕاستە لەو و1تەدا
>>> from klpt.transliterator import Transliterate >>> transliterator = Transliterate("Kurmanji", "Latin", target_script="Arabic") >>> transliterator.transliterate("rojhilata navîn") 'رۆژھلاتا ناڤین'
>>> from klpt.stem import Stem >>> stemmer = Stem("Sorani", "Arabic") >>> stemmer.check_spelling("سوتاندبووت") False >>> stemmer.correct_spelling("سوتاندبووت") ('سووتاندبووت', 'سووتاندت', 'سووتاندن', 'سووتاند')>>> stemmer.stem("سووتاندبووت") (,'سووت')>>> stemmer.analyze("دیتبامن") {'pos': 'verb', 'is': 'past_intransitive', 'stem': 'دی', ‘verb_stem': 'دیت', 'terminal_suffix': 'بامن'}
>>> from klpt.tokenize import Tokenize # Tokenize module >>> tokenizer = Tokenize("Kurmanji", "Latin") >>> tokenizer.word_tokenize("endamên encûmena wezîrên") ['▁endam▁ên', '▁encûmen▁a', '▁wezîr▁ên']
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 16 / 21
Table of Contents
1 Introduction
2 Kurdish LanguageGeneral descriptionCurrent state of Kurdish language processing (KLP)What is wrong?
3 Kurdish Language Processing Toolkit (KLPT)KLPT StructurePreprocessTransliterateStemTokenize
4 Conclusion
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 17 / 21
Conclusion
Lessons learned:▶ release your project under an open source license→ essential to ensure gradual
but efficient progress in resource and technology development for a lessresourcedlanguage
▶ communitydriven initiatives: bring together users, developers, researchers,language activists and policy makers
⋆ Vejîn Books (https://books.vejin.net/en)⋆ Vejîn Dictionaries (https://lex.vejin.net/en)
▶ raise awareness by promoting good practices in content creation on the Web,particularly collaborativelycurated resources such as Wiktionary5 and Wikipedia6
▶ every single user is a contributor tooFuture directions:
▶ promote the usage of KLPT in the Kurdish communities▶ create a community of developers and linguists for KLP▶ extend the current version of KLPT to include further advanced tasks
5https://en.wiktionary.org6https://www.wikipedia.org/
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 18 / 21
Join KLPT
*https://github.com/sinaahmadi/klpt
* Image credit: Milad Ghaderpanah
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 19 / 21
References
Sardar Jaf, Allan Ramsay (2014)
Stemmer and a POS tagger for Sorani Kurdish.
6th International Conference on Corpus Linguistics Spain.
Shahin Salavati and Sina Ahmadi (2018)
Building a Lemmatizer and a Spellchecker for Sorani Kurdish.
arXiv preprint arXiv:1809.10763.
Mustafa, Arazo M., and Tarik A. Rashid. (2018)
Kurdish stemmer preprocessing steps for improving information
retrieval
Journal of Information Science, 44.1: 1527.
Saeed, A. M., Rashid, T. A., Mustafa, A. M., Agha, R. A. A. R.,
Shamsaldin, A. S., & AlSalihi, N. K. (2018)
An evaluation of Reber stemmer with longest match stemmer
technique in Kurdish Sorani text classification
Iran Journal of Computer Science, 1(2), 99107.
Hawezi, R. S., Azeez, M. Y., & Qadir, A. A. (2019)
Spell checking algorithm for agglutinative languages Central
Kurdish as an example
International Engineering Conference (IEC)(pp. 142146). IEEE.
Sina Ahmadi (2019)
A Rulebased Kurdish Text Transliteration System
Asian and LowResource Language Information Processing
(TALLIP) 18(2):18:1–18:8.
Sina Ahmadi (2020)
A Tokenization System for the Kurdish Language
Proceedings of the Seventh Workshop on NLP for Similar
Languages, Varieties and Dialects (VarDial 2020).
Sina Ahmadi (2020)
A Lemmatization System for Sorani Kurdish
Under review.
Sina Ahmadi (2020)
Building a Corpus for the Zaza–Gorani Language Family
Proceedings of the Seventh Workshop on NLP for Similar
Languages, Varieties and Dialects (VarDial 2020).
Sina Ahmadi (2020)
Hunspell for Sorani Kurdish Spellchecking and Morphological
Analysis.
Under review.
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 20 / 21
Walther, G., & Sagot, B. (2010)
Developing a largescale lexicon for a lessresourced language:
General methodology and preliminary experiments on Sorani
Kurdish.
7th SaLTMiL Workshop on Creation and use of basic lexical
resources for lessresourced languages (LREC 2010 Workshop).
Géraldine Walther, Benoît Sagot, and Karën Fort. (2010)
Fast development of basic NLP tools: Towards a lexicon and a POS
tagger for Kurmanji Kurdish
International conference on lexis and grammar.
Sina Ahmadi (NUIG) Kurdish Language Processing Toolkit November 2020 21 / 21