
Master Thesis: Developing a Cross-Lingual Named Entity Recognition Model

Jowita Podolak, Philine Zeinert
[email protected], [email protected]

June 1, 2020

Course Code: KISPECI1SE

Page 2: Master Thesis: Developing a Cross-Lingual Named Entity … · 2020-06-22 · To build a Cross-lingual Named Entity Recognition system (NER), we first need to transfer successfully

Abstract

To build a cross-lingual Named Entity Recognition (NER) system, we first need to transfer successfully between different languages. Hence, we need a language-agnostic model - one that does not learn language-specific features. From previous work, we know that multilingual Transformer models are not truly language-agnostic. This paper explores to what extent language-specific features are incorporated in Transformer weights. We group languages by the absence and presence of certain features, and we arrange language pairs by varying levels of similarity between the fine-tuning and evaluation language. In our experiments, we leverage the vast WikiANN datasets to fine-tune mBERT and XLM-RoBERTa on the NER task. We find four features which, if present in training, improve results for evaluation languages containing them, indicating that those features are indeed learnt. Thus, we single out sources of language-dependence in the Transformer architecture. Our thesis presents existing cross-lingual dependencies whose removal can improve multilingual language models, particularly for NER.

Acknowledgement

We want to express our special gratitude to our supervisor, Leon Derczynski. This thesis would not have been possible without his insightful conversations and guidance throughout the whole project. Moreover, thank you for always being available and engaged, and for sparking our interest in NLP.

Contents

1 Introduction
  1.1 Motivation
  1.2 Structure of the thesis
  1.3 Research question

2 Named Entity Recognition
  2.1 Traditional techniques for building NER systems
    2.1.1 Expert systems
    2.1.2 Deep Learning
  2.2 NER for resource-poor languages
    2.2.1 Annotation Projection
    2.2.2 Direct Model Transfer

3 Cross-lingual Natural Language Processing
  3.1 Word representations
    3.1.1 Static word embeddings
    3.1.2 Dynamic word embeddings
  3.2 Cross-lingual learning
    3.2.1 Aligned static word embeddings and cross-lingual language models
    3.2.2 Cross-lingual versus multilingual models
    3.2.3 Language-agnostic system
  3.3 Transfer learning
    3.3.1 Fine-tuning
  3.4 Cross-lingual NER

4 Transformer Models
  4.1 Introduction
    4.1.1 Delimitation of Transformers from RNN and CNN models
    4.1.2 The learning process in Transformer models
    4.1.3 Transformers for multilingual NLP tasks
  4.2 Multilingual Transformer models
    4.2.1 mBERT
    4.2.2 XLM-RoBERTa
    4.2.3 Multilingual Transformers for NER

5 Previous research and positioning of our work
  5.1 Overview
  5.2 Generalisation between languages in cross-lingual transfer
  5.3 Cross-lingual NER in Transformer models

6 Experimental setup
  6.1 Typological properties of Indo-European languages
    6.1.1 Twelve features characterising the European Sprachbund
  6.2 Dataset
    6.2.1 WikiANN
    6.2.2 Dataset preparation and cross-validation
  6.3 Model and tools
    6.3.1 Choice of parameters
    6.3.2 Tools
    6.3.3 NERModel
    6.3.4 Evaluation
  6.4 Exploratory work
  6.5 Hypotheses
  6.6 Design of experiments
    6.6.1 Typological similarity between languages experiment
    6.6.2 Impact of individual typological features experiment

7 Results and analysis
  7.1 Language-pairs experiments
    7.1.1 Monolingual results
    7.1.2 Cross-lingual results
    7.1.3 Correlation between shared ES features and the transfer quality
    7.1.4 Case Study: Tatar
    7.1.5 Case Study: Different scripts
  7.2 ES Features experiments
    7.2.1 Results for each feature
    7.2.2 Repeating language pair experiment with 4 identified features
  7.3 Differences across multilingual transformer models
    7.3.1 Monolingual results
    7.3.2 Language-pairs experiment
    7.3.3 ES features results
  7.4 Summary of results

8 Conclusion
  8.1 Summary
  8.2 Contribution

9 Addendum

A Appendices
  A.1 Original dataset sizes (in lines)
  A.2 Example: Preprocessed NER dataset
  A.3 Languages with script and dataset size
  A.4 Language vectors

List of Figures

1 Training an NER model for resource-poor languages and resource-rich languages
2 Words as vector representations (distributed)
3 Masked language modeling
4 Word embedding systems
5 Static, dynamic, multilingual and cross-lingual word representations
6 Aligning word embeddings
7 Training techniques of cross-lingual word representations
8 Cross-lingual pre-trained systems
9 Traditional learning vs. transfer learning
10 Transfer learning in NER
11 Encoding and decoding in Transformer models
12 Encoders and decoders in Transformer models
13 Self-Attention in Transformer models
14 Pre-training and fine-tuning procedure in BERT
15 Input format in BERT
16 9 characteristic features of SAE
17 Experiment design for feature experiments
18 Monolingual results for mBERT and WikiANN dataset sizes (black line)
19 F1 scores and amount of shared ES features
20 Features (F1 scores for Fine-Tuning 1 and Fine-Tuning 2)
21 Features (rel. difference F1 scores Fine-Tuning 1 and Fine-Tuning 2)
22 Monolingual results mBERT and XLM-R
23 Language distance results mBERT and XLM-R
24 Features (F1 scores Fine-Tuning 1 and Fine-Tuning 2) mBERT and XLM-R
25 Features (rel. difference F1 scores Fine-Tuning 1 and Fine-Tuning 2) mBERT and XLM-R

List of Tables

1 Results on Named Entity Recognition for different multilingual models
2 Contribution of linguistic characteristics for choosing the right transfer language
3 POS accuracies when transferring SVO/SOV languages or AN/NA languages. Row = fine-tuning, column = evaluation
4 WikiANN datasets and their length in lines. Cut-off is after Breton.
5 Exemplary results from eval_results.txt
6 F1 scores for CoNLL dataset
7 F1 scores for CoNLL dataset from Pires et al. (2019)
8 F1 scores for WikiANN dataset - large datasets
9 F1 scores for WikiANN dataset evaluated on CoNLL
10 Amount of tested language pairs per distance
11 Chosen k-value for each feature and average distance for each k
12 Language groups for feature 1 - experiment 1 and 2, k = 15
13 F1 scores for WikiANN dataset for CoNLL languages
14 Correlation between the number of shared ES features and the quality of the transfer between language pairs
15 Results Tatar
16 Results different scripts
18 Language pair results with filtered features
19 Number of lines of the original datasets
20 Example preprocessed NER dataset fragment from English (note: sentences are shuffled and do not belong to one article here)
21 Languages and their WikiANN dataset sizes
22 Languages and ES-features

List of Abbreviations

Avg - Average
BiLSTM - Bidirectional Neural Network with Long-Short-Term-Memory cells
BIO - Beginning-inside-outside tagging
CNN - Convolutional Neural Network
CRF - Conditional Random Field
DMT - Direct Model Transfer
ES - European Sprachbund
F-T - Fine-Tuning
mBERT - multilingual BERT
LOC - Location
MLM - Masked Language Model
NER - Named Entity Recognition
NLP - Natural Language Processing
ORG - Organisation
PER - Person
POS - Part of Speech tagging
RNN - Recurrent Neural Network
SAE - Standard Average European
TLM - Translation Language Model
WALS - World Atlas of Language Structures
XLM-R - XLM-RoBERTa
XLU - Cross-Lingual Language Understanding

1 Introduction

1.1 Motivation

Almost half of the most visited websites on the Internet are in a language other than English [13]. Providing a seamless experience in everyone's preferred language would significantly improve the lives of communities around the world. The advantage lies not only in social media platforms, with features like recommendations, suggestions, or the detection of policy-violating content, but also in much more significant issues, even saving human lives.

The variety of languages in the world is its beauty but also its curse - if we do not understand each other, we cannot communicate and, as a result, we cannot solve problems that need our collaboration. And those tend to be the most urgent: global warming, the refugee crisis, pandemics. However, language is still not seen as a priority in emergency responses. As a result, misinformation, mistrust, fear, and panic can spread quickly. For instance, the language barrier was reported to be among the main difficulties faced by humanitarian workers responding to the Ebola crisis in 2014.¹ Cross-lingual systems are critical because they help to break those language barriers.

In our research on cross-lingual systems, we focus on Named Entity Recognition (NER), since named entities are the parts of written documents that carry most of the semantic information. For example, in the sentence "Greta Thunberg is Time Person of the Year for 2019", we can identify two named entities: Greta Thunberg - Person, and Time Person of the Year - Organisation. In a world of data abundance, it is crucial to extract the small units that carry the most meaning [32]. Unfortunately, NER systems are highly dependent on large amounts of annotated data, which are scarce or even completely unavailable for most languages. Developing sophisticated NER models for each language is comparable to building a brand-new application and solving the problem from scratch (including the cost-intensive creation of training data).² Therefore, developing cross-linguality is especially important. [49]

The goal is to apply existing NER models to different languages, especially low-resource ones (with no annotated datasets). One way of achieving cross-linguality is by developing systems that are language-independent. Current state-of-the-art models show significant difficulties in projecting learned entity labels from one language to another, even if the entities appear as very similar words across different languages.

1.2 Structure of the thesis

In this thesis, we start by describing NER systems and the challenging issue of developing them for resource-poor languages. We discuss cross-lingual transfer and identify Transformer models as the state of the art in this field. Looking at previous research, we investigate Transformers' ability to project entity labels from one language to another, focusing on pre-trained multilingual models.

¹https://odihpn.org/magazine/ebola-a-crisis-of-language/
²https://engineering.fb.com/ml-applications/under-the-hood-multilingual-embeddings/

Afterwards, in the experimental setup, we introduce the annotated WikiANN NER dataset and give insight into our chosen language subset and its linguistic properties. We explain our experimental design for testing our hypotheses and use datasets of mainly Indo-European languages to fine-tune mBERT and XLM-RoBERTa on the NER task. The first experiment, language-pairs, checks for language-dependence between languages; the second, features, dives deeper into precisely which features Transformers learn.

We finish the thesis by analysing the results of both experiments, covering aspects like zero-shot transfer, the comparison of monolingual with cross-lingual results, the dependence on WikiANN size and domain, and the impact of features on Transformers. We summarise and discuss our contribution in chapter eight, the Conclusion. The source code for the experiments and other files from this thesis project are publicly available on GitHub.³

1.3 Research question

The thesis addresses the question:

How language-dependent are multilingual Transformer models for cross-lingual Named Entity Recognition?

We investigate experimentally how multilingual Transformer models perform in zero-shot settings for cross-lingual named entity labelling. Based on this question, four more detailed hypotheses, against which our experiments are tested, are defined in the Experimental setup (chapter 6).

³https://github.com/jowitapodolak/finetuning_BERT_NER

2 Named Entity Recognition

Almost every text document contains particular terms that are more informative than the rest of the text. These are the named entities: terms that represent real-world objects like people, places, and organisations. [43]

Automatically detecting named entities in text and labelling them with types is a fundamental information extraction task. Named Entity Recognition is an essential process for many applications, including relation extraction, entity linking, question answering, and text mining. Building a fast NER model is a crucial step towards large-scale automated information extraction and knowledge discovery in the huge amounts of online documents we deal with today. [35]

When talking about NER, there are two vital topics to mention: named entity tags and annotation schemes. Most datasets have at least three tags: PER, LOC, and ORG. People, e.g., politicians, writers, or family names, are labelled with the PER type. LOC stands for a politically or geographically defined location (cities, provinces, countries, international regions, mountains). ORG includes companies, institutions, political parties, etc. Other tagsets are more fine-grained and may include TIME, DATE, WORK_OF_ART, and QUANTITY, among others. A further distinction of entity phrases is described by annotation schemes, including BIO, BIOES, and IOB. They describe whether a word is at the beginning of a whole entity name or at its end; some also describe the middle part. "B" usually stands for "Beginning", "I" for "Inside", and "O" for "Outside". For instance, for the name "Time Person of the Year" from the previous chapter, the BIO scheme yields the following annotations: Time (B-ORG), Person (I-ORG), of (I-ORG), the (I-ORG), Year (I-ORG), as all the words are parts of one named entity.⁴ [40]
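As an illustration (a toy Python sketch, not code from the thesis), the example sentence from the introduction can be written in BIO format as token-tag pairs, and entity spans can be reassembled from the tags:

```python
# BIO-annotated tokens for: "Greta Thunberg is Time Person of the Year for 2019"
sentence = [
    ("Greta", "B-PER"), ("Thunberg", "I-PER"),
    ("is", "O"),
    ("Time", "B-ORG"), ("Person", "I-ORG"), ("of", "I-ORG"),
    ("the", "I-ORG"), ("Year", "I-ORG"),
    ("for", "O"), ("2019", "O"),
]

# Reassemble entity spans: "B-" opens an entity, "I-" continues it, "O" closes.
entities, current = [], []
for token, tag in sentence:
    if tag.startswith("B-"):
        current = [token]
        entities.append((current, tag[2:]))
    elif tag.startswith("I-") and current:
        current.append(token)
    else:
        current = []

print([(" ".join(tokens), label) for tokens, label in entities])
# [('Greta Thunberg', 'PER'), ('Time Person of the Year', 'ORG')]
```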

2.1 Traditional techniques for building NER systems

In the past, NER relied strongly on domain-specific knowledge, with linguistic grammar-based techniques in rule-based systems or with feature engineering. Expert knowledge, however, can be discarded entirely by using big datasets and deep learning. Sometimes, a combination of the two is used.

2.1.1 Expert systems

Expert systems include rule-based systems and feature engineering. Rule-based NER systems rely on hand-crafted rules, emulating the thinking of human experts (and written by them). Due to domain-specificity, non-entity words are rarely labelled as named entities (few false positives). On the other hand, many named entities are not recognised (many false negatives) due to incomplete rules. Thus, such systems exhibit high precision and low recall. [26]

⁴https://spacy.io/api/annotation

In feature engineering, given annotated data, features are carefully designed to represent each training instance. Machine learning algorithms, using standard feature-based classification approaches such as Conditional Random Fields (CRFs) and Hidden Markov Models (HMMs), are then utilised to learn a model. The feature vector representation (also called a word embedding) is an abstraction over text where a word is represented by one or many Boolean, numeric, or nominal values. "Word-level features (e.g., case, morphology, and part-of-speech tag), list lookup features (e.g., Wikipedia gazetteer and DBpedia gazetteer), and document and corpus features (e.g., local syntax and multiple occurrences) have been widely used in various supervised NER systems." [26]
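To make the feature-engineering idea concrete, here is a minimal sketch (assuming the sklearn-crfsuite package; the features and the one-sentence training set are toy illustrations, not the cited systems):

```python
import sklearn_crfsuite

def word_features(sent, i):
    """Hand-crafted, word-level features for token i of a sentence."""
    word = sent[i][0]
    return {
        "word.lower": word.lower(),       # lexical identity
        "word.istitle": word.istitle(),   # case feature
        "word.isdigit": word.isdigit(),   # surface/morphology-like feature
        "prev.istitle": sent[i - 1][0].istitle() if i > 0 else False,
    }

train = [[("Greta", "B-PER"), ("Thunberg", "I-PER"), ("speaks", "O")]]
X = [[word_features(s, i) for i in range(len(s))] for s in train]
y = [[tag for _, tag in s] for s in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))  # [['B-PER', 'I-PER', 'O']] on the training sentence
```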

Systems based on expert knowledge were in demand even after the emergence of deep neural networks because of the lack of training instances: not only are there few training datasets available, but even when a dataset is obtained, named entities are scarce within sentences. A proper noun, the part of speech representing a named entity, is one of the most infrequent parts of speech in text; there is not a single proper noun in the list of most frequent nouns in English text.⁵ Nevertheless, rule-based NER and feature engineering are very time-consuming to compute and design. Moreover, those systems cannot easily be transferred to other domains due to the specific knowledge they encode. [22]

2.1.2 Deep Learning

Deep neural network architectures can release researchers from time-consuming manual feature engineering by learning automatically and discovering hidden (previously unknown) features [48]. Those classification models are trained by showing a neural network large amounts of sentences annotated with named entities. Through this process, they learn how to categorise new examples and then use those categories to make predictions for unseen sentences. Capturing dependencies between far-away words (long-term dependencies) is essential in this task, as many words within a sentence can influence each other, not only the closest neighbours. Two popular deep architectures can capture those long-term dependencies in a word sequence: Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). CNNs consider the current input, while RNNs consider the current input and also previously received inputs. RNNs and their improved version, LSTMs (Long Short-Term Memory networks), allow information to persist.⁶ A further improvement is the bidirectional LSTM (BiLSTM), in which left and right context representations (from a forward and a backward LSTM, respectively) are concatenated. CNNs and RNNs developed using only word embeddings achieved results similar to state-of-the-art CRFs with human-engineered features and knowledge from dictionaries. [48]

⁵https://web.archive.org/web/20111226085859/http://oxforddictionaries.com/words/the-oec-facts-about-the-language
⁶In section 4.1.1 we discuss limitations and improvements of those techniques.

There is no doubt that the introduction of neural architectures has significantly advanced NER. However, the success of these methods is highly dependent on large annotated datasets. Because most of the world's languages are resource-poor, it remains a challenge to apply these models to them.
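To make the BiLSTM tagger described above concrete, here is a minimal PyTorch sketch (vocabulary, sizes, and the 7-tag BIO label set are illustrative assumptions, not the thesis's configuration):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=1000, embedding_dim=64)
# bidirectional=True runs a forward and a backward LSTM and concatenates
# their context vectors, so the output size doubles to 2 * hidden_size.
bilstm = nn.LSTM(input_size=64, hidden_size=128,
                 batch_first=True, bidirectional=True)
tagger = nn.Linear(2 * 128, 7)  # B/I for PER, LOC, ORG plus O = 7 BIO tags

tokens = torch.randint(0, 1000, (1, 6))  # one sentence of 6 token ids
states, _ = bilstm(emb(tokens))          # (1, 6, 256): left + right context
print(tagger(states).shape)              # torch.Size([1, 6, 7]): tag scores
```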

2.2 NER for resource-poor languages

To build an NER model for low- and medium-resource languages, we either need to 1) annotate large amounts of data or 2) develop a model that produces good results for the resource-poor language while using a dataset in another language.

Annotation Projection can solve 1) in an automated fashion. This concept implies generating training data by projecting labels from a monolingual dataset in the source language (resource-rich) directly onto the target language (resource-poor).

On the other hand, in the Direct Model Transfer concept (solution 2), a model is trained on one or more languages and is directly applied to the target language [33]. The idea is to develop universal features that the model can apply to unseen languages [35]. Of course, combining the two methods (Annotation Projection and DMT) is also possible.

We discuss the achievements of both solutions below.⁷

2.2.1 Annotation Projection

The quality of the annotations highly depends on the quality of the parallel training corpora and the corresponding creation of word alignments. This method assumes that sentences in one language can be translated flawlessly into another language. Another big challenge for Annotation Projection is the scarcity of translation services and cross-lingual resources (like parallel datasets). As mentioned, several approaches in the field of Annotation Projection are based on lexicons and translation services (see, e.g., the work of Mayhew et al. (2017) [33]), on combining dictionaries and bilingual word embeddings (Xie et al. (2018) [49]), or on the "Translate-Match-Projection" method (see Jain et al. (2019) [18]). Each work depends on common characters between the source and target language. Xie et al. address the challenge of word order in cross-lingual transfer with an added "self-attention" layer.

2.2.2 Direct Model Transfer

Direct Model Transfer (DMT) builds on the idea of generating language-independent features for cross-lingual NER labelling. One well-known publication in this field comes from Tsai et al. (2016). Their approach, called Wikification using Multilingual Embeddings (WikiME), connects words to categories of Wikipedia entries. Monolingual word embeddings are trained by taking Wikipedia articles and creating embeddings for the words, the title, and integrated hyperlinks referencing other Wikipedia entries. With little supervision and the aid of a title embedding, word embeddings are aligned between two languages. A benefit of this method is that it enables alignment for languages for which no machine translation exists. Finally, given a word, the NER model can derive, for example, possible English titles and choose the best-matching one based on contextual factors and their frequency in the corpus. The title then allows the entity labelling [44]. This system has the potential to use different source languages (the ones on Wikipedia), or a combination of them, to train the NER model.

Figure 1: Training an NER model for resource-poor languages and resource-rich languages

⁷A more detailed review can be found in our Research Project: Jowita Podolak and Philine Zeinert, Cross-Lingual Named Entity Recognition, 2019.

Tsai et al. show how shared features between languages can be created by using Wikipedia as a resource. Other approaches come from Bharadwaj et al. (2016), who use the phonetic alphabet as a link between source languages and the target language [5]. Täckström et al. (2012) employ a large number of monolingual word clusters and align them by identifying consistent words across languages [45]. Cotterell and Duh (2017) create connections between closely related languages based on their character representation, which allows applying their neural CRF model to low-resource languages sharing a similar alphabet [9].

Direct Model Transfer approaches still seem challenging to implement in a way that competes with the mentioned Annotation Projection solutions (which are, in fact, monolingual). Nonetheless, DMT steps away from creating large training datasets and building models for each language. Instead, it offers a more efficient, cross-lingual approach.

3 Cross-lingual Natural Language Processing

3.1 Word representations

Figure 2: Words as vector representations (distributed). Source: https://neptune.ai/blog/document-classification-small-datasets

The choice of word representation, i.e., a transformation of raw text into a format understandable by machines, is the very first step in any NLP experiment. There is a variety of techniques for obtaining them, some more applicable to our purpose than others. For instance, simple word representations, like a dictionary (i.e., assigning IDs to each word) or one-hot encoding (creating a vocabulary-size vector filled with zeros except for a single one), are not suitable for deep learning methods. Dictionaries may result in a model incorrectly assuming the existence of a natural ordering, and one-hot encoding produces huge, sparse vectors that require a lot of memory. Moreover, neither of those techniques has a notion of similarity between words. A big change came with the introduction of Word2vec [34] and its distributed word representations. Not only are they compact (with each value within representing some distinct informative property⁸), but they are also capable of capturing similarities: for distributed word representations, if words are close to one another in the vector space, they are similar to one another (see Figure 2). Building on Word2vec, research followed with further improved packages, and in all of them vectors are generated with unsupervised training. This is vital for time-efficiency: in contrast to supervised learning, which usually uses human-labelled data, unsupervised training requires a minimum of human attention and no labelling. There are two types of distributed word representations: static and dynamic (contextualised).

⁸http://www.davidsbatista.net/blog/2018/12/06/Word_Embeddings/
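A small numerical illustration of the contrast drawn above (all vectors are toy values, not trained embeddings):

```python
import numpy as np

# One-hot vectors: every pair of distinct words is equally dissimilar.
cat_1h, dog_1h = np.zeros(10000), np.zeros(10000)
cat_1h[17], dog_1h[42] = 1.0, 1.0
print(cat_1h @ dog_1h)  # 0.0 - no notion of similarity at all

# Dense distributed vectors: similarity becomes measurable.
cat = np.array([0.8, 0.1, 0.6])
dog = np.array([0.7, 0.2, 0.5])
tea = np.array([-0.4, 0.9, 0.0])

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(cat, dog))  # close to 1: related words lie close in the space
print(cos(cat, tea))  # near 0 or negative: unrelated words lie far apart
```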

3.1.1 Static word embeddings

We describe this group of embeddings as 'static' because the same word will always have the same representation regardless of the context in which it occurs. The three most popular 'classical' word embeddings are the aforementioned Word2vec, GloVe, and fastText (see Figure 4). Each has certain flaws. Word2vec, for instance, does not fully exploit global statistical information and, just like GloVe, fails to provide any representation for words that are not seen in training. fastText, on the other hand, works well for words that were not supplied during training because of subword information: instead of learning each word directly, it represents each word as a set of character n-grams. This allows embeddings to capture the meaning of short words and to understand prefixes and suffixes. [6] However, like all static word embeddings, it does not capture polysemy (the capacity of a single word or phrase to have multiple meanings). This means that static word embeddings work simply as lookup dictionaries - one vector belongs to each word, no matter the context. Therefore, if we try to use static word embeddings to represent the following sentences:

Would you like a cup of tea?

and

The 2020 Women's World Cup final was held in Australia.

'cup' will have the same vector in both sentences. This exemplifies how much meaning we lose in text when the context is not considered. It is for this reason that traditional word embeddings fall short: they have only one representation per word and therefore cannot capture how the meaning of a word changes based on the surrounding context.
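A sketch of this lookup behaviour, assuming the gensim library and locally downloaded fastText vectors in word2vec text format (the file name is hypothetical):

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec")

# A static model is a pure lookup table: the surrounding sentence never
# enters the call, so 'cup' (vessel) and 'cup' (trophy) get one vector.
cup_in_tea_sentence = vectors["cup"]
cup_in_world_cup_sentence = vectors["cup"]
assert (cup_in_tea_sentence == cup_in_world_cup_sentence).all()

# Similarity queries are possible, but always context-free:
print(vectors.most_similar("cup", topn=3))
```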

3.1.2 Dynamic word embeddings

The remedy for the issue of polysemy is dynamic (contextualised) word embeddings, where the word vectors vary across different sentences. Rather than pre-trained word embeddings, we use a pre-trained model which, each time it sees a sentence, assesses the context of the words contained in it. Contextualised word embeddings address the polysemous nature of words and thus capture an even more meaningful representation of words.⁹

Language models are therefore the backbone of dynamic word embeddings. The core objective of language modelling is language understanding through hierarchical representations. Thus, a language model contains both low-level features (word representations) and high-level features (semantic meaning), whereas static word embeddings only contain low-level features.

⁹https://towardsdatascience.com/from-pre-trained-word-embeddings-to-pre-trained-language-models-focus-on-bert-343815627598

Traditionally, a language model is a probabilistic model which assigns a probability to a sentence or a sequence of words, always aiming to predict the next word given the previous sequence of words (essentially predicting words for a blank - see Figure 3). It can do this because language models are typically trained on massive datasets in an unsupervised manner.¹⁰

Figure 3: Masked language modeling
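As a quick illustration, masked-word prediction can be tried directly with the Hugging Face transformers pipeline (a sketch; the checkpoint name anticipates the mBERT model discussed in chapter 4):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# The model assigns probabilities to candidate words for the blank.
for pred in fill_mask("Greta Thunberg is Time Person of the [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```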

One limitation of dynamic embeddings is that they cannot be used without sentence-level context. For example, dynamic embeddings cannot be directly used to solve tasks such as word similarity or analogy.¹¹ Known dynamic word embeddings include, for instance, ELMo and BERT (see Figure 4).

Figure 4: Word embedding systems

¹⁰https://towardsdatascience.com/machine-learning-text-classification-language-modelling-using-fast-ai-b1b334f2872d
¹¹https://www.groundai.com/project/using-dynamic-embeddings-to-improve-static-embeddings/1

3.2 Cross-lingual learning

Cross-lingual word embeddings transfer knowledge across different languages by representing words in a joint embedding space. This common representation space enables transfer between languages, e.g., between resource-rich and low-resource languages [41]. In recent years, researchers have proposed various models for learning cross-lingual representations. In the following section, we divide them into two kinds: one that originated from static word embeddings - aligned static word embeddings - and one that is essentially a language model trained with data in more than one language - multilingual language models (see Figure 5 below).

Figure 5: Static, dynamic, multilingual and cross-lingual word representations

3.2.1 Aligned static word embeddings and cross-lingual language models

Figure 6: Aligning word embeddings. Source: https://github.com/facebookresearch/MUSE

Supervised training of aligned word embeddings (see supervised learning in the overview in Figure 7) is done, e.g., with parallel corpora (sentences in one language paired with the same sentences in a second language). However, these are costly to generate, especially for resource-poor languages for which no machine translation system is in place. Contrarily, monolingual data (for unsupervised learning) is nowadays easily available online (bottom part of Figure 7). A breakthrough was the idea of leveraging orthogonal Procrustes analysis, which finds the rotation that superimposes (overlaps) two or more objects in space.¹² Thanks to this idea, it is enough to generate word embeddings in each language (monolingually) and apply a mapping, learnt from a small parallel dictionary of even fewer than 100 words, that aligns them (see weakly-supervised learning in Figure 7). [24] The third method, an unsupervised one, works with adversarial training and refinement. [8]
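A minimal sketch of this orthogonal Procrustes step on toy data (the closed-form SVD solution is standard; dictionary size and dimensionality here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# X: source-language vectors, Y: target-language vectors for a small seed
# dictionary; row i of X and row i of Y are translations of each other.
X = rng.normal(size=(100, 300))
Y = rng.normal(size=(100, 300))

# The orthogonal map W minimising ||XW - Y|| is W = U V^T,
# where U S V^T is the SVD of X^T Y (Schoenemann, 1966).
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

aligned = X @ W  # source vectors rotated into the target embedding space
print(np.allclose(W @ W.T, np.eye(300)))  # True: W is a pure rotation/reflection
```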

Both weakly-supervised and unsupervised methods are implemented and pre-trained, for example, in the MUSE Python library¹³ for multilingual word embeddings (aligned word embedding files can simply be downloaded as text files), using monolingual word embeddings from fastText. [24] Unfortunately, their performance drops for languages that are not closely related, and likewise when the number of languages grows.¹⁴ Moreover, those solutions suffer from the 'hubness problem': some word vectors tend to be nearest neighbours of many other words, which degrades the accuracy of finding the 'correct' neighbour. [11]

Pre-trained language models can learn from sources for unsupervised training, which are available 'in the wild'.¹⁵ The only difference between the monolingual and multilingual versions is the number of languages on which we train: one for monolingual, two or more for multilingual. For example, Multilingual BERT (mBERT, see Figure 8) was trained on monolingual corpora in many languages. It has no information about the input language, and it does not have any explicit mechanism to encourage similar representations between languages [38]. The answer to the question of how it is capable of learning cross-lingually without parallel data and a rotation function is still unclear. Researchers hypothesise that word pieces used in all languages (numbers, URLs, etc.), which have to be mapped to a shared space, force the co-occurring pieces to be mapped to that shared space as well, spreading the effect to other word pieces until different languages are close in a shared space. [38]

3.2.2 Cross-lingual versus multilingual models

Multilinguality removes the need to train multiple models for a collection of languages. A model which has access to multilingual information can learn generalisations across languages, in a similar way that learning a new language can enhance a speaker's proficiency in previously learnt languages.

Figure 7: Training techniques of cross-lingual word representations

Cross-lingual embeddings attempt to map words that mean the same thing in different languages to almost the same vector. In contrast, multilingual embeddings do not guarantee any interaction between different languages, i.e., vectors are not intentionally overlapped. Aligned word embeddings are explicitly constructed to be cross-lingual (e.g., with a rotation function, as mentioned before). Pre-trained language models, on the other hand, are multilingual and only indirectly cross-lingual: it is likely that they become somewhat aligned even though they were not directly designed for it (see Figure 5).

¹²https://github.com/facebookresearch/MUSE
¹³See footnote 12.
¹⁴https://engineering.fb.com/ml-applications/under-the-hood-multilingual-embeddings/
¹⁵Online services such as Wikipedia, social networking, gaming, and Internet fora provide large volumes of data.

Figure 8: Cross-lingual pre-trained systems

3.2.3 Language-agnostic system

A truly language-independent system works equally well on all languages. If there are variations in results for different languages, the system probably makes implicit assumptions about language structure, i.e., it incorporates features that are specific to certain languages or groups of languages. The more language-independent the model, the higher the overall cross-lingual performance - which is why we strive for a language-agnostic system in this work. Moreover, developing such a system could teach us something about the nature of human language and about what human languages have in common. [4]

Systems are often wrongly deemed 'language-independent' based solely on the fact that no linguistic knowledge was encoded in the training dataset (so it is not part of the model's learnt knowledge). However, this does not ensure language independence. [4] For instance, at first glance, word-based n-gram models appear not to encode language-specific knowledge. However, experiments proved that the effectiveness of n-gram models is higher for languages with relatively less elaborate morphology and relatively fixed word order, like English. [4] If language-independence cannot be claimed based on the architecture, it therefore has to be proved explicitly - with experiments.

3.3 Transfer learning

In this section, we consider how to leverage the language understanding abilities of pre-trained language models in an NER model.

Humans have an excellent ability to transfer knowledge across tasks. Learning new things is much easier due to similar (but not identical) tasks learnt in the past. We rarely learn anything from scratch; for instance, knowing how to ice skate makes roller skating much easier.

In traditional machine learning, learning occurs purely on specific tasks and datasets, resulting in separate, isolated models - we do not retain any knowledge that could be transferred from one model to another. Contrarily, the concept of transfer learning embodies the learning abilities of humans. By reusing a pre-trained model (features, weights) for training new models, we not only overcome the problem of no or little data for classification tasks but also speed up the learning process (see Figure 9 below).

Figure 9: Traditional learning vs. transfer learning. Source: https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a

This concept is especially important in natural language processing: for every NLP task, language understanding is the foundation. Just like a child needs to understand language before learning to describe sights or tell stories, a model has to learn the language first in order to perform further language tasks. However, learning a language requires much more data than most annotated datasets for text classification can offer.

Since language has to be learnt by the text classification model, it is wiser to use systems that were built especially for that purpose, as a form of pre-training of the model (see Figure 10). Provided that our dataset is not drastically different in context from the original dataset, the pre-trained model will already have learned features that are relevant to our task.

Earlier, we distinguished two ways of learning and representing a language: static word embeddings and dynamic word embeddings (pre-trained language models). In transfer learning, static word embeddings, pre-trained on large amounts of unlabelled data, are used to initialise only the first layer of a neural network, and the rest of the model needs to be trained from scratch (with data for a particular task - an NER dataset in our case). After such a shallow representation, an NLP model still requires big datasets to achieve good performance.¹⁶
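A sketch of this 'shallow' initialisation in PyTorch (the random matrix stands in for vectors loaded from, e.g., a fastText file; all sizes are illustrative):

```python
import torch
import torch.nn as nn

pretrained = torch.randn(5000, 300)  # stand-in for loaded static vectors

# Only the embedding layer starts from pre-trained knowledge ...
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
# ... while everything above it must still be trained from scratch.
encoder = nn.LSTM(300, 128, batch_first=True)
tagger = nn.Linear(128, 7)

tokens = torch.randint(0, 5000, (1, 6))
hidden, _ = encoder(embedding(tokens))
print(tagger(hidden).shape)  # torch.Size([1, 6, 7])
```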

With contextualised word embeddings, we can reuse all the layers, inheriting knowledge about both low-level features and high-level features (semantic meaning). So the advantage of using dynamic over static word embeddings in transfer learning lies not only in capturing polysemy but also in including more knowledge about language. In recent years, contextualised word representations such as ELMo, GPT, BERT, XLNet, and RoBERTa (see Figure 4 in the previous section) have been proposed to replace static word embeddings as input representations, resulting in massive performance improvements on many NLP tasks, text classification (like NER) among them.¹⁷

Another advantage of transfer learning in NLP, in addition to pre-trained language models, is the possibility of utilising text classification models trained on resource-rich languages. Thus, the advantages of transfer learning can be leveraged twice (see Figure 10).

Figure 10: Transfer learning in NER. The top approach is more efficient and provides better language understanding.

3.3.1 Fine-tuning

Fine-tuning is a process based on the transfer learning concept. We do it by taking a model that has already been trained on some data and retraining that model on new data. In NLP transfer learning, we usually train on a general language modelling task and then fine-tune on the text classification task. Language model training is a long process, so researchers often download pre-trained models (we experimented with fastText, GloVe, MUSE, mBERT, and RoBERTa pre-trained word embeddings and models). Next, to enable classification, a tag decoder is added. 'It takes context-dependent representations as input and produces a sequence of tags corresponding to the input sequence' [26]. Examples of tag decoder architectures include a simple linear layer, a softmax layer, conditional random fields (CRFs), and recurrent neural networks. Finally, we fine-tune by feeding new data into the obtained model.

¹⁶https://ruder.io/nlp-imagenet/
¹⁷https://towardsdatascience.com/machine-learning-text-classification-language-modelling-using-fast-ai-b1b334f2872d
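A sketch of this setup with the Hugging Face transformers library: a pre-trained encoder plus a freshly initialised linear tag decoder (the label set mirrors the BIO tags used in this thesis; the training loop itself is omitted):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)  # adds a randomly initialised linear decoder on top of the encoder

inputs = tokenizer("Greta Thunberg is Time Person of the Year",
                   return_tensors="pt")
logits = model(**inputs).logits  # one score per subword token and label
print(logits.shape)              # (1, sequence_length, 7)
```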

3.4 Cross-lingual NER

In conclusion, cross-lingual NER can be built by fine-tuning pre-trained multilingual language models with appropriate output layers (one or more), or by feeding aligned static word embeddings to the first layer of an NER model. NER datasets used for training can be found online for resource-rich languages. If they are unavailable, they can be created with Annotation Projection or by leveraging the Direct Model Transfer concept. The choice of the best source language or group of languages can, for example, be derived from defined heuristics or expert knowledge (see the examples in sections 2.2.1 and 2.2.2).

Language model embeddings based on a novel architecture called the Transformer are becoming a new paradigm for NER. In the next chapter, we explore Transformers, their architecture, and pre-trained multilingual models.

4 Transformer Models

4.1 Introduction

Transformers belong to the family of neural network architectures and are used for every task where an input sequence is processed into a transformed output sequence (sequence transduction). For this, it is necessary to have a memory that captures dependencies and connections between the elements (words) in a sequence (sentence). [12]

4.1.1 Delimitation of Transformers from RNN and CNN models

One of the first architectures able to capture dependencies were RNNs.¹⁸ However, because of the vanishing and exploding gradients problem, the model is biased towards the most recent inputs and forgets older ones. An RNN variant that remembers over longer periods uses Long Short-Term Memory (LSTM) cells: it can prioritise information in the chain of nodes by evaluating information and remembering or forgetting it within each cell.¹⁹ Nevertheless, LSTMs do not overcome the problem entirely and still forget words when sentences are long. Moreover, the described architecture does not allow parallel computation over the elements, which is an important feature for processing large corpora efficiently.

A further improvement of the RNN architecture is to include an attention decoder. Before, a single hidden state was created for each sentence. In contrast, with the attention technique every word is accompanied by a hidden state (containing the information processed so far from the sequence). It assumes that every word can carry important information and that the network can pay a different amount of attention to each word. Nonetheless, processing words in parallel is still not possible with this architecture, and the processed information is limited to just one side of the word within the sentence.

The alternative to RNNs are Convolutional Neural Networks (CNNs), which allow parallelism by processing the words of the input sequence independently of each other and then merging the word information in a kind of bottom-up tree structure. The execution time is much better in comparison to RNNs (O(log N)), but only local dependencies between neighbouring words are stored along the way.²⁰

Transformers' edge lies in eliminating recurrence and convolution and replacing them with the self-attention technique. Thus, they combine the parallel computation of Convolutional Neural Networks (CNNs) with the attention technique developed for Recurrent Neural Networks (RNNs).

¹⁸See section 2.1.2.
¹⁹https://medium.com/@purnasaigudikandula/recurrent-neural-networks-and-lstm-explained-7f51c7f6bbb9
²⁰https://towardsdatascience.com/transformers-141e32e69591

4.1.2 The learning process in Transformer models

The Transformer architecture has an encoding stage and a decoding stage. The computation of the elements in a sequence is done in parallel (like in CNNs), e.g., in Figure 11 for "Je", "suis" and "étudiant".

Figure 11: Encoding and decoding in Transformer models. Source: https://towardsdatascience.com/transformers-141e32e69591

Each element is then processed through a stack of six encoders and six decoders, each having a self-attention layer and a feed-forward neural network layer. The decoders also have an encoder-decoder attention layer between the two aforementioned layers (see Figure 12). They process the hidden state created for each word in the sequence, which allows the decoding phase to also take into consideration the knowledge from other words in the sequence while decoding a label for the current word.²¹

²¹https://towardsdatascience.com/transformers-141e32e69591

Figure 12: Encoders and decoders in Transformer models. Source: https://towardsdatascience.com/transformers-141e32e69591

In the self-attention layer itself, each word is processed together with a query, a key, and a value vector. All three are much smaller than the word vectors (for space-efficiency) and can be thought of as metadata of the word. The initial values of the vectors are calculated with the aid of matrices learnt in the training phase. For each word in the sequence, the query vector of the word is multiplied with each word's key vector in the sentence (see Figure 13 on the right side). Afterwards, the product is normalised and multiplied with each word's value vector. The output of the self-attention layer is the sum of all weighted values (z1 in the figure, representing the hidden state), which is passed on to the feed-forward layer. The extracted information can be visualised as in the sentence "I kicked the ball" (see Figure 13 on the left side): while "I", "kicked" and "ball" carry important information for the meaning of the word "kicked", "the" is rather irrelevant.

Figure 13: Self-Attention in Transformer models. Source: a simplified version of http://jalammar.github.io/illustrated-transformer/
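A toy NumPy version of the computation just described, for a four-word sentence such as "I kicked the ball" (dimensions follow the original Transformer; the random matrices stand in for weights learnt in training):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))  # one 512-dim vector per word
W_q, W_k, W_v = (rng.normal(size=(512, 64)) / np.sqrt(512) for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v  # query, key, value vectors (4 x 64)

scores = Q @ K.T / np.sqrt(64)                 # query-key products, scaled
scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

z = weights @ V  # each word's output z_i: a weighted sum of all value vectors
print(z.shape)   # (4, 64) - one hidden state per word
```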

Apart from the described computational steps, positional encoding of the words within sentences takes place before feeding a sentence to the first encoder. It injects information about each word's position into the otherwise word-order-independent computation. Furthermore, within Transformer architectures a multi-head attention mechanism is used, which can address multiple queries like "Who" and "Where" in parallel, including several query, key, and value vectors (see Figure 13, right side, again). Transformers make use of eight heads, which are initialised with random values and then trained depending on the task.²²
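As an illustration, here is the sinusoidal positional encoding from the original Transformer paper (the text above does not commit to a particular scheme, so this is one standard choice):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]        # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]     # index of each dimension pair
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe

# The position pattern is simply added to the word embeddings, giving the
# otherwise order-blind self-attention access to word order.
embeddings = np.random.default_rng(0).normal(size=(6, 512))  # 6 words
encoder_input = embeddings + positional_encoding(6, 512)
print(encoder_input.shape)  # (6, 512)
```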

4.1.3 Transformers for multilingual NLP tasks

Recent research has focused significantly on Transformer architectures. According to the GLUE benchmark²³ records for English, Transformer architectures dominate over RNN and CNN solutions for language understanding tasks. Also in the field of cross-lingual language understanding (XLU), Transformers are predominantly used, jointly trained on many languages. [7]

There are two dominant, massively multilingual pre-trained models. First, the masked language model BERT from Google Research has a multilingual version, called mBERT, trained on 104 languages.²⁴ Furthermore, Facebook published at the beginning of 2019 the Cross-lingual Language Model (XLM), providing different pre-trained models, mostly on two languages, but also one trained on the XNLI dataset with 15 languages [23]. Later the same year, they published an improved version presenting a combination of RoBERTa and XLM (XLM-R). It has two available pre-trained versions (a 12-layer and a 24-layer neural network), trained on 2.5 TB of CommonCrawl data covering 100 languages. [7] Both models are trained without any cross-lingual supervision [7] (see Figure 7). The next section describes those multilingual models in more detail.

4.2 Multilingual Transformer models

4.2.1 mBERT

Multilingual BERT is based on the monolingual Transformer model BERT, a 12-layer neural network. BERT's architecture is divided into two main sections and training phases. First, in pre-training, language representations are generated in an unsupervised manner from a dataset consisting of a huge number of sentences. Secondly, in the fine-tuning step, labelled data is used to train the model, initialised with the parameters from the pre-training step, for a specific task (see section 3.3.1). While the architecture of the pre-training phase stays the same, it differs for the specific downstream tasks (see Figure 14). [10]

22 http://jalammar.github.io/illustrated-transformer/
23 General Language Understanding Evaluation - a benchmark evaluating models over several tasks and datasets; further information can be found here: https://gluebenchmark.com
24 See training languages here: https://github.com/google-research/bert/blob/master/multilingual.md


Figure 14: Pre-training and fine-tuning procedure in BERT. Source: [10]

BERT uses the Masked Language Model (MLM) for the pre-training phase, learning by filling in missing words within sentences. For that purpose, input words are randomly annotated with [MASK] labels, marking the words that should be predicted, and a [CLS] label is placed at the beginning of each sentence (see Figure 15). Usually around 15 percent of the tokens are marked for prediction25. Additionally, for language understanding tasks where a relationship between two sentences is relevant, a [SEP] label is placed between two sentences to allow next sentence prediction. The labelled sentences are passed through the encoder layers of the Transformer. After training, the created vectors correspond to the masked words learnt during the pre-training phase. An existing drawback is therefore that the decoding in the fine-tuning phase can make only limited use of the pre-trained language representation when the intersection of masked words and words in the annotated dataset is small. [10]

Figure 15: Input format in BERT. Source: [10]
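
A simplified sketch of the random masking step is shown below. It only replaces selected tokens with [MASK]; the real BERT procedure additionally swaps some selected tokens for random words or keeps them unchanged.

import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    # Mark roughly 15 percent of the tokens for prediction and
    # prepend the [CLS] label to the sentence.
    random.seed(seed)
    masked, targets = ["[CLS]"], []
    for token in tokens:
        if random.random() < mask_prob:
            masked.append("[MASK]")   # the model has to predict this token
            targets.append(token)
        else:
            masked.append(token)
    return masked, targets

print(mask_tokens("the man went to the store".split()))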

In the fine-tuning phase, BERT uses the usual self-attention mechanism of Transformers, which allows using the pre-trained model and learning weightings of words depending on how relevant they are for the decoding. [10]

25 https://lena-voita.github.io/posts/emnlp19_evolution.html

BERT's ability to take the words to the right and left of the target word into consideration simultaneously during pre-training - and therefore to use both-sided information for studying the semantics of the word - makes it very attractive for multilingual text classification tasks that depend strongly on understanding semantics. In contrast, ELMo's [37] bidirectional approach uses feature-based extraction, reading first from left to right and then from right to left (with LSTMs) and concatenating the information afterwards. BERT is the only model among them taking left and right context into consideration simultaneously while learning, and it outperforms GPT [39] and ELMo on GLUE26 (in English) on all tasks. [10]

Multilingual BERT, abbreviated mBERT, is a model jointly pre-trained on 104 languages. Its training data consists of raw Wikipedia articles, from which a shared wordpiece vocabulary of around 120k tokens is derived, and no aligned word embeddings are used as additional input [19]. The input sentences do not contain any language tags, and all languages share the common wordpiece vocabulary. To overcome the high variation in training data size across the 104 languages, languages with many Wikipedia articles are sub-sampled, while languages with little text data are super-sampled using exponential smoothing. Besides that, BERT's authors claim that, because the Masked Language Model focuses learning strongly on semantics, mBERT does not require similar surface representations such as character overlap across languages. [10][38]
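
The sub- and super-sampling can be illustrated with exponentially smoothed sampling probabilities, as in the sketch below; the exponent of 0.7 is the value reported in mBERT's documentation, and the corpus sizes are line counts taken from Table 4.

import numpy as np

def sampling_probs(corpus_sizes, alpha=0.7):
    # Raising each corpus size to alpha < 1 shrinks the gap between
    # high- and low-resource languages: large Wikipedias are sub-sampled,
    # small ones super-sampled relative to their raw share.
    sizes = np.asarray(corpus_sizes, dtype=float)
    smoothed = sizes ** alpha
    return smoothed / smoothed.sum()

# English vs. a Breton-sized corpus: the raw share of ~99.8%/0.2%
# flattens to roughly 98.6%/1.4%.
print(sampling_probs([65_812_076, 149_477]))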

4.2.2 XLM-RoBERTa

XLM-RoBERTa is a blend of RoBERTa and an updated version of the original XLM model.27

The sentences and words of the training dataset are split into the most frequent subwords in the corpus across all training languages. The model is pre-trained on these subwords, which is claimed to be an improvement over mBERT, whose training dataset is not focused on containing similar words across languages. Moreover, in the XLM training phase, input samples are given in two languages simultaneously (in mBERT, samples are in one language after another). This allows the network to predict and learn a masked word in one sentence with the help of information from the other sentence, since different words are masked in the two input samples.

Additionally, XLM's input contains a language ID and positional information about the order of words within each sentence, which is meant to support the learning of semantic dependencies. Incorporating these two new properties into BERT's MLM objective, the authors name the resulting method Translation Language Modeling (TLM).28
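
A hypothetical sketch of how such a TLM input sample could be assembled is given below; the special tokens follow BERT's conventions for readability, and all names are illustrative rather than XLM's actual implementation.

def tlm_input(src_tokens, tgt_tokens, src_lang_id, tgt_lang_id):
    # Concatenate a translation pair into one training sample; both halves
    # are masked independently, so a word masked in one sentence can be
    # recovered from its translation in the other.
    tokens = ["[CLS]"] + src_tokens + ["[SEP]"] + tgt_tokens + ["[SEP]"]
    # Every token is annotated with the ID of its language ...
    lang_ids = ([src_lang_id] * (len(src_tokens) + 2)
                + [tgt_lang_id] * (len(tgt_tokens) + 1))
    # ... and with its position, restarting at the target sentence.
    positions = (list(range(len(src_tokens) + 2))
                 + list(range(len(tgt_tokens) + 1)))
    return tokens, lang_ids, positions

print(tlm_input("the cat sleeps".split(), "le chat dort".split(), 0, 1))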

26 https://gluebenchmark.com
27 https://towardsdatascience.com/xlm-roberta-the-multilingual-alternative-for-non-english-nlp-cf0b889ccbbf
28 https://towardsdatascience.com/xlm-enhancing-bert-for-cross-lingual-language-model-5aeed9e6f14b

XLM-RoBERTa enhances XLM by using a larger training set - 2.5TB of text, with a subword vocabulary of 250k tokens - and by training on 100 languages instead of 15. The second component, RoBERTa, is a robustly optimised version of BERT trained longer and on a larger corpus (by Liu et al. (2019)), achieving better results on tasks such as question answering.29 [25]

While mBERT was trained on Wikipedia articles, XLM-RoBERTa was built with the aid of CommonCrawl data from Wenzek et al. (2019) [25][7]. The number of intersecting languages between mBERT and XLM-R is 88. The XLM-R base version consists of 12 layers, while the large version is built from 24 layers. [7]

4.2.3 Multilingual Transformers for NER

We saw two multilingual Transformers pre-trained on more than 100 languages: mBERT and XLM-RoBERTa. Both models can be adapted for the NER task. On the CoNLL 2002/2003 datasets for English, Dutch, Spanish and German, the two models reach the F1 scores shown in Table 1:

Model              train  en     nl     es     de     Avg
Xie et al. (2018)  en     -      71.25  72.37  57.76  67.13
mBERT              each   91.97  90.94  87.38  82.82  88.28
                   en     91.97  77.57  74.96  69.56  78.52
XLM-R Base         each   92.25  90.39  87.99  84.60  88.81
                   en     92.25  78.08  76.53  69.60  79.11
XLM-R Large        each   92.92  92.53  89.72  85.81  90.24
                   en     92.92  80.80  78.64  71.40  80.94

Table 1: Results on Named Entity Recognition for different multilingual models

Train on “each” corresponds to monolingual scores, “en” to training/fine-tuning in English. Results for Xie et al. (2018) are incorporated into the table from Wu and Dredze [47]. Source: [7]

The results are also benchmarked against the previous state-of-the-art architecture from Xie et al. (2018), which uses a bidirectional LSTM network and a linear CRF output layer (mentioned in section 2.2.2); the LSTM network first encodes words at the character level. Even though named entities often look very alike across languages (compare, for example, Danish's Tjajkovskij with English's Tchaikovsky), which character-level encoding can exploit, the two Transformer models outperform the BiLSTM with CRF by design. The large version of XLM-R shows the best results but demands higher memory capacities. Comparing mBERT and the XLM-R base version, both having 12 layers, the results mostly go along with each other and are not far from XLM-R Large.

29 https://towardsdatascience.com/xlm-roberta-the-multilingual-alternative-for-non-english-nlp-cf0b889ccbbf


5 Previous research and positioning of our work

The problem of achieving language-independence for multilingual models remains, as of now, unsolved. As shown in section 4.2.3, cross-lingual entity recognition accuracy with state-of-the-art Transformer models is still below monolingual results.

This chapter is dedicated to existing research examining why cross-lingual NER transfer is so challenging. It is divided into a general overview and the positioning of our work within existing research, followed by two subsections providing more extensive insight into the outcomes mentioned in the overview: findings on cross-lingual language dependencies across architectures and NLP tasks (first part), and the same focused on multilingual Transformer models and NER (second part).

5.1 Overview

Principally, several publications show different NER F1 scores across language combinations, but they rarely investigate the underlying causes of those variances. Mainly, only Lin et al. (2019) show a more complete picture, for cross-lingual Entity Linking tasks, by indicating the importance of genetic, syntactic and geographic similarities between languages (using an LSTM architecture). Some other publications also mention the importance of shared word-order properties (see section 5.2).

Focusing on Transformer models, Pires et al. (2019) claim that mBERT is powerful for zero-shot cross-lingual NER where languages share typological similarities, especially word-order properties. Wu and Dredze (2019) extend Pires et al.'s experiments to more tasks and state that the projection of language knowledge depends not only on language similarities but also on subword overlap, contrary to Pires et al.'s findings; further, language-specific information is learnt in all layers. Similarly to Pires et al., K et al. (2019) show - with a modified artificial English language - that word overlap between languages has no influence on cross-lingual transfer learning, but they confirm an underlying correlation with the word order between languages (see section 5.3).

Acknowledging that all three Transformer model investigations focus on the BERT/mBERT model, differ notably in their experiment designs, and in most cases use English (or Fake-English) as the source transfer language, our experiments also look at XLM-RoBERTa as another multilingual Transformer model. Further, we investigate the impact of more linguistic attributes beyond word order and word overlap on a set of 31 languages, using knowledge of their main linguistic properties derived from a linguistic database. In this thesis we focus on Named Entity Recognition (since generalisation across tasks proves to be insufficient, see next section).

We hope to provide a more complete understanding of current cross-lingual dependencies through our experiments on current state-of-the-art Transformer models, and to inspire future work on developing massively language-independent models for NER.


Precisely, we:

1. test mBERT on further language combinations for fine-tuning and evaluation in zero-shot settings

2. investigate, next to word order, the impact of other syntactic features between languages

3. compare mBERT's language-dependence against the other pre-trained multilingual Transformer model, XLM-R (similar to Libovicky et al., see section 5.3)

5.2 Generalisation between languages in cross-lingual transfer

Language-agnostic systems are based on the idea of generalising across languages. As described in section 3.2.3, the only way to test their language-independence is to check whether they learn language-specific information, which shows up as accuracy variances across language combinations. For instance, experiments on mBERT showed that a larger training vocabulary improved the overall accuracy across different language pairs, indicating that this helps to remove existing language-dependencies - one motivation for the development of the XLM-R Large model (see section 4.2.2).

Evaluating language-independence is limited to languages with available annotated resources. Generally, as shown i.a. in Table 1, existing findings present cross-lingual results that are still far from their monolingual benchmarks [49][38]. To understand why some language pairs perform better than others, Lin et al. (2019) study the choice of the most suitable high-resource transfer language for a specific target language and task. They centre their experiments around phylogenetic and typological properties as well as word overlap and data size, using several existing linguistic feature databases as a base (e.g., Glottolog for geographical and genetic distances between languages or WALS for syntactic and phonological distance).

Their results in Table 2 show that the determining linguistic features differ across NLP tasks (here Machine Translation (MT), Entity Linking (EL), Part-of-Speech tagging (POS) and Dependency Parsing (DEP)). Moreover, for MT, the results for choosing the best transfer language are even better when using only dataset-related decision features (the first group, containing i.a. the word overlap score between the languages). Phonological features (represented by the phonological and inventory scores, using different databases) seem to depend highly on the quality of their data source, showing different results across the tasks. For EL, which uses two character-level LSTM encoders, geographical, genetic and syntactic features play a crucial role. [29]


Method        MT    EL    POS   DEP
word overlap  28.6  30.7  13.4  52.3
...           ...   ...   ...   ...
size ratio    3.7   0.3   9.5   24.8
genetic       24.2  50.9  14.8  32.0
syntactic     14.8  46.4  4.1   22.9
...           ...   ...   ...   ...
phonological  3.0   4.0   9.8   43.4
inventory     8.5   41.3  2.4   23.5
geographic    15.1  49.5  15.7  46.4

Table 2: Contribution of linguistic characteristics for choosing the right transfer language. Source: [29]

The experiment outcomes demonstrate that pointing at specific attributes causing language-dependencies across all tasks is not possible; a task-specific analysis seems more reasonable. Further, investigations of the determining linguistic properties depend significantly on the quality of the underlying databases.

In particular, word order is a persistent challenge for cross-lingual transfer across models. For example, the bidirectional LSTM with CRF layer architecture from Xie et al. (2018) (the previous state of the art in NER), working with static aligned word embeddings, enriched the Lample et al. (2016) model by adding a “self-attention” layer. It introduces a feature vector for the words' positions in the sentences, making the model independent of the word order of the source language, and outperforms Lample et al.'s original model [49][22]. Ahmad et al. (2019) compare an order-sensitive sequential RNN model against a non-order-sensitive self-attention model and show that the latter outperforms the RNN [1].

The aspects mentioned above concentrate on understanding the language-dependencies in cross-lingual learning of neural networks. Conneau et al. (2020) claim that generalisation across languages can also be influenced by the training process itself. The training size, especially for massively multilingual systems, cannot simply be increased with the growing number of languages - it is limited by the amount of time for processing the data and the available space. This is a significant cause of pre-trained multilingual models underperforming against systems that are pre-trained on just one language. When upgrading the Transformer model XLM from 7 to 100 training languages, the F1 score on the XNLI dataset drops by 4% [7]. Nonetheless, massively multilingual systems are especially attractive and in demand for cross-lingual applications requiring the involvement of several linguistically different languages or low-resource languages. A trade-off between performance and the number of training languages has to be made to achieve language-independence (demonstrated also by Artetxe et al. (2019)'s experiments [25]).

Lastly, Libovicky et al. (2020) present interesting findings about the language-independence advantage of contextual word embeddings over aligned static word embeddings. The latter are trained in a supervised manner and aligned with bilingual corpora, while contextual word embeddings are generated by mBERT in an unsupervised manner (see section 3.2.1). In both cases, the embedding for a single word is supposed to be identical across all languages, but their findings show that contextual word embeddings (as created in Transformer models) present fewer differences than static embeddings, benefitting from their training to generalise more strongly between languages [28] (also shown by Wu and Dredze (2019) [47]).

5.3 Cross-lingual NER in Transformer models

After gaining an insight into the challenges of generalising language representations in multilingual models, this subsection goes into more detail on specific findings for cross-lingual NER and its state-of-the-art models mBERT and XLM-R.

As already indicated by the experimental results in section 4.2.3, and further by UDify [20], a modified version of BERT developed to compute Universal Dependency treebanks for syntactic properties between languages, BERT and its multilingual version do learn language-specific information for language representations and tasks.

Pires et al. (2019) use CoNLL-2002/2003 and an additional in-house annotated NER dataset covering 16 languages to test possible dependencies between languages in cross-lingual NER computations with mBERT. Their zero-shot experiment results show no impact of alphabet or word overlap on prediction accuracy, but a correlation between the typological similarity of language pairs and F1 scores for POS tagging (another text classification task, for grammatical properties). The similarity was measured based on the set of typological features derived from WALS30. Especially a shared order of words in a sentence seems to leverage the quality of a generalised representation between languages: in their experiments, the language pairs31 EN - Bulgarian (BG) and BG - EN have scores (82.2% and 87.1%) 10 percentage points below the monolingual F1 scores, while EN - Japanese (JA) and JA - EN result in F1 scores almost half the monolingual ones. In contrast to Bulgarian, English does not share the same order of subject, verb and object with Japanese. Besides that, fine-tuning mBERT on POS tagging with Urdu (Arabic script) and testing it on Hindi (Devanagari script), which is not part of the 104 pre-trained languages, demonstrates mBERT's cross-lingual transfer ability, making it applicable even to unseen languages. [38]

Summarising Pires et al.'s findings, mBERT does create multilingual representations; however, they are flawed for certain language pairs, for instance those with different word order. Further experiments on the order of subject, verb and object (SVO/SOV) and of adjective and noun (AN/NA) were undertaken with several language pairs on POS tagging. The results can be seen in Table 3 and demonstrate that languages with SOV and NA structure depend more strongly on transfer from languages with the same order than the other way around. [38]

30 https://wals.info
31 Annotation: training language - test language

     SVO    SOV
SVO  81.55  66.52
SOV  63.98  64.22

     AN     NA
AN   73.29  70.94
NA   75.10  79.64

Table 3: POS accuracies when transferring between SVO/SOV languages or AN/NA languages. Row = fine-tuning, column = evaluation

Source: [38]

With comparable experimental settings for NER (also using the CoNLL 2002/2003 dataset, plus an additional Chinese dataset), Wu and Dredze (2019) investigate mBERT fine-tuned on five NLP tasks in English and evaluate the annotation outputs on the remaining languages. While their results for Dutch, Spanish and German are in a range of 69.56% to 77.57%, for the more distant Chinese the F1 score is 51.90%, leading to the conclusion that the degree of language similarity has an impact, at least for the tested languages.

An additional layer-wise analysis of how mBERT's pre-training keeps individual language information, tested on 99 languages, shows that mBERT is able to classify languages at every layer with an accuracy of 96%, demonstrating little variation across layers. Their experiments also probe the subword overlap between source and target language in mBERT and, in contrast to Pires et al., find that it impacts the performance in NER, POS tagging and Dependency Parsing [47].

In these experiments, mBERT was fine-tuned exclusively on English and therefore checked only for the overlap between English and the following languages: Dutch, Spanish, German, and Chinese. Pires et al.'s experiments, in contrast, tested all language pair combinations from a dataset corpus of 16 languages.

K et al. (2019) investigate BERT's cross-lingual transfer learning with a focus on pre-training, fine-tuning a bilingual EN-BERT on NER (in English) and evaluating it on target languages including Russian, Spanish, and Hindi (with the aid of LORELEI news and a social media dataset). They consider three categories in their experimental setups: linguistic similarities between the languages, the architecture of the network, and the input and learning objectives.

Interestingly, the results show, in contradiction to Pires et al. and Wu and Dredze, that no word-piece overlap is necessary, even in bilingual settings. These findings were obtained by introducing a “Fake-English”, representing English with shifted characters, avoiding any overlap with actual English texts.

Structural similarities between language pairs are investigated by looking at word-order and word-frequency properties. They use, once again, the artificial English language sharing no vocabulary or characters, and additionally change the order of words randomly, doing the same in the target language. Their findings for word order show that the more the word-order properties differ between Fake-English and the target language, the more the F1 scores drop. As for Pires et al., word order seems to have an impact on transfer learning. Moreover, further experiments on word frequency train a bilingual BERT on unigram word frequencies and evaluate the cross-lingual representations. This shows very low F1 scores for NER, demonstrating that word frequency alone cannot enable efficient cross-lingual transfer.

The last aspect of their probing experiments is the influence of the architecture, considering the number of layers in BERT and its number of parameters. The results show, as the other publications already did, that mBERT is able to identify languages across layers. Further, increasing the number of parameters can improve results; especially the depth is decisive. [19] Word order thus again appears to have an important impact on cross-lingual transfer, not just between fine-tuning and evaluation: K et al. also present experiments showing the same relationship for pre-training. Word overlap seems to be less influential.

Having demonstrated the language-independence of contextual word embeddings in the previous section, Libovicky et al. also compare multilingual Transformer models in their publication. Additionally, they train DistilmBERT (a BERT model with knowledge distillation [42], having 6 layers instead of 12) on the same training dataset as mBERT to check whether lowering the number of layers results in better generalisation across languages. By feeding the computed contextual word embeddings to a language ID classifier, they find that the resulting grouping corresponds mainly to the language families, with similar accuracy across the models: mBERT (96%), UDify (95.9%), DistilmBERT (95.3%) and XLM-R (95%) (also demonstrated by Kudugunta et al. (2019) [21]). Libovicky et al. suspect that mBERT's pre-training method containing next sentence prediction may be responsible for learning linguistic families, since deciding whether two sentences are adjacent demands deep language knowledge. [28]

It is the only publication providing a comparison between mBERT's and XLM-R's multilingual abilities, showing that both mBERT and XLM-R learn language-specific knowledge - especially linguistic information, which is also the basis for grouping languages into their families.


6 Experimental setup

As explained in section 3.2.3, the lack of linguistic knowledge encoded in the training dataset (no linguistic supervision) does not ensure language-independence. Thus, we set out to disprove language-independence with experiments. In this section, we first describe the typological features of our language set (ES features), WikiANN as our annotated NER dataset, and the initial experiments we ran to analyse the dataset's quality. Further, we define our hypotheses for the experiments and explain the experiments themselves: the Language-pairs experiment and the ES-features experiment. In them, we use mBERT and XLM-RoBERTa to see whether these models incorporate language-specific typological knowledge.

6.1 Typological properties of Indo-European languages

Linguists classify languages by their typological attributes. The most extensive database available online providing insights into the typological attributes of language groups is WALS (“The World Atlas of Language Structures Online”)32. With the goal of investigating typological similarities more thoroughly, we chose for our experiments the less extensive European Sprachbund feature set for Indo-European languages. Its size allows us to inspect all features while still covering a significant subset of languages.

6.1.1 Twelve features characterising the European Sprachbund

The European Sprachbund (ES) is a linguistic area characterised by the presence of Standard Average European (SAE) languages - a region belongs to the ES if its spoken language is identified as an SAE language. The ES features do not exist in most of the world's languages; instead, they can be observed mainly in the Indo-European languages - a language family present in Europe, the Middle East and South Asia, containing the Romance, Germanic and Balto-Slavic languages as well as the Balkan languages33. The twelve features below describe Standard Average European. They present syntactic or morpho-syntactic characteristics of languages, and a minimum of five features is required for a language to be labelled an SAE language. [16]

1. definite and indefinite articles, like “the” and “a” in English

2. relative clauses with relative pronouns, for example: “The man who drives the car”vs. “The man whose car I drove” in English - “man” is the head of the sentence thepronoun depends on

3. have-perfect, a tense using the verb “have” and the past participle form of a verb

32 http://wals.info
33 https://en.wikipedia.org/wiki/Indo-European

40

Page 42: Master Thesis: Developing a Cross-Lingual Named Entity … · 2020-06-22 · To build a Cross-lingual Named Entity Recognition system (NER), we first need to transfer successfully

4. nominative experiencers, which indicates that predicates can be general or inverting; e.g., “I like it” may point to an agent, but it could also be “It pleases me”, where it points to a goal (inverting)

5. participial passive, which is formed by a linking verb and a past participle form of theverb

6. anticausative prominence, meaning a verb depends on its causative member; i.e., “to break” can be used transitively (with a direct object) or intransitively (without)

7. dative external possessors, for example in German “Die Mutter wäscht dem Kind die Haare”, whose direct translation is “The mother is washing the child the hair” instead of “..the child's hair”. “The child” is the dative and “the hair” the accusative

8. negative pronouns with lack of verbal negation; e.g., in the English sentence “Nobody came” the verb itself is not negated - rather, negation is indicated by the pronoun

9. particles in comparative constructions, in English by using “than/that” or in Latin byusing “quam/qui”

10. relative-based equative constructions, for instance in German “so groß wie ich”, meaning “as tall as me”; the sentence uses a relative adverb, “wie”, to introduce a relative clause

11. subject person affixes as strict agreement markers, which means that subject pronouns cannot be dropped, giving a clear indication of the subject. A counter-example would be Spanish, where you could say “(Yo) te amo” for “I love you”

12. intensifier-reflexive differentiation: in English, “self” is used not just as an intensifier, i.e. “The president himself ...”, but also as a reflexive, “He loves himself”. But in a lot of languages, like Polish (“Sam prezydent ...” and “Kocha siebie”), different words are used for the two cases

The following map shows the number of shared features between languages34:

34 Full lists of features for each language are in Table 22 in the Appendices

41

Page 43: Master Thesis: Developing a Cross-Lingual Named Entity … · 2020-06-22 · To build a Cross-lingual Named Entity Recognition system (NER), we first need to transfer successfully

Figure 16: 9 characteristic features of SAE

Languages and the number of ES features they exhibit. The overview contains 9 of the 12 features. Languages with the same number of features do not necessarily have the same features.

Source: adapted from [16]

German and French contain most of the ES features. In general, the Romance and Balkan languages share more features with most of the other languages than the Northern languages and English do. The Slavic languages are furthest from the common ES feature nucleus. The well-known rule of a strict subject-verb-object order across the European languages is not among the features of the European Sprachbund.

6.2 Dataset

As we explained in section 2.1, languages and the differences between them are so complex that it is impractical to describe all the rules manually. For this reason, the development of deep learning techniques endorsing the concept of “The unreasonable effectiveness of data” (the idea that with enough data, every model works well) was an enormous push in NLP. [15]

In our research, we use a model that has already learnt languages, so when we feed NER-annotated datasets into it, it can focus solely on the NER task (instead of both language understanding and NER). Still, this does not mean that the fine-tuning dataset (i.e., the dataset on which we train a Transformer) can consist of just a few instances. One of the most popular gold standards for NER is the CoNLL-2002/2003 Shared Task. It consists of manually annotated datasets for English, German, Dutch, and Spanish, proven to have the most balanced corpus in terms of named entity classes among NER sources [3]. Unfortunately, our experiments require training data for many languages, including resource-poor languages for which human-annotated datasets are unavailable. On the other hand, manually annotating a few instances would not be sufficient in size, which is why we look at WikiANN, an existing NER-annotated dataset for many languages.

6.2.1 WikiANN

Machine translation and language modelling systems' performance grew much faster during the past few years than that of other machine learning tasks, like document classification or NER. The reason is that, for the latter, there is no training data available in the wild. Fortunately, the vast database of Wikipedia seems to come closest to the gold standard in terms of quality.

The cross-lingual name tagging and linking framework [36] using Wikipedia has produced the enormous WikiANN datasets. All datasets are publicly available35. Unlike CoNLL, they are not human-annotated and thus less reliable in terms of quality. They are, however, indirectly made by humans: they leverage Wikipedia articles and the properties of those articles, which were written and described (through rich properties) by people through crowd-sourcing.

There are two main steps in which entities are identified and assigned to a type through Wikipedia. Firstly, most named entities are labelled as anchor links and vice versa: the anchor links often label named entities. This is how named entities are identified. Secondly, each entry in Wikipedia is mapped to an external knowledge base that contains rich properties. Based on those properties, a named entity can be classified as Location (LOC), Person (PER), Organisation (ORG) or Other. For instance, an entity with a property “population” is more likely to be a Location. This is the step in which entities are classified into types. [36]

WikiANN's advantage over other NER datasets (CoNLL, OntoNotes 5.036) lies in the number of languages (282) and the size of its datasets - up to 200 times larger than CoNLL for English and around 20 times larger for other languages (depending on the number of training mentions on Wikipedia). From our set of languages derived from section 6.1, the only unavailable ones were Lezgian and Nenets. To use CoNLL (only four languages), we would have to perform Annotation Projection for each of the remaining languages to provide a training set of reasonable size; WikiANN does not require this additional step. Last but not least, it has the same source as the pre-trained model (mBERT was also trained on Wikipedia articles), so during pre-training and fine-tuning mBERT sees instances of similar domains, which leads to better results. Taking all of this into account, we decided to use WikiANN over other datasets.

35 Accessible here: https://elisa-ie.github.io/wikiann/
36 NER dataset generated mainly from news articles for English, Chinese and Arabic; can be found here: https://catalog.ldc.upenn.edu/LDC2013T19


6.2.2 Dataset preparation and cross-validation

WikiANN datasets are not uniform in length - they depend on the supply of a given language on Wikipedia. To remove biases (generally, bigger training datasets give better results than small ones), we decided to trim the datasets to the same size. Firstly, we looked into the number of lines and decided to make the cut-off at the Breton language, as the sizes between Icelandic and Breton differ significantly. In Table 4 we present an overview of the WikiANN datasets and their length in lines.37 One line corresponds to one word or to a sentence separation (an empty line). The WikiANN datasets' lengths go along with the number of Wikipedia articles in a given language.

After the cut-off, we still have a reasonable training size without sacrificing many languages. Thirty-one languages remained, and seven were dropped (Icelandic, Irish, Scots Gaelic, Komi, Maltese, Sardinian, and Udmurt). Breton became the smallest dataset, with around 150,000 lines and 17,003 sentences. The remaining languages were then shuffled and trimmed to 17,003 sentences (through shuffling we obtain sentences from random positions and thus cover more topics). For comparison, CoNLL's English dataset is 323,555 lines (218,608 training). Additionally, as a form of cross-validation, especially for the bigger datasets, we created four such 17,003-sentence subsets from the original WikiANN. We run all the experiments on all four datasets for each language and take a weighted average of their results.38

Language  no. of lines
Eng       65,812,076
Swd       15,506,369
...       ...
Tat       400,046
Ltv       373,321
Grg       341,823
Wel       252,458
Alb       199,242
Arm       183,837
Brt       149,477
Ice       84,944
Ir        69,973
...       ...

Table 4: WikiANN datasets and their length in lines. The cut-off is after Breton.
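
A simplified sketch of this pre-processing, corresponding roughly to take_sentences.py (footnote 38; the actual code may differ in detail):

import random

def sample_sentences(path, n_sentences=17_003, seed=0):
    # WikiANN files have one token per line; a blank line ends a sentence.
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                current.append(line)
            else:
                if current:
                    sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    random.seed(seed)          # seeded, so the subsets are reproducible
    random.shuffle(sentences)  # random positions cover a wider range of topics
    return sentences[:n_sentences]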

37 The whole list is in the Appendices, Table 21
38 Code for pre-processing can be found in take_sentences.py


6.3 Model and tools

6.3.1 Choice of parameters

The space of experimental settings for deep neural networks is so vast that finding the group of settings with the best performance poses a great challenge for researchers. Its components include parameters and hyperparameters, which are often, wrongly, used interchangeably. Parameters (e.g., weights and biases) are properties of the training data learnt during training. In contrast, hyperparameters (e.g., learning rate, number of epochs, hidden layers, activation functions, etc.) are not learnt during training but need to be set beforehand. To find good values for the hyperparameters, we can perform hyperparameter exploration, i.e., search across various hyperparameter configurations to find the set that results in the best performance. Given that the search space is vast, brute-force evaluation of each configuration (grid search) is too expensive. For that reason, it is often done with random search or more advanced techniques like Bayesian optimisation.

Instead of tuning hyperparameters experimentally, we may also take over the settings from the model we fine-tune, which often gives good results. If results are not satisfactory, the learning rate can be tweaked to a slightly lower value. To check whether our pre-set hyperparameters, taken over from Simple Transformers (see next section)39, are right, we perform tests on the CoNLL datasets and compare their results with benchmarks in the next section.

6.3.2 Tools

To fine-tune Transformers for NER task, we needed a tool that:

(a) provides Transformer architectures and their pre-trained models

(b) enables adding an NER layer on top of mBERT and XLM-R

(c) is flexible in the choice of hyperparameters

The two most popular frameworks, both with a big community behind them, are TensorFlow40 and PyTorch41. For our purpose, however, we could look for higher-level APIs, as we only needed the possibility of changing high-level settings, like the learning rate or the annotation scheme. Our goal is to compare performance between languages - not to find the best model - so the foundation architecture of mBERT and XLM-R could stay unchanged.

39 Default parameter settings can be found here: https://github.com/ThilinaRajapakse/simpletransformers
40 https://www.tensorflow.org
41 https://pytorch.org


The Transformers library42 from HuggingFace43 provides architectures together with pre-trained models, enabling the use of a given model directly, without the pre-training process. In addition, Transformers offers deep interoperability with TensorFlow 2.0 and PyTorch, so potential experiments in further work (e.g., extracting layers) could be added without much trouble. The Simple Transformers API, built as a wrapper around the Transformers library, abstracts away the complicated setup code even further while keeping room for the adaptation of the main configurations.44 45

6.3.3 NERModel

After setting up the environment for Simple Transformers, we were able to use its NERModel class. There are two mBERT models available in the Transformers library: cased and uncased. We chose the cased model trained on 104 languages, as recommended by the library. We provided a list of all named entity labels (classes) for the BIO scheme. The training was performed on a GPU to speed up computations. To the best of our knowledge, the only parameter different from the original mBERT hyperparameters is the number of epochs: because WikiANN has many training instances and they are similar, we use one epoch, whereas mBERT was trained with three epochs [38].
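
A minimal sketch of this setup with the Simple Transformers NERModel class is given below; the file names are placeholders, and the argument values reflect the settings described above.

from simpletransformers.ner import NERModel

# BIO labels for WikiANN's three entity types
labels = ["O", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-PER", "I-PER"]

model = NERModel(
    "bert",
    "bert-base-multilingual-cased",   # the recommended cased mBERT model
    labels=labels,
    use_cuda=True,                    # train on GPU to speed up computations
    args={"num_train_epochs": 1},     # one epoch instead of mBERT's three
)

model.train_model("train.txt")        # CoNLL-style file: one token and label per line
result, model_outputs, predictions = model.eval_model("eval.txt")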

6.3.4 Evaluation

Accuracy measures, acquired from statistics, enable the evaluation of algorithms and models in machine learning. The metrics we include in our experiments are accuracy, precision, recall, and F1 score; they are calculated from the numbers of true positives, true negatives, false positives, and false negatives in the results. Accuracy, which measures all correctly identified instances (true positives and true negatives), is not the best choice for the Named Entity Recognition task. NER datasets naturally have few positives (named entities) and many negatives (no entity). Therefore, systems that do not recognise any entities, or recognise only a small share of them, usually achieve high accuracy because of the many true negatives. This is misleading, so for our evaluation we use the F1 score instead (it considers both the precision and the recall of the test). We use a hold-out validation technique, i.e., training on one dataset and validating on another. Below, we present exemplary results of fine-tuning the model46:

42 https://github.com/huggingface/transformers
43 https://huggingface.co
44 https://towardsdatascience.com/simple-transformers-introducing-the-easiest-bert-roberta-xlnet-and-xlm-library-58bf8c59b2a3
45 https://towardsdatascience.com/simple-transformers-named-entity-recognition-with-transformer-models-c04b9242a2a0
46 Each model we run is saved in the folder results as a pytorch_model.bin file. The evaluation over labels is in eval_results.txt


           precision  recall  f1-score  support
ORG        78         74      76        690
LOC        94         97      96        3029
PER        93         95      94        1370
micro avg  92         93      93        5089

Table 5: Exemplary results from eval_results.txt

WikiANN's annotation scheme is less fine-grained than other annotation schemes (see the introduction to section 2). It recognises the following entities:

1. ORG (Organisation)

2. LOC (Location)

3. PER (Person)

When comparing results, we use the micro weighted average to adjust for class imbalance. Class imbalance occurs when there is a disproportion in the number of observations for each class, i.e., there are many more examples of one class than of the other classes. For instance, if we took an average of the results for each entity type (3 averages) and the number of PER entities was much higher than the other two, the PER results would bias the final result. The micro weighted average solves this problem by aggregating the contributions (support in Table 5) of all classes to compute the average.
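
As an illustration, the entity-level metrics can be computed with the seqeval library (one common choice; Simple Transformers reports these metrics itself), here on two toy sentences:

from seqeval.metrics import classification_report, f1_score

# Gold and predicted BIO tag sequences for two toy sentences
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-LOC", "O"]]

print(classification_report(y_true, y_pred))  # per-class scores with support
print(f1_score(y_true, y_pred))               # micro-averaged F1 over all entities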

6.4 Exploratory work

We ran an experiment on the CoNLL datasets (Table 6). The pre-processing included the removal of DOCSTART lines and a change to the commonly used BIO47 annotation scheme (to make it equal to WikiANN).48 We compare our results against those from Pires et al. (2019) [38]49 (Table 7). The aim is to find out whether our hyperparameters are adequately tuned.50
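
A sketch of the two pre-processing steps, assuming CoNLL's original IOB1 tags (the helper names are illustrative):

def strip_docstart(lines):
    # Drop the -DOCSTART- marker lines used in CoNLL files.
    return [line for line in lines if not line.startswith("-DOCSTART-")]

def to_bio(tags):
    # In CoNLL's IOB1 scheme an entity may start with an I- tag;
    # in BIO (IOB2) every entity starts with a B- tag.
    converted, previous = [], "O"
    for tag in tags:
        if tag.startswith("I-") and (previous == "O" or previous[2:] != tag[2:]):
            tag = "B-" + tag[2:]
        converted.append(tag)
        previous = tag
    return converted

print(to_bio(["I-PER", "I-PER", "O", "I-LOC"]))  # ['B-PER', 'I-PER', 'O', 'B-LOC']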

47 See the introduction to section 2
48 An example dataset can be found in Table 20 in the Appendices
49 See section 5.2
50 For these experiments, all used WikiANN datasets were trimmed to 10% of their original size, making them closer to CoNLL sizes. The original sizes are in the Appendices, Table 19.


Fine-tuning/Eval  Eng  Dut  Spn
Eng               95   79   73
Dut               73   89   72
Spn               66   67   86

Table 6: F1 scores for CoNLL dataset

Fine-tuning/Eval  Eng  Dut  Spn
Eng               90   77   74
Dut               65   90   72
Spn               65   64   87

Table 7: F1 scores for CoNLL dataset from Pires et al. (2019)

We obtained results comparable to those of Pires et al.; therefore, we decided to keep the values of the original parameters.

Fine-tuning/Eval  Eng  Dut  Spn
Eng               88   ?    64
Dut               72   93   86
Spn               70   84   92

Table 8: F1 scores for WikiANN dataset - large datasets

Due to technical issues with the large datasets, the results for Eng - Dut were not reasonable.

Next, the results from the WikiANN datasets are presented in Table 8. They are close to those obtained on CoNLL. However, our validation set is not fully reliable, as it is simply another subset of the WikiANN dataset (hold-out), while the traditional method of evaluation is to compare against human performance (a manually annotated test set). Since there are no manually annotated instances of WikiANN, we chose to evaluate on CoNLL's test sets (which are human-annotated).

These performance results (Table 9) are significantly lower than before. Because the previous results were very close to Pires et al.'s on CoNLL (and in some cases even a few percentage points better), we suspect that the reason for the drop in performance lies in variances between data sources (context variability)51 rather than annotation quality: CoNLL was built from Reuters news stories (for English), not from Wikipedia.

51 For most corpora there is a significant drop in performance for out-of-domain training


Fine-tuning/Eval  Eng  Dut  Spn
Eng               47   ?    26
Dut               52   48   62
Spn               52   42   48

Table 9: F1 scores for WikiANN dataset evaluated on CoNLL

6.5 Hypotheses

Based on our previous literature exploration, we present four hypotheses, which we test with our experiments.

Hypothesis 1: Fine-tuning and evaluating mBERT on the same language shows better results than doing the same on different languages. This hypothesis tests mBERT's general language-independence. From previous research, and also from our pre-experiments on the CoNLL dataset, we could see that cross-lingual F1 scores are still below monolingual F1 scores. We expect to observe this phenomenon with our chosen subsets of WikiANN as well.

Hypothesis 2: Language pairs which share several ES features outperform language pairs with fewer shared features. Hypothesis 2 corresponds to the previous publications (see section 5.2) investigating the multilingual functionality of mBERT, with the outcome that mBERT seems to be sensitive to typological similarities. We expect to see the same results for our WikiANN datasets, based on the typological features describing the main linguistic characteristics of Indo-European languages.

Hypothesis 3: There are specific features from the European Sprachbund that demonstrate a dependency between the F1 score and the choice of fine-tuning and evaluation languages. Proving this hypothesis would offer insight into where mBERT still learns language-specific features for entity labelling. We add the word-order characteristic of the language as a thirteenth feature to compare word-order results with the findings from section 5.

Hypothesis 4: We expect the same results as in experiments 1 and 2 for the other introduced Transformer model, XLM-RoBERTa. The base model of XLM-RoBERTa is, like mBERT, a 12-layer Transformer model with eight heads, 768 hidden states and a sequence length of 128. From previous research, we could see that XLM-RoBERTa performs similarly to mBERT on NER tagging on the CoNLL dataset (see section 4.2.3). On the one hand, mBERT's advantage is the higher overlap of words between the pre-training and fine-tuning phases (the source for both mBERT and WikiANN is Wikipedia). On the other hand, the power of XLM-RoBERTa is its much larger training dataset and its Translation Language Modeling technique with language IDs (see section 4.2.2). We therefore cannot be sure which model will perform better, but we expect the differences to be small, as the advantages of the two models (the shared source for mBERT and the training size for XLM-RoBERTa) should balance each other. We repeat the feature experiments as well, to see whether both models show language-dependencies in fine-tuning and evaluation for the same features or whether the different pre-training and architectures have an influence.

6.6 Design of experiments

All of our experiments were done under zero-shot settings, where mBERT has not been fine-tuned on the evaluated language; monolingual settings are included just for comparison. Furthermore, the parameter values determined in the previous section remained the same over all experiment design points. How the design points were chosen for each experiment setup is explained in the next sections. First, the language-pairs experiments (hypotheses 1 and 2) are meant to discover whether mBERT is language-independent. Next, the ES-features experiment (hypothesis 3) tries to find out which features have an impact on the Transformer, making it language-dependent.

6.6.1 Typological similarity between languages experiment

We test the effect of typological similarity between languages, derived from the European Sprachbund, on mBERT's fine-tuning for NER (hypotheses 1/2). For that, we fine-tune and evaluate on language pairs sharing many typological characteristics with each other and on language pairs sharing just a few or no features.

Language pairs were identified by their number of shared features, using euclidean distance measurements between the language vectors. Those vectors are binary vectors based on the European Sprachbund features - each vector has a dimension of 12, with a 1 where a language has a given feature and a 0 where it does not (a true/false representation based on the research from Haspelmath (2001) [16]). The full list of used language vectors is in Table 22 in the Appendices.
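
Because the vectors are binary, the squared euclidean distance between two languages simply counts the features on which they disagree. The sketch below uses illustrative vectors; the real values are in Table 22.

import numpy as np

# Illustrative 12-dimensional binary ES-feature vectors (not the real values)
vectors = {
    "Ger": np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
    "Eng": np.array([1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0]),
}

distance = np.linalg.norm(vectors["Ger"] - vectors["Eng"])
shared = 12 - int(round(distance ** 2))   # squared distance = differing features
print(distance, shared)                   # distance 0.0 would mean 12 shared features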

In Table 10, we group language pairs based on the number of features they share. For instance, there exist 17 language pairs that have the same values for all 12 features (first row). We run the experiments on 1, 2, 5, 7, 11 and 12 shared features. For 11, 7 and 5 shared features, we set an upper limit of 40 randomly selected language pairs. We filter out outliers (language pairs with a high standard deviation relative to the mean), which is why 36 language pairs are taken into account in the analysis for 11 shared features. Further, for 12 shared features, we filter out monolingual pairs, since they do not present cross-lingual transfer effects.

50

Page 52: Master Thesis: Developing a Cross-Lingual Named Entity … · 2020-06-22 · To build a Cross-lingual Named Entity Recognition system (NER), we first need to transfer successfully

Shared features  Tested language pairs
12               17
11               36
7                39
5                38
2                30
1                14

Table 10: Number of tested language pairs per distance

For each distance, we run the language pairs on four different subsets (described in section 6.2), ending up with four iterations for each tested language pair.

6.6.2 Impact of individual typological features experiment

After considering the general effect of languages' typological characteristics on mBERT's cross-lingual learning, we take a closer look at the impact of individual typological features from the European Sprachbund. In this experiment, we find out for each feature in isolation whether it is learnt by a Transformer. We change the experiment design from language pairs to language groups. The languages serve here as representations of the feature they contain (belonging to the true group) or do not contain (belonging to the false group). Taking several languages at once mitigates the risk of other features having a possible effect on the results. To improve the quality of the experiments further, we do not simply take all languages; instead we choose languages that are most similar to each other, i.e., share most values of the features (see previous section), apart from the one we analyse in each data point.

To achieve that, we remove the investigated feature from the language vectors and represent them by the remaining 11 features in a common vector space.52 The closest k languages are identified by finding, for each language, the k closest neighbours via nearest-neighbour search (euclidean distance), and choosing the language and its k neighbours with the smallest distance over all neighbours. We decided to set k to 10, 15 and 20, looking at the number of true and false languages for each feature with the goal of having at least four languages representing each group. The chosen k-value for each feature is presented in Table 11, together with the corresponding average distances for each k-value. This gives us an idea of the extent to which the language groups represent the investigated feature.
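
A sketch of this neighbour search with scikit-learn; the names and data are illustrative, and it assumes k + 1 does not exceed the number of languages.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def closest_language_group(vectors, names, feature_idx, k):
    # Drop the investigated feature, then find the language whose k nearest
    # neighbours (euclidean distance) are closest overall.
    X = np.delete(np.asarray(vectors, dtype=float), feature_idx, axis=1)
    # k + 1 neighbours, because every language is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, indices = nn.kneighbors(X)
    best = int(np.argmin(distances.sum(axis=1)))  # smallest total distance
    return [names[i] for i in indices[best]]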

52 See how the vectors were created in section 6.6.1

51

Page 53: Master Thesis: Developing a Cross-Lingual Named Entity … · 2020-06-22 · To build a Cross-lingual Named Entity Recognition system (NER), we first need to transfer successfully

feature  k-value
1        15
2        25
3        15
4        10
5        20
6        15
7        20
8        15
9        15
10       20
11       20
12       15

k-value  average distance  rest features (avg.)
10       0.83              0.69
15       1.17              1.37
20       1.38              1.9

Table 11: Chosen k-value for each feature and average distance for each k

For feature two, k=25 was chosen, since most of the languages contain this feature and only a small number do not. The second table presents, for each k-value, the average euclidean distance the k language vectors share with each other in the common vector space. A euclidean distance of 0.83 for k=10 means, for example, that the chosen languages differ, apart from the investigated feature, in an average of only 0.69 of the remaining 11 features.

In general, we can see that the residual variation within the language group is low even for k=20: here the chosen languages differ, apart from the considered feature, in an average of only 1.9 of the remaining features. Again, we want to make sure that the chosen experiment groups are as representative as possible of the analysed features.

To keep the experiment settings consistent across all features, we try to create true and false language groups for each feature with a similar number of involved languages (not necessarily matching the chosen k-value). While selecting suitable languages, the ones with similar euclidean distance are kept together. Furthermore, where distances are the same across possible languages, the group is divided by geographical distance (assuming these languages are linguistically closer, we want to avoid the cross-lingual transfer benefitting from this shared similarity); i.e., Spn, Prt and Itl are kept together, as are Rus, Rom and Lit, to reduce other influences as much as possible.

In our experiment design we fine-tune twice - first on a mix of languages not having the feature, the false group (all features in the European Sprachbund are binary), and we evaluate on a mix of languages having that feature, the true group. In the second fine-tuning, we add true languages to the training set, while the evaluation set stays the same. As a result, we have one model trained on languages not having the feature and one trained on languages having the feature, as illustrated in Figure 17. Following hypothesis 3, we expect the F1 score to rise for the second fine-tuning if mBERT is language-dependent; otherwise it is agnostic, at least in that feature. Moreover, we run two experiments, swapping the true languages of the evaluation set with the true languages of the training set. Each is further run on 4 different language datasets (the cross-validation described in section 6.2.2). Effectively, we have 208 design points: 4 datasets * 2 experiments * 2 fine-tunings * 13 features.53

Figure 17: Experiment design for feature experiments

In order to make our experiments clean, we impose the following rules:

(a) Each fine-tuning training set has the same size of 17,003 sentences

(b) Each evaluation set has the same size - 4,350 sentences

(c) To create a language group, language datasets of 17,003 sentences are first concatenated and then the sentences are shuffled; as a result, all languages should have a similar contribution. All shuffling in the code is seeded, so the experiments are reproducible.

(d) In the second fine-tuning, true and false languages are in 50 to 50 proportion to ensureequal contribution.

53 Code for gathering all the results is in features_results/features_gathering_results.py


true lan.  false lan.  Eval. 1  F-T 1.1  F-T 1.2  Eval. 2  F-T 2.1  F-T 2.2
Spn        Pol         Spn      Pol      Pol      Rom      Pol      Pol
Prt        Sln         Prt      Sln      Sln      Alb      Sln      Sln
It         Ukr         It       Ukr      Ukr      Hng      Ukr      Ukr
Rom        Ltv                  Ltv      Ltv               Ltv      Ltv
Alb        Lit                  Lit      Lit               Lit      Lit
Hng        Scr                  Scr      Scr               Scr      Scr
           Blg                  Blg      Blg               Blg      Blg
                                         Rom                        Spn
                                         Alb                        Prt
                                         Hng                        It

Table 12: Language groups for feature 1 - experiment 1 and 2, k = 15

Above is an example table from the experiment design for feature 1. The first two columns list the languages having and not having the feature, respectively.^54 The next three columns belong to the first experiment: the evaluation set, the first fine-tuning training set, and the second fine-tuning training set. The first fine-tuning is always simply all false languages; the true languages are divided between the evaluation set and the second fine-tuning. In the second experiment we only swap the true languages: the group that earlier went to the second fine-tuning is now the evaluation set and vice versa.

^54 We are not using all available languages; see the previous section. Language abbreviations are explained in Appendices Table 21.


7 Results and analysis

7.1 Language-pairs experiments

Each score in the Results and Analysis chapter represents an average over four iterations on random subsets (17k sentences) of the original WikiANN dataset (see section 6.2.2 for further explanation). Across our experiments we evaluate with the BIO scheme, explained in section 6.3.4. All language abbreviations are explained in the Appendices, Table 21.
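For concreteness, here is a toy illustration of entity-level scoring under the BIO scheme, using the seqeval library (our choice for illustration; the scheme itself is defined in section 6.3.4 and the example tags are not from our data):

    from seqeval.metrics import f1_score

    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "O"]]

    # One of two gold entities is matched exactly: precision 1.0,
    # recall 0.5, so F1 = 2 * 1.0 * 0.5 / 1.5 ≈ 0.67.
    print(f1_score(gold, pred))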

7.1.1 Monolingual results

Figure 18: Monolingual results for mBERT and WikiANN dataset sizes (black line)

First, we examine the monolingual F1 scores for each language. From Figure 18, we observe that the scores do not show significant differences between the languages. Additionally, they are in line with the F1 scores on the large WikiANN datasets in the pre-experiments (section 6.4). Further, as indicated by the vertical black line at the top of the bars, all languages have a small standard deviation over the four iterations. The stable results confirm that the chosen subset size achieves a reasonable sampling of the WikiANN dataset.^55

Interestingly, English performance is the worst, while Tatar's result is the best. The black line plotted on the right y-axis displays the number of lines in each language's original WikiANN dataset (from section 6.2.2). English has, by far, the most extensive dataset (around 62k lines). More articles imply a greater variety of authors and topics, and require more substantial training for a system to learn the given information. Had we focused on training in English, parameters such as the number of epochs or the batch size could have been adjusted in mBERT to reach better monolingual results for English. However, since we focus on various language combinations, we reduce the impact of dataset size inequality by using English in only one language group within the feature experiments.

^55 The decision on the subset size is based on the size of our smallest WikiANN dataset (Breton); see section 6.2.2.


Our monolingual results provide good evidence that the cross-lingual results in the following experiments will be representative as well.

7.1.2 Cross-lingual results

Now we look at the cross-lingual results. We repeat the experiment from our pre-experiments on the CoNLL languages (English, Dutch, Spanish), but this time, instead of running it on a large dataset, we continue with our chosen subsets.

Fine-tuning/Eval   Eng     Dut     Spn
Eng                80.82   81.43   66.24
Dut                71.36   90.65   85.05
Spn                70.03   83.14   89.76

Table 13: F1 scores for WikiANN dataset for CoNLL languages

Except for the language pair Eng-Dut, all cross-lingual results are below their monolingual counterparts. The largest difference is obtained for a model fine-tuned on English and evaluated on Spanish sentences. However, the differences are not very big; in some instances they are only a few percentage points apart. On the other hand, we need to take into consideration that these three languages are linguistically related. We will see lower F1 scores for more distant languages in the subsequent experiments, showing mBERT's overall cross-lingual transfer limitations when benchmarked against monolingual performance. Additionally, the results are very similar to the scores derived from the original WikiANN datasets (see Table 8, section 6.4), showing again that trimming the dataset to 17k sentences was an appropriate design decision.

7.1.3 Correlation between shared ES features and the transfer quality

Here, we address the challenges of cross-lingual projection by using typological features from the European Sprachbund. The relation is shown in Figure 19 and the results are in Table 14.

56

Page 58: Master Thesis: Developing a Cross-Lingual Named Entity … · 2020-06-22 · To build a Cross-lingual Named Entity Recognition system (NER), we first need to transfer successfully

Figure 19: F1 scores and number of shared ES features

Shared ES-Features   F1 score Mean   F1 score Standard Deviation   Std over 4 it.
12                   76.94           10.74                         1.34
11                   71.03           11.80                         1.56
7                    69.96           10.29                         1.65
5                    67.56           12.21                         1.91
2                    69.33            8.06                         1.44
1                    67.72            8.34                         2.12

Table 14: Correlation between the number of shared ES features and the quality of the transfer between language pairs.

On the x-axis we plot the number of shared ES features between the languages of a pair, and on the y-axis the average F1 score over all language pairs with that number of shared features. The graph shows that the more features the languages share, the better the overall F1 score. The correlation shows one deviation: 2 shared features give a better result than 5 shared features. Nonetheless, there is a clear gap between sharing few features (1 or 2) and sharing many features (11 or 12), suggesting that mBERT's cross-lingual performance for language pairs depends on all or some of the typological ES features. The vertical bars show the standard deviation of the language pairs' F1 scores; from the table, the average is around 9.8%. The standard deviation is quite similar for every feature count, strengthening the validity of the observed correlation.
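A sketch of the aggregation behind Table 14 follows; the rows are dummy pair results and the shared-feature counts shown are placeholders, not our measured values.

    import pandas as pd

    pairs = pd.DataFrame(
        [("Eng", "Dut", 2, 81.43), ("Dut", "Spn", 11, 85.05),
         ("Eng", "Spn", 2, 66.24)],
        columns=["lan1", "lan2", "shared", "f1"],
    )
    summary = pairs.groupby("shared")["f1"].agg(["mean", "std"])
    # One row per number of shared ES features (std is NaN for
    # singleton groups in this toy example).
    print(summary)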


7.1.4 Case Study: Tatar

Of the 31 languages we included in our experiments, Tatar is the only one mBERT was not pre-trained on. Findings from the literature review showed that mBERT performs surprisingly well in zero-shot settings for languages it is not pre-trained on. We want to see whether the same applies to Tatar in our experiments. The results are presented in Table 15.

LAN 1   LAN 2   Shared ES-features   Δ mean   Δ std (over 4 iter.)
Tat     Arm     11                   -18.19    0.84
Tat     Eng      2                   -18.10    0.69
Tat     Ukr      7                   -11.60   -0.48
Trk     Tat     11                    -9.07    5.21
Eng     Tat      2                    -6.70    4.76
Tat     Fr       1                    -3.73   -0.42
Tat     Grk      2                    -3.28   -0.60
Fr      Tat      1                    -3.02    3.80
Tat     Trk     11                    -2.73    0.50
Tat     Grm      1                    -0.88   -0.03
Dut     Tat      2                     1.32    3.07
Tat     Cz       5                     1.41   -0.17
Tat     Dut      2                     1.42   -0.86
Grm     Tat      1                     2.63    2.32
Arm     Tat     11                     3.21    2.29
Tat     Swd      5                     4.82    0.63
Est     Tat     11                     5.15    3.81
Tat     Hng      5                     5.21   -1.29
Tat     Nor      5                     8.40   -0.20

Table 15: Results for Tatar

Table description: LAN 1 is the fine-tuning language and LAN 2 the evaluation language. Δ mean is the difference between the language pair's F1 score and the average F1 score of all language pairs at the same distance as the pair at hand. For instance, the distance of Tat-Arm is 1.0, and the average F1 score at this distance (1.0) over all language pairs is 71.03; so 52.84 (the F1 score for Tat-Arm) − 71.03 = −18.19. The same applies for the standard deviation over the 4 iterations.
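The Δ mean computation, restated as a worked example with the numbers from the table description:

    # Worked example of the Δ mean column (Tat-Arm, distance 1.0):
    f1_tat_arm = 52.84          # F1 score for the pair Tat-Arm
    avg_f1_at_distance = 71.03  # average F1 over all pairs at distance 1.0
    delta_mean = f1_tat_arm - avg_f1_at_distance
    print(round(delta_mean, 2))  # -18.19, the first row of Table 15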

From the previous subsection, we saw an average standard deviation over the F1 scores across the distances of around 9.8%. This is our benchmark when evaluating the difference between the two F1 scores (Δ mean) in both directions (positive/negative; the direction of the delta does not matter here, only the magnitude of the difference from the distance's mean score).


The results for the language pairs Tat-Arm, Tat-Eng and Tat-Ukr lie outside the benchmark, with F1 scores 18 and 11 percentage points below the average. Predominantly, mBERT's results are worse when we use Tatar for fine-tuning than when we use it as an evaluation language. Nonetheless, there is an equal number of other examples where Tatar is the fine-tuning language and the result is similar to other language pairs. Because Tatar also has one of the smallest underlying WikiANN datasets, a low variety of entities may explain those deviations. Overall, however, the results are similar to the F1 scores of other language pairs.

Table 15 also lists the Δ standard deviation (std) of the F1 scores over the four cross-validations. There is no correlation between Δ mean and Δ std, but Trk-Tat, Eng-Tat and Est-Tat have the highest standard deviations; in all these cases Tatar is the evaluation language. While the differences here are smaller than for Δ mean, the results are still worth considering against a different benchmark: the average Δ std across all language pairs is around 1.2%. In contrast to the very stable results for the other language pairs, when evaluating on Tatar there is a higher dependency on the overlap between the fine-tuning and evaluation samples, i.e. shared content may play a role.

Our experiments with Tatar therefore broadly confirm the findings from the literature review: mBERT also works well for languages it is not pre-trained on (in zero-shot settings). From the Tatar findings we can add that when Tatar takes the role of the evaluation language, results depend slightly more on the overlap between the fine-tuning and evaluation datasets. There is no conclusive answer as to what the overlap has to consist of.

7.1.5 Case Study: Different scripts

Another aspect we look at is the impact of different scripts, i.e. when two languages have no character overlap. This case study derives from the Masked Language Model objective mBERT uses, which should make mBERT independent of character representations (see also the probing experiments in section 5.2, showing mBERT is not constrained by script overlaps).

The languages from our experiments involve the following scripts: Latin, Greek, Arabic, Armenian, Georgian, and Cyrillic.^56 We filtered for the language pairs that do not share the same script and present, for each script, all possible combinations with other scripts and their corresponding F1 scores in Table 16.
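A minimal sketch of this filtering, with a small hypothetical excerpt of the language-to-script map (the full mapping is in Appendices Table 21):

    from itertools import permutations

    script = {"Eng": "Latin", "Rus": "Cyrillic",
              "Grk": "Greek", "Arm": "Armenian"}

    # Ordered pairs whose two languages use different scripts.
    cross_script = [(a, b) for a, b in permutations(script, 2)
                    if script[a] != script[b]]
    # e.g. ("Eng", "Rus"): fine-tune on a Latin-script language,
    # evaluate on a Cyrillic-script one.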

^56 An overview of languages and scripts can be found in Appendices Table 21.


script (LAN 1)   script (LAN 2)   no. of pairs   Δ mean    Δ std (F1 score)
Arabic           Armenian          1             -18.19     0.84
Arabic           Cyrillic          1             -11.60    -0.48
Arabic           Greek             1              -3.28    -0.60
Arabic           Latin            13               3.32    -0.39
Armenian         Arabic            1               3.21     2.29
Armenian         Cyrillic          2             -12.09     1.81
Armenian         Georgian          1              -6.86    -0.48
Armenian         Latin             8             -10.91     0.86
Cyrillic         Latin             7               5.35    -0.29
Georgian         Armenian          1             -20.20    -0.53
Georgian         Cyrillic          1              -4.93    -1.23
Georgian         Greek             1               3.04    -0.87
Georgian         Latin            10              -1.29     1.08
Greek            Georgian          1              -0.05    -0.72
Greek            Latin             3               2.03     1.08
Latin            Arabic            8               1.90     1.99
Latin            Armenian          5             -21.88     0.62
Latin            Cyrillic         18              -2.31     0.16
Latin            Georgian          5              -2.22    -0.68
Latin            Greek             3               7.01    -1.17

Table 16: Results for different scripts

Results are grouped by script pairs. Δ mean is the mean over the language pairs' deltas within the group; each language pair's delta is the pair's F1 score minus the average F1 score of the distance the pair belongs to (the same scheme as in the Tatar case study). Δ std is calculated in the same way but, in contrast to the Tatar table, shows the mean delta over the F1 standard deviations within the script group.

For the evaluation of accuracy across scripts, our benchmark is again the average standard deviation of around 9.8% across all distances for pair F1 scores. The script combinations Latin-Armenian, Georgian-Armenian and Arabic-Armenian show the largest differences, with deltas around -22%, -20% and -18%. No script combination shows unusual results in both directions, but all three of these transfer to the Armenian script. The Armenian script is represented in our set of languages by Armenian alone; unfortunately, we therefore have no cross-lingual score for a pair transferring from Armenian script to Armenian script to compare against. Transfers from Armenian to other scripts do not display remarkable differences in the results. Across all remaining combinations, the findings are mostly within the standard deviation of 9.8% or slightly above. Generally, our experiment results seem to be only mildly affected by different scripts across the chosen languages. The conspicuous scores for Armenian have too little support in our experiments to conclude that transferring to the Armenian script is difficult for mBERT.

7.2 ES Features experiments

Based on the identified correlation between F1 scores and the number of shared ES features, we now look at the impact of individual features, considered in isolation, on transfer across languages.

7.2.1 Results for each feature

The experiments in this section are undertaken on language groups based on the design described in section 6.6.2. The notation for the individual experiments follows that setup: Fine-Tuning 1 denotes fine-tuning on false languages (which do not have a given feature) and evaluation on true languages (which have it), while Fine-Tuning 2 fine-tunes on a combination of false and true languages (referred to as mixed). For each feature, the two experiments use different true-group splits between fine-tuning and evaluation (Experiment 1 / Experiment 2).

Figure 20: Features (F1 scores for Fine-Tuning 1 and Fine-Tuning 2)

Marked bars correspond to Fine-Tuning 1 (false to true), the others to Fine-Tuning 2 (mixed to true).

Figure 20 shows the resulting F1 scores for Experiment 1 and Experiment 2 for each ES feature.^57 Overall, we notice higher F1 scores in both fine-tunings compared to the language-pair results from the previous section. Multilingual BERT seems to benefit from being fine-tuned on more than one language in zero-shot settings. The joint training effect, already mentioned in the context of pre-training multilingual language representations (see section 5), also seems to affect the fine-tuning step.

^57 The feature numbers follow the same order as in section 6.1.1.


Looking closer at the histogram, we see that Experiment 1 and Experiment 2 do not show similar F1 scores in all cases. For example, for feature 2 and feature 6, Experiment 2 shows significantly lower F1 scores than Experiment 1. Other linguistic dependencies between the languages, beyond the considered feature, can affect the accuracy. Besides that, within each experiment there is no massive leap between the Fine-Tuning 1 and Fine-Tuning 2 scores, which is in line with the rather flat correlation identified earlier (Figure 19).

Figure 21: Features (relative difference in F1 scores between Fine-Tuning 1 and Fine-Tuning 2)

For some features, the two experiments are not consistent: one decreases and the other increases on the second fine-tuning (e.g. feature 6 or feature 7). We conclude that in these cases mBERT is probably not learning the given feature. The feature 9 results decrease in Fine-Tuning 2, which also indicates that the architecture is not incorporating this feature; here, other dependencies between the experimental groups must be influencing the results.
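One plausible reading of the relative difference plotted in Figure 21 (the exact definition used for the figure is our assumption):

    # Relative change from Fine-Tuning 1 to Fine-Tuning 2, per
    # (feature, experiment); the input dicts are hypothetical.
    def rel_diff(f1_ft1, f1_ft2):
        return {key: 100.0 * (f1_ft2[key] - f1_ft1[key]) / f1_ft1[key]
                for key in f1_ft1}

    print(rel_diff({("f5", "exp1"): 80.0},
                   {("f5", "exp1"): 84.0}))  # {('f5', 'exp1'): 5.0}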

Across the results, we can see that features 5, 8, 10 and 12 show the highest increase of all the ES features (features 2 and 4 show an even higher increase than feature 8 in Experiment 2, but less stability over the two experiments). They represent the following syntactical linguistic attributes:

Feature no.   Description
5             Participial passive
8             Negative pronouns and lack of verbal negation
10            Relative-based equative constructions
12            Intensifier-reflexive differentiation

Table 17: Typological ES features from section 6.1.1

Running our experiments also on feature 13, word order (the positioning of adjective and noun within a sentence), confirms the findings from section 5.1.


7.2.2 Repeating language pair experiment with 4 identified features

In the previous section, our results showed a correlation of mBERT's NER accuracy with features 5, 8, 10 and 12. We repeat our language-pair experiments, reducing the representation of the languages to these four features, expecting the correlation to show more clearly. Each experiment design point (shared features: 4, 2, 0) consists of 20 language pairs; the F1 score means are shown in Table 18.

Shared ES features   F1 score Mean   F1 score Standard Deviation
4                    69.98           2.04
2                    75.03           1.4
0                    70.48           1.67

Table 18: Language pair results with filtered features

Surprisingly, language pairs sharing all four features do not outperform pairs sharing two or none of them. From the standard deviation over the F1 scores, we assume that the subsets of 20 language pairs are representative for the design points, since the results are quite stable over the pairs belonging to one group. There must be other linguistic dependencies, not visible from the ES features, causing the correlation in the language-pair experiment. As Lin et al. (2019) [29] show, besides syntactical linguistic attributes, geographical and genetic affiliation also seem to play a role in cross-lingual NER.

7.3 Differences across multilingual transformer models

In the last section of this chapter, we compare our findings for mBERT to XLM-R, the other pre-trained multilingual Transformer model, by conducting the same language and feature experiments on it. For clarity, mBERT results are displayed in blue and XLM-R results in orange.

7.3.1 Monolingual results

Figure 22: Monolingual results for mBERT and XLM-R


From the monolingual results in Figure 22, we can see that XLM-R behaves similarly to mBERT in the ordering of the languages' F1 scores. English is still the language with the worst F1 score. In almost all cases, XLM-R scores are lower than those for mBERT. mBERT is pre-trained on Wikipedia articles (the same source as the WikiANN dataset), while XLM-R is pre-trained on CommonCrawl; on the other hand, XLM-R is trained on a much larger number of tokens. Nonetheless, the first aspect seems more influential for learning entities with the Transformer models, since mBERT and XLM-R showed similar results on CoNLL 2002/2003 in section 4.2.3.

7.3.2 Language-pairs experiment

We run the language-pair experiment on the same language pairs as for mBERT, with 0, 1, 7, 11 and 12 shared ES features, and plot the results against the same dimensions for mBERT in Figure 23.^58

Figure 23: Language distance results for mBERT and XLM-R

For XLM-R too, there is a correlation between the number of shared ES features and the F1 scores of language pairs. Compared to mBERT, XLM-R seems to have more difficulty transferring between languages that share no or only a few linguistic features. But also for language pairs with several feature values in common, XLM-R results fall behind those of mBERT, continuing the observations made on the monolingual results. Moreover, XLM-R has a slightly higher standard deviation over the F1 scores, with an average of around 11.7% (mBERT: 9.8%).

^58 Due to time limitations, 5 shared features were not taken into consideration as in Figure 19; the focus is instead on comparing XLM-R and mBERT.


7.3.3 ES features results

Since the language-pair experiment shows that XLM-R, just like mBERT, learns at least a subset of the ES features influencing cross-lingual transfer for NER between the languages, we redo the same feature experiments and consider whether XLM-R shows similar effects.

Figure 24: Features (F1 scores for Fine-Tuning 1 and Fine-Tuning 2), mBERT and XLM-R

mBERT in blue and XLM-R in orange; from right to left, Fine-Tuning 1 (dark colour) to Fine-Tuning 2 (light colour); each shown twice, for Experiment 1 and Experiment 2.

Again, the XLM-R F1 scores consistently fall behind those of mBERT. Besides that, similarly to mBERT, XLM-R's overall accuracy grows when we fine-tune it on more than one language (in Figure 24, results are mainly below the 70% mark).

Figure 25: Features (relative difference in F1 scores between Fine-Tuning 1 and Fine-Tuning 2), mBERT and XLM-R

Following the same analysis procedure as for mBERT in the previous section, we see that the XLM-R and mBERT accuracy results match for features 5, 8, 10, 12 and 13. Additionally, the learning of feature 7 is more pronounced for XLM-R. Moreover, XLM-R shows significant increases for features 1, 3 and 4, but unevenly, being stronger in Experiment 2. The difference between mBERT and XLM-R is most stark for feature 6: here, adding a language that has the feature to the fine-tuning group drops the F1 results in both experiments for XLM-R, while mBERT shows differences between Experiment 1 and 2.


Even if our last mBERT feature experiments^59 show that there must be other linguistic dependencies influencing the cross-lingual results, we find it interesting that XLM-R seems to learn the same features as mBERT, underlining the impact of these linguistic properties on cross-lingual NER tagging with Transformer models. XLM-R shows less stability in the results over Experiments 1 and 2. A possible reason could, once more, be the lack of overlap between the pre-training source (CommonCrawl) and the fine-tuning source (Wikipedia). Experiments from Wu et al. show that mismatched domains affect performance more in pre-training than in fine-tuning [46].

7.4 Summary of results

Our findings show that mBERT's cross-lingual NER F1 scores fall below its monolingual ones, and that it is therefore not language-agnostic (hypothesis 1). Further, we show that there is a correlation between the European Sprachbund features, which describe the main syntactical properties of Indo-European languages, and the entity-labelling accuracy across language combinations (hypothesis 2). Having disproved that mBERT is language-independent, we checked in the feature experiments for specific dependencies. The learning effect across the features ranged from -1.31% to 5.32%, demonstrating that mBERT might be learning syntactical properties (at least those covered by the ES features), and also that features or aspects beyond the ones we tested affect the results (hypothesis 3). Lastly, comparing the results from mBERT against XLM-R shows that XLM-R learnt a similar subset of the ES features (features 5, 8, 10, 12 and 13) for identifying entities, suggesting that our mBERT results are representative across Transformer models to some extent (hypothesis 4).

^59 I.e., reducing the language representation to four features; see Table 18.


8 Conclusion

8.1 Summary

This paper investigated multilingual Transformer models' ability to transfer cross-lingually between languages. This area of research is vital because cross-lingual systems scale well as the number of languages grows. The sources of this scalability are twofold. First, a single architecture is reused for many languages, as opposed to having custom technologies for every single language. Second, if we represent languages in a joint language space, we do not need to develop costly datasets for resource-poor languages, i.e. we overcome dataset scarcity. The early Direct Transfer models trained on many languages learnt little in comparison to their monolingual versions. While results from massively multilingually trained Transformers are much more promising, they still do not achieve the same results as the respective monolingual models. We suspect one reason is that they incorporate language-specific information, which limits their cross-lingual abilities.

By running two experiments on NER datasets in 31 languages with pretrained multilingual Transformer models, we first showed with a language-pairs experiment that Transformers are not language-agnostic on Indo-European languages and, second, that there is a correlation between syntactical linguistic features and their performance. Moreover, in the feature experiments we identified four features that, if present in the training set, seem to improve performance on the languages also containing these features, implying that Transformer models learn these language-specific properties. This holds for both mBERT and XLM-RoBERTa. In the context of cross-lingual NER, we need to keep in mind that while we ran our experiments on many languages, the feature space considered is relatively small. A side effect is that, when showing a system to be language-dependent in our feature experiment, we can only explain it with the features at hand. However, countless other features inevitably impact the results; many of them might not even be described by any system yet (not even the enormous WALS).

In conclusion, our experiments take one step further towards developing a universal NER system: through our feature experiments, we identified which language-specific information might be incorporated in mBERT and XLM-R. Moreover, we confirmed that using one model for several languages works better if those languages are similar.

8.2 Contribution

1. We used WikiANN datasets in 31 languages for our experiments (previously, CoNLL with English was mainly used for training)

2. We divided languages into groups according to European Sprachbund (ES) features


3. We demonstrated which ES features might have an impact on cross-lingual learning with multilingual Transformers for NER

4. We confirmed findings from 'How multilingual is multilingual BERT?' [38] and 'Do We Need Word Order Information for Cross-lingual Sequence Labeling' [31]: word order matters (it is incorporated in the weights as well)

5. Additionally, we showed the cross-lingual transfer ability for non-pretrained languages and across scripts in two case studies

6. We provided a classification of cross-lingual methods used in NLP (and for NER) and surveyed the work done on this topic over the last few years.


9 Addendum

While analysing our results, we came across a paper by Lauscher et al. (May 2020), which has an experimental domain comparable to ours, and a few other very recent publications related to our research question. They are not incorporated in this thesis, but for completeness we add a brief summary of their findings as an addendum.

Similarly to us, Lauscher et al. evaluate the Transformer models XLM-R and mBERT for NER on the WikiANN dataset (zero-shot), but on a subset of 12 languages. Their fine-tuning centres on English and on language representation vectors generated with LANG2VEC (based on linguistic features from the URIEL database [30]). They study language similarity, the impact of data size, the role of different tasks, and ways to improve zero-shot transfer to distant languages with a small amount of supervision. Their investigations of phonological and geographical properties show a high correlation for NER (Pearson range 0.75-0.78). Language family affiliation, inventory, and syntactical features also demonstrate an influence on cross-lingual learning (correlation range 0.23-0.56, where the latter number belongs to syntactical features, the same category we investigated).

The language’s datasize in the pre-training phase of mBERT (based on the size ofWikipedia articles for each language) present in opposite no visible correlation. Besidesthis, their few-shots results state out, just adding a few annotated sentences in the targetlanguage improves F1 scores significantly. For example NER-scores across all target lan-guages trained in one case on English and in the other on 10 languages improve by around8 %. But the increase to more languages flattens and the increase with 1000 languages incomparison to just train it on English is around 14 % [25]. Our feature experiments showedbetter F1 scores of fine-tuning on language groups as well even if we kept a zero-shot settings.

Another publication, from April 2020 by Groenwold et al., partially underlines our findings as well. Looking at the cross-lingual text-classification performance of eight models (inter alia mBERT) tested on morphological, syntactical and word-order features derived from the WALS database, they find no clear correlation with any single feature, but a generally stronger performance for language pairs of the same language family [14].

Furthermore, there seems to be a general recent interest in evaluating the multilingual ability of systems. For example, Hu et al. (April 2020) published a multilingual evaluation benchmark called XTREME, letting cross-lingual systems test their performance across 40 languages from 12 language families on 9 tasks in zero-shot conditions [17]. Another, from Liang et al. (April 2020), is called XGLUE and follows the same idea for 11 tasks on two different multilingual datasets [27].

Our research focused on the task of Named Entity Recognition; previous findings also showed the influence of different typological properties on cross-lingual performance, but difficulties in pointing out specific ones. Further, the results differ across NLP tasks [25]. It seems sensible to investigate further per task when developing a general system across different language understanding tasks, striving for language-independence and application to low-resource languages.^60

^60 Artetxe et al. (April 2020) also argue for this in [2].


References

[1] Wasi Uddin Ahmad, Zhisong Zhang, Xuezhe Ma, Eduard Hovy, Kai-Wei Chang, and Nanyun Peng. On difficulties of cross-lingual transfer with order differences: A case study on dependency parsing. arXiv preprint arXiv:1811.00570, 2018.

[2] Mikel Artetxe, Sebastian Ruder, Dani Yogatama, Gorka Labaka, and Eneko Agirre. A call for more rigor in unsupervised cross-lingual learning. arXiv preprint arXiv:2004.14958, 2020.

[3] Isabelle Augenstein, Leon Derczynski, and Kalina Bontcheva. Generalisation in named entity recognition: A quantitative analysis. Computer Speech & Language, 44:61–83, 2017.

[4] Emily M. Bender. On achieving and evaluating language-independence in NLP. Linguistic Issues in Language Technology, 6(3):1–26, 2011.

[5] Shreenivas Bharadwaj, Vinayak Athavale, Monik Pamecha, Ameya Prabhu, and Manish Shrivastava. Towards deep learning in Hindi NER: An approach to tackle the labelled data scarcity. arXiv preprint arXiv:1610.09756, 2016.

[6] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.

[7] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.

[8] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017.

[9] Ryan Cotterell and Kevin Duh. Low-resource named entity recognition with cross-lingual, character-level neural conditional random fields. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 91–96, 2017.

[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[11] Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568, 2014.


[12] Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.

[13] Gregory Grefenstette and Julien Nioche. Estimation of English and non-English language use on the WWW. arXiv preprint cs/0006032, 2000.

[14] Sophie Groenwold, Samhita Honnavalli, Lily Ou, Aesha Parekh, Sharon Levy, Diba Mirza, and William Yang Wang. Evaluating the role of language typology in transformer-based multilingual text classification. arXiv preprint arXiv:2004.13939, 2020.

[15] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.

[16] Martin Haspelmath. The European linguistic area: Standard Average European. 2, 2001.

[17] Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080, 2020.

[18] Alankar Jain, Bhargavi Paranjape, and Zachary C. Lipton. Entity projection via machine translation for cross-lingual NER. arXiv preprint arXiv:1909.05356, 2019.

[19] Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. Cross-lingual ability of multilingual BERT: An empirical study, 2019.

[20] Daniel Kondratyuk. 75 languages, 1 model: Parsing Universal Dependencies universally. arXiv preprint arXiv:1904.02099, 2019.

[21] Sneha Reddy Kudugunta, Ankur Bapna, Isaac Caswell, Naveen Arivazhagan, and Orhan Firat. Investigating multilingual NMT representations at scale. arXiv preprint arXiv:1909.02197, 2019.

[22] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360, 2016.

[23] Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291, 2019.

[24] Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. Word translation without parallel data. 2018.


[25] Anne Lauscher, Vinit Ravishankar, Ivan Vulic, and Goran Glavas. From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual transformers. arXiv preprint arXiv:2005.00633, 2020.

[26] Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 2020.

[27] Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fenfei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, et al. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. arXiv preprint arXiv:2004.01401, 2020.

[28] Jindrich Libovicky, Rudolf Rosa, and Alexander Fraser. On the language neutrality of pre-trained multilingual representations. arXiv preprint arXiv:2004.05160, 2020.

[29] Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, et al. Choosing transfer languages for cross-lingual learning. arXiv preprint arXiv:1905.12688, 2019.

[30] Patrick Littell, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 8–14, Valencia, Spain, April 2017. Association for Computational Linguistics.

[31] Zihan Liu and Pascale Fung. Do we need word order information for cross-lingual sequence labeling. arXiv preprint arXiv:2001.11164, 2020.

[32] Monica Marrero, Julian Urbano, Sonia Sanchez-Cuadrado, Jorge Morato, and Juan Miguel Gomez-Berbis. Named entity recognition: fallacies, challenges and opportunities. Computer Standards & Interfaces, 35(5):482–489, 2013.

[33] Stephen Mayhew, Chen-Tse Tsai, and Dan Roth. Cheap translation for cross-lingual named entity recognition. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2536–2545, 2017.

[34] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[35] Jian Ni, Georgiana Dinu, and Radu Florian. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. arXiv preprint arXiv:1707.02483, 2017.


[36] Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, 2017.

[37] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

[38] Telmo Pires, Eva Schlinger, and Dan Garrette. How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502, 2019.

[39] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

[40] Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 147–155, 2009.

[41] Sebastian Ruder, Ivan Vulic, and Anders Søgaard. A survey of cross-lingual word embedding models. arXiv preprint arXiv:1706.04902, 2017.

[42] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

[43] Dipanjan Sarkar. Text Analytics with Python: A Practitioner's Guide to Natural Language Processing. Apress, 2019.

[44] Chen-Tse Tsai, Stephen Mayhew, and Dan Roth. Cross-lingual named entity recognition via wikification. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 219–228, 2016.

[45] Oscar Tackstrom, Ryan McDonald, and Jakob Uszkoreit. Cross-lingual word clusters for direct transfer of linguistic structure. In NAACL, pages 477–487, 2012.

[46] Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. Emerging cross-lingual structure in pretrained language models. arXiv preprint arXiv:1911.01464, 2019.

[47] Shijie Wu and Mark Dredze. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. arXiv preprint arXiv:1904.09077, 2019.


[48] Yonghui Wu, Min Jiang, Jun Xu, Degui Zhi, and Hua Xu. Clinical named entity recognition using deep learning models. In AMIA Annual Symposium Proceedings, volume 2017, page 1812. American Medical Informatics Association, 2017.

[49] Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A. Smith, and Jaime Carbonell. Neural cross-lingual named entity recognition with minimal resources. arXiv preprint arXiv:1808.09861, 2018.
