WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
Holger Schwenk, Facebook AI
Vishrav Chaudhary, Facebook AI
Shuo Sun, Johns Hopkins
Hongyu Gong, University of Illinois
Francisco Guzman, Facebook AI
Abstract
We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 85 languages, including several dialects or low-resource languages. We do not limit the extraction process to alignments with English, but systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. This corpus of parallel sentences is freely available.1
To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting to train MT systems between distant languages without the need to pivot through English.
1 Introduction
Most of the current approaches in Natural Language Processing (NLP) are data-driven. The size of the resources used for training is often the primary concern, but the quality and a large variety of topics may be equally important. Monolingual texts are usually available in huge amounts for many topics and languages. However, multilingual resources, typically sentences in two languages which are mutual translations, are more limited, in particular when the two languages do not involve English. An important source of parallel texts are international organizations like the European Parliament (Koehn, 2005) or the United Nations (Ziemski et al., 2016). These are professional human translations, but they are in a more formal language and tend to be limited to political topics. There are several projects relying on volunteers to provide translations for public texts, e.g. news commentary (Tiedemann, 2012), OpenSubtitles (Lison and Tiedemann, 2016) or the TED corpus (Qi et al., 2018).

1 https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
Wikipedia is probably the largest free multilingual resource on the Internet. The content of Wikipedia is very diverse and covers many topics. Articles exist in more than 300 languages. Some content on Wikipedia was human translated from an existing article into another language, not necessarily from or into English. These translated articles may later have been edited independently and are no longer parallel. Wikipedia strongly discourages the use of unedited machine translation,2 but the existence of such articles cannot be totally excluded. Many articles have been written independently, but may nevertheless contain sentences which are mutual translations. This makes Wikipedia a very appropriate resource to mine for parallel texts for a large number of language pairs. To the best of our knowledge, this is the first work to process the entire Wikipedia and systematically mine for parallel sentences in all language pairs. We hope that this resource will be useful for several research areas and enable the development of NLP applications for more languages.
In this work, we build on a recent approach to mine parallel texts based on a distance measure in a joint multilingual sentence embedding space (Schwenk, 2018; Artetxe and Schwenk, 2018a). For this, we use the freely available LASER toolkit3 which provides a language-agnostic sentence encoder trained on 93 languages (Artetxe and Schwenk, 2018b).
2https://en.wikipedia.org/wiki/Wikipedia:Translation
3https://github.com/facebookresearch/LASER
We approach the computational challenge of mining in almost six hundred million sentences by using fast indexing and similarity search algorithms.
The paper is organized as follows. In the next section, we first discuss related work. We then summarize the underlying mining approach. Section 4 describes in detail how we applied this approach to extract parallel sentences from Wikipedia in 1620 language pairs. To assess the quality of the extracted bitexts, we train NMT systems for a subset of language pairs and evaluate them on the TED corpus (Qi et al., 2018) for 45 languages. These results are presented in Section 5. The paper concludes with a discussion of future research directions.
2 Related work
There is a large body of research on mining parallel sentences in collections of monolingual texts, usually named "comparable corpora". Initial approaches to bitext mining have relied on heavily engineered systems often based on metadata information, e.g. (Resnik, 1999; Resnik and Smith, 2003). More recent methods explore the textual content of the comparable documents. For instance, it was proposed to rely on cross-lingual document retrieval, e.g. (Utiyama and Isahara, 2003; Munteanu and Marcu, 2005) or machine translation, e.g. (Abdul-Rauf and Schwenk, 2009; Bouamor and Sajjad, 2018), typically to obtain an initial alignment that is then further filtered. In the shared task for bilingual document alignment (Buck and Koehn, 2016), many participants used techniques based on n-gram or neural language models, neural translation models and bag-of-words lexical translation probabilities for scoring candidate document pairs. The STACC method uses seed lexical translations induced from IBM alignments, which are combined with set expansion operations to score translation candidates through the Jaccard similarity coefficient (Etchegoyhen and Azpeitia, 2016; Azpeitia et al., 2017, 2018). Using multilingual noisy web-crawls such as ParaCrawl4 for filtering good quality sentence pairs has been explored in the shared tasks for high resource (Koehn et al., 2018) and low resource (Koehn et al., 2019) languages.
In this work, we rely on massively multilingual sentence embeddings and margin-based mining in the joint embedding space, as described in Schwenk (2018) and Artetxe and Schwenk (2018a,b).
4http://www.paracrawl.eu/
This approach has also proven to perform best in a low-resource scenario (Chaudhary et al., 2019; Koehn et al., 2019). Closest to this approach is the research described in Espana-Bonet et al. (2017); Hassan et al. (2018); Guo et al. (2018); Yang et al. (2019). However, in all these works, only bilingual sentence representations have been trained. Such an approach does not scale to many languages, in particular when considering all possible language pairs in Wikipedia. Finally, related ideas have also been proposed in Bouamor and Sajjad (2018) or Gregoire and Langlais (2017). However, in those works, mining is not solely based on multilingual sentence embeddings, but these are part of a larger system. To the best of our knowledge, this work is the first one that applies the same mining approach to all combinations of many different languages, written in more than twenty different scripts.
Wikipedia is arguably the largest comparable corpus. One of the first attempts to exploit this resource was performed by Adafre and de Rijke (2006). An MT system was used to translate Dutch sentences into English and to compare them with the English texts. This method yielded several hundred Dutch/English parallel sentences. Later, a similar technique was applied to the Persian/English pair (Mohammadi and GhasemAghaee, 2010). Structural information in Wikipedia such as the topic categories of documents was used in the alignment of multilingual corpora (Otero and Lopez, 2010). In another work, the mining approach of Munteanu and Marcu (2005) was applied to extract large corpora from Wikipedia in sixteen languages (Smith et al., 2010). Otero et al. (2011) measured the comparability of Wikipedia corpora by translation equivalents in three languages: Portuguese, Spanish, and English. Patry and Langlais (2011) came up with a set of features such as Wikipedia entities to recognize parallel documents, but their approach was limited to a bilingual setting. Tufis et al. (2013) proposed an approach to mine parallel sentences from Wikipedia textual content, but they only considered high-resource languages, namely German, Spanish and Romanian paired with English. Tsai and Roth (2016) grounded multilingual mentions to English Wikipedia by training cross-lingual embeddings on twelve languages.
Gottschalk and Demidova (2017) searched for parallel text passages in Wikipedia by comparing their named entities and time expressions. Finally, Aghaebrahimian (2018) proposes an approach based on bilingual BiLSTM sentence encoders to mine German, French and Persian parallel texts with English. Parallel data consisting of aligned Wikipedia titles has been extracted for twenty-three languages.5 Since Wikipedia titles are rarely entire sentences with a subject, verb and object, it seems that only modest improvements were observed when adding this resource to the training material of NMT systems.
We are not aware of other attempts to systematically mine for parallel sentences in the textual content of Wikipedia for a large number of languages.
3 Distance-based mining approach
The underlying idea of the mining approach used in this work is to first learn a multilingual sentence embedding, i.e. an embedding space in which semantically similar sentences are close, independently of the language they are written in. This means that the distance in that space can be used as an indicator of whether two sentences are mutual translations or not. Using a simple absolute threshold on the cosine distance was shown to achieve competitive results (Schwenk, 2018). However, it has been observed that an absolute threshold on the cosine distance is globally not consistent, e.g. (Guo et al., 2018). The difficulty of selecting one global threshold is amplified in our setting since we are mining parallel sentences for many different language pairs.
3.1 Margin criterion
The alignment quality can be substantially improved by using a margin criterion instead of an absolute threshold (Artetxe and Schwenk, 2018a). In that work, the margin between two candidate sentences x and y is defined as the ratio between the cosine similarity of the two sentence embeddings and the average cosine similarity of their nearest neighbors in both directions:

\[
\mathrm{margin}(x, y) = \frac{\cos(x, y)}{\sum_{z \in \mathrm{NN}_k(x)} \frac{\cos(x, z)}{2k} + \sum_{z \in \mathrm{NN}_k(y)} \frac{\cos(y, z)}{2k}} \qquad (1)
\]

5 https://linguatools.org/tools/corpora/wikipedia-parallel-titles-corpora/
where NN_k(x) denotes the k unique nearest neighbors of x in the other language, and analogously for NN_k(y). We used k = 4 in all experiments.
We follow the "max" strategy as described in Artetxe and Schwenk (2018a): the margin is first calculated in both directions for all sentences in languages L1 and L2. We then create the union of these forward and backward candidates. Candidates are sorted and pairs with source or target sentences which were already used are omitted. We then apply a threshold on the margin score to decide whether two sentences are mutual translations or not. Note that with this technique, we always get the same aligned sentences, independently of the mining direction, e.g. searching translations of French sentences in a German corpus, or in the opposite direction. The reader is referred to Artetxe and Schwenk (2018a) for a detailed comparison with related work.
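As an illustration, the following minimal Python sketch computes the margin of Equation 1 for one candidate pair. The embeddings and neighbor similarities are invented toy values; in the real pipeline the k-nearest-neighbor similarities come from the similarity search described in Section 3.3.

import numpy as np

def margin_score(cos_xy, nn_cos_x, nn_cos_y, k=4):
    # Eq. 1: cos(x, y) divided by the average cosine similarity
    # of the k nearest neighbors of x and of y
    denom = nn_cos_x[:k].sum() / (2 * k) + nn_cos_y[:k].sum() / (2 * k)
    return cos_xy / denom

# toy example with unit-normalized embeddings so that the dot product equals the cosine
rng = np.random.default_rng(0)
x, y = rng.standard_normal((2, 1024))
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)
nn_cos_x = np.array([0.60, 0.55, 0.50, 0.45])   # cos(x, z) for z in NN_k(x)
nn_cos_y = np.array([0.62, 0.50, 0.48, 0.40])   # cos(y, z) for z in NN_k(y)
print(margin_score(float(x @ y), nn_cos_x, nn_cos_y, k=4))

A candidate pair is kept if this score exceeds the threshold discussed in Section 4.2.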
The complexity of a distance-based mining approach is O(N × M), where N and M are the number of sentences in each monolingual corpus. This makes a brute-force approach with exhaustive distance calculations intractable for large corpora. Margin-based mining was shown to significantly outperform the state of the art on the shared task of the workshop on Building and Using Comparable Corpora (BUCC) (Artetxe and Schwenk, 2018a). The corpora in the BUCC corpus are rather small: at most 567k sentences.
The languages with the largest Wikipedia are English and German with 134M and 51M sentences, respectively, after pre-processing (see Section 4.1 for details). This would require 6.8 × 10^15 distance calculations.6 We show in Section 3.3 how to tackle this computational challenge.
3.2 Multilingual sentence embeddings

Distance-based bitext mining requires a joint sentence embedding for all the considered languages.
6 Strictly speaking, Cebuano and Swedish are larger than German, yet mostly consist of template/machine-translated text: https://en.wikipedia.org/wiki/List_of_Wikipedias
[Figure 1: encoder/decoder diagram. BPE embeddings feed stacked BiLSTM encoder layers; max pooling over the encoder output states yields the sentence embedding, which, together with the language ID and BPE embeddings, conditions an LSTM decoder with softmax outputs.]

Figure 1: Architecture of the system used to train massively multilingual sentence embeddings. See Artetxe and Schwenk (2018b) for details.
One may be tempted to train a bilingual embedding for each language pair, e.g. (Espana-Bonet et al., 2017; Hassan et al., 2018; Guo et al., 2018; Yang et al., 2019), but this is difficult to scale to the thousands of language pairs present in Wikipedia. Instead, we chose to use one single massively multilingual sentence embedding for all languages, namely the one provided by the open-source LASER toolkit (Artetxe and Schwenk, 2018b). Training one joint multilingual embedding on many languages at once also has the advantage that low-resource languages can benefit from their similarity to other languages in the same language family. For example, we were able to mine parallel data for several (minority) Romance languages like Aragonese, Lombard, Mirandese or Sicilian although data in those languages was not used to train the multilingual LASER embeddings.
The underlying idea of LASER is to train a sequence-to-sequence system on many language pairs at once using a shared BPE vocabulary and a shared encoder for all languages. The sentence representation is obtained by max-pooling over all encoder output states. Figure 1 illustrates this approach. The reader is referred to Artetxe and Schwenk (2018b) for a detailed description.
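The max-pooling step itself is easy to reproduce. The following PyTorch sketch uses toy dimensions, chosen so that the pooled vector is 1024-dimensional like LASER's embeddings; the actual LASER encoder is a deeper, larger model trained on 93 languages.

import torch
import torch.nn as nn

# toy dimensions only; not the actual LASER configuration
vocab_size, emb_dim, hidden = 1000, 320, 512
embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.LSTM(emb_dim, hidden, num_layers=1,
                  bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab_size, (2, 12))   # (batch, seq_len) of BPE token ids
states, _ = encoder(embed(tokens))               # (batch, seq_len, 2 * hidden)
sent_emb = states.max(dim=1).values              # max pooling over time
print(sent_emb.shape)                            # torch.Size([2, 1024])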
3.3 Fast similarity search
Fast large-scale similarity search is an area with a large body of research. Traditionally, the application domain is image search, but the algorithms are generic and can be applied to any type of vectors. In this work, we use the open-source FAISS library7 which implements highly efficient algorithms to perform similarity search on billions of vectors (Johnson et al., 2017). An additional advantage is that FAISS has support to run on multiple GPUs.
7https://github.com/facebookresearch/faiss
Our sentence representations are 1024-dimensional. This means that the embeddings of all English sentences require 134 · 10^6 × 1024 × 4 = 513 GB of memory. Therefore, dimensionality reduction and data compression are needed for efficient search. In this work, we chose a rather aggressive compression based on a 64-bit product-quantizer (Jegou et al., 2011), and partitioning the search space into 32k cells. This corresponds to the index type "OPQ64,IVF32768,PQ64" in FAISS terms.8 Another interesting compression method is scalar quantization. A detailed comparison is left for future research. We build and train one FAISS index for each language.
The compressed FAISS index for English requires only 9.2 GB, i.e. more than fifty times smaller than the original sentence embeddings. This makes it possible to load the whole index on a standard GPU and to run the search in a very efficient way on multiple GPUs in parallel, without the need to shard the index. The overall mining process for German/English requires less than 3.5 hours on 8 GPUs, including the nearest neighbor search in both directions and scoring all candidates.
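For reference, a minimal sketch of this index type with the FAISS Python API is given below. Corpus sizes are toy values and the embeddings are random; they are L2-normalized so that nearest neighbors under L2 distance coincide with nearest neighbors under cosine similarity. The nprobe value is an illustrative choice, not a parameter reported in the paper.

import faiss
import numpy as np

d = 1024                                              # LASER embeddings are 1024-dimensional
xb = np.random.rand(200_000, d).astype("float32")     # toy target-language embeddings
xq = np.random.rand(1_000, d).astype("float32")       # toy source-language queries
faiss.normalize_L2(xb)                                # unit vectors: smallest L2 <=> largest cosine
faiss.normalize_L2(xq)

index = faiss.index_factory(d, "OPQ64,IVF32768,PQ64")  # index string used in this work
index.train(xb)        # FAISS may warn that 32k cells is oversized for a corpus this small
index.add(xb)
faiss.extract_index_ivf(index).nprobe = 16            # number of cells visited per query
dist, ids = index.search(xq, 4)                       # k=4 neighbors, as used by the margin criterion
cos = 1.0 - 0.5 * dist                                # approximate cosine from squared L2 on unit vectors

One such index is built and trained per language; mining a language pair then amounts to querying each side against the other side's index and scoring the retrieved candidates with the margin criterion.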
4 Bitext mining in Wikipedia
For each Wikipedia article, it is possible to get the link to the corresponding article in other languages. This could be used to mine sentences limited to the respective articles. On the one hand, this local mining has several advantages: 1) mining is very fast since each article usually has only a few hundred sentences; 2) it seems reasonable to assume that a translation of a sentence is more likely to be found in the same article than anywhere else in the whole Wikipedia.
8https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
On the other hand, we hypothesize that the margin criterion will be less efficient since one article usually has few sentences which are similar. This may lead to many sentences in the overall mined corpus of the type "NAME was born on DATE in CITY", "BUILDING is a monument in CITY built on DATE", etc. Although those alignments may be correct, we hypothesize that they are of limited use to train an NMT system, in particular when they are too frequent. In general, there is a risk that we will get sentences which are close in structure and content.
The other option is to consider the whole Wikipedia for each language: for each sentence in the source language, we mine in all target sentences. This global mining has several potential advantages: 1) we can try to align two languages even though there are only few articles in common; 2) many short sentences which only differ by the named entities are likely to be excluded by the margin criterion. A drawback of this global mining is a potentially increased risk of misalignment and a lower recall.
In this work, we chose the global mining option. This will allow us to scale the same approach to other, potentially huge, corpora for which document-level alignments are not easily available, e.g. Common Crawl. An in-depth comparison of local and global mining (on Wikipedia) is left for future research.
4.1 Corpus preparation
Extracting the textual content of Wikipedia articles in all languages is a rather challenging task, i.e. removing all tables, pictures, citations, footnotes or formatting markup. There are several ways to download Wikipedia content. In this study, we use the so-called CirrusSearch dumps since they directly provide the textual content without any meta information.9 We downloaded this dump in March 2019. A total of about 300 languages are available, but the size obviously varies a lot between languages. We applied the following processing:
• extract the textual content;
• split the paragraphs into sentences;
• remove duplicate sentences;
9https://dumps.wikimedia.org/other/cirrussearch/
L1 (French):  Ceci est une très grande maison

L2 (German):  Das ist ein sehr großes Haus
              This is a very big house
              Ez egy nagyon nagy ház
              Ini rumah yang sangat besar

Table 1: Illustration of how sentences in the wrong language can hurt the alignment process with a margin criterion. See text for a detailed discussion.
• perform language identification and remove sentences which are not in the expected language (usually citations or references to texts in another language).
It should be pointed out that sentence segmentation is not a trivial task, with many exceptions and specific rules for the various languages. For instance, it is rather difficult to make an exhaustive list of common abbreviations for all languages. In German, points are used after numbers in enumerations, but numbers may also appear at the end of sentences. Some languages, notably Thai, do not use specific symbols to mark the end of a sentence. We are not aware of a reliable and freely available sentence segmenter for Thai and we had to exclude that language. We used the freely available Python tool SegTok10 which has specific rules for 24 languages. Regular expressions were used for most of the Asian languages, falling back to English rules for the remaining languages. This gives us 879 million sentences in 300 languages. The margin criterion to mine for parallel data requires that the texts do not contain duplicates. This removes about 25% of the sentences.11
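As a concrete illustration of the splitting and deduplication steps, here is a minimal sketch using the segtok package mentioned above; the input paragraph is an invented example, not taken from Wikipedia.

from segtok.segmenter import split_single

paragraph = ("Dr. Smith presented the system on Jan. 5. "
             "It mines parallel sentences. It mines parallel sentences.")

sentences = [s.strip() for s in split_single(paragraph) if s.strip()]

# remove exact duplicates while keeping the original order,
# as required by the margin criterion
seen, unique = set(), []
for s in sentences:
    if s not in seen:
        seen.add(s)
        unique.append(s)
print(unique)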
LASER's sentence embeddings are totally language agnostic. This has the side effect that sentences in other languages (e.g. citations or quotes) may be considered closer in the embedding space than a potential translation in the target language. Table 1 illustrates this problem. The algorithm would not select the German sentence although it is a perfect translation. The sentences in the other languages are also valid translations, which would yield a very small margin. To avoid this problem, we perform language identification (LID) on all sentences and remove those which are not in the expected language.
10 https://pypi.org/project/segtok/
11 The Cebuano and Waray Wikipedia were largely created by a bot and contain more than 65% of duplicates.
[Figure 2: two plots of BLEU (solid lines, left axis) and mined bitext size in thousands (dashed lines, right axis) as a function of the margin threshold (1.03 to 1.06); the left panel shows de-en and de-fr, the right panel cs-de and cs-fr.]

Figure 2: BLEU scores (continuous lines) for several NMT systems trained on bitexts extracted from Wikipedia for different margin thresholds. The sizes of the mined bitexts are depicted as dashed lines.
LID is performed with fasttext12 (Joulin et al., 2016). fasttext does not support all the 300 languages present in Wikipedia and we disregarded the missing ones (which typically have only few sentences anyway). After deduplication and LID, we are left with 595M sentences in 182 languages. English accounts for 134M sentences, and German with 51M sentences is the second largest language. The sizes for all languages are given in Tables 3 and 5.
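A minimal sketch of this filtering step with the fasttext Python bindings is shown below. It assumes the pretrained LID model lid.176.bin has been downloaded from the fastText website; the example sentences are invented.

import fasttext

lid = fasttext.load_model("lid.176.bin")   # pretrained 176-language LID model (assumed downloaded)

def keep_expected_language(sentences, expected="de"):
    # drop sentences whose predicted language differs from the Wikipedia edition
    kept = []
    for s in sentences:
        labels, _ = lid.predict(s.replace("\n", " "))
        if labels[0] == f"__label__{expected}":
            kept.append(s)
    return kept

print(keep_expected_language(
    ["Das ist ein sehr großes Haus", "This is a very big house"], expected="de"))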
4.2 Threshold optimization
Artetxe and Schwenk (2018a) optimized their mining approach for each language pair on a provided corpus of gold alignments. This is not possible when mining Wikipedia, in particular when considering many language pairs. In this work, we use an evaluation protocol inspired by the WMT shared task on parallel corpus filtering for low-resource conditions (Koehn et al., 2019): an NMT system is trained on the extracted bitexts, for different thresholds, and the resulting BLEU scores are compared. We choose newstest2014 of the WMT evaluations since it provides an N-way parallel test set for English, French, German and Czech. We favoured translation between two morphologically rich languages from different families and considered the following language pairs: German/English, German/French, Czech/German and Czech/French. The size of the mined bitexts is in the range of 100k to more than 2M sentences (see Table 2 and Figure 2). We did not try to optimize the architecture of the NMT system to the size of the bitexts and used the same architecture for all systems: the encoder and decoder are 5-layer transformer models as implemented in fairseq (Ott et al., 2019).
12https://fasttext.cc/docs/en/language-identification.html
The goal of this study is not to develop the best performing NMT system for the considered language pairs, but to compare different mining parameters.
The evolution of the BLEU score as a function of the margin threshold is given in Figure 2. Decreasing the threshold naturally leads to more mined data; we observe an exponential increase of the data size. The performance of the NMT systems trained on the mined data changes as expected, in a surprisingly smooth way. The BLEU score first improves with increasing amounts of available training data, reaches a maximum and then decreases since the additional data gets more and more noisy, i.e. contains wrong translations. It is also not surprising that a careful choice of the margin threshold is more important in a low-resource setting, where every additional parallel sentence counts. According to Figure 2, the optimal value of the margin threshold seems to be 1.05 when many sentences can be extracted, in our case German/English and German/French. When less parallel data is available, i.e. Czech/German and Czech/French, a value in the range of 1.03–1.04 seems to be a better choice. Aiming at one threshold for all language pairs, we chose a value of 1.04. It seems to be a good compromise for most language pairs. However, for the open release of this corpus, we provide all mined sentences with a margin of 1.02 or better. This enables end users to choose an optimal threshold for their particular applications. However, it should be emphasized that we do not expect that many sentence pairs with a margin as low as 1.02 are good translations.
For comparison, we also trained NMT systems on the Europarl corpus V7 (Koehn, 2005), i.e. professional human translations, first on all available data, and then on the same number of sentences as the mined ones (see Table 2).
Bitexts                          de-en         de-fr         cs-de         cs-fr
Europarl (all data)              1.9M  21.5    1.9M  23.6    568k  14.9    627k  21.5
Europarl (same size as mined)    1.0M  21.2    370k  21.1    200k  12.6    220k  19.2
Mined Wikipedia                  1.0M  24.4    372k  22.7    201k  13.1    219k  16.3
Europarl + Wikipedia             3.0M  25.5    2.3M  25.6    768k  17.7    846k  24.0

Table 2: Comparison of NMT systems trained on the Europarl corpus and on bitexts automatically mined from Wikipedia by our approach at a threshold of 1.04. For each system we give the number of sentences and the BLEU score on newstest2014.
With the exception of Czech/French, we were able to achieve better BLEU scores with the automatically mined bitexts from Wikipedia than with Europarl of the same size. Adding the mined text to the full Europarl corpus also leads to further improvements of 1.1 to 3.1 BLEU. We argue that this is a good indicator of the quality of the automatically extracted parallel sentences.
5 Result analysis
We ran the alignment process for all possible combinations of languages in Wikipedia. This yielded 1620 language pairs for which we were able to mine at least ten thousand sentences. Remember that mining L1 → L2 is identical to L2 → L1, and each pair is counted only once. We propose to analyze and evaluate the extracted bitexts in two ways. First, we discuss the amount of extracted sentences (Section 5.1). We then turn to a qualitative assessment by training NMT systems for all language pairs with more than twenty-five thousand mined sentences (Section 5.2).
5.1 Quantitative analysis

Due to space limits, Table 3 summarizes the number of extracted parallel sentences only for languages which have a total of at least five hundred thousand parallel sentences (with all other languages, at a margin threshold of 1.04). Additional results are given in Table 5 in the Appendix.
There are many reasons which can influence the number of mined sentences. Obviously, the larger the monolingual texts, the more likely it is to mine many parallel sentences. Not surprisingly, we observe that more sentences could be mined when English is one of the two languages. Let us point out some languages for which it is usually not obvious to find parallel data with English, namely Indonesian (1M), Hebrew (545k), Farsi (303k) or Marathi (124k sentences). The largest mined texts not involving English are Russian/Ukrainian (2.5M), Catalan/Spanish (1.6M), between the Romance languages French, Spanish, Italian and Portuguese (480k–923k), and German/French (626k).
It is striking that we were able to mine more sentences when Galician and Catalan are paired with Spanish than with English. On the one hand, this could be explained by the fact that LASER's multilingual sentence embeddings may be better since the involved languages are linguistically very similar. On the other hand, it could be that the Wikipedia articles in both languages share a lot of content, or are obtained by mutual translation.
Services from the European Commission provide human translations of (legal) texts in all the 24 official languages of the European Union. This N-way parallel corpus enables training MT systems to directly translate between these languages, without the need to pivot through English. This is usually not the case when translating between other major languages, for example in Asia. Let us list some interesting language pairs for which we were able to mine more than one hundred thousand sentences: Korean/Japanese (222k), Russian/Japanese (196k), Indonesian/Vietnamese (146k), or Hebrew/Romance languages (120k–150k sentences).
Overall, we were able to extract at least ten thousand parallel sentences for 85 different languages.13 For several low-resource languages, we were able to extract more parallel sentences with languages other than English. These include, among others, Aragonese with Spanish, Lombard with Italian, Breton with several Romance languages, Western Frisian with Dutch, Luxembourgish with German, and Egyptian Arabic and Wu Chinese with the respective major language.
Finally, Cebuano (ceb) clearly falls apart: it has a rather huge Wikipedia (17.9M filtered sentences), but most of it was generated by a bot, as for the Waray language.14 This certainly explains why only a very small number of parallel sentences could be extracted.
13 99 languages have more than 5,000 parallel sentences.
14 https://en.wikipedia.org/wiki/Lsjbot
[Table 3 is a large matrix giving, for each of the high-resource languages (ar, az, ba, be, bg, bn, bs, ca, cs, da, de, el, en, eo, es, et, eu, fa, fi, fr, gl, he, hi, hr, hu, id, is, it, ja, kk, ko, lt, mk, ml, mr, ne, nl, no, oc, pl, pt, ro, ru, sh, si, sk, sl, sq, sr, sv, sw, ta, te, tl, tr, tt, uk, vi, zh), its language family, its monolingual corpus size, the number of sentences mined against every other language, and the total.]

Table 3: WikiMatrix: number of extracted sentences for each language pair (in thousands), e.g. Es-Ja = 219 corresponds to 219,260 sentences (rounded). The column "size" gives the number of lines in the monolingual texts after deduplication and LID.
Src/Trg  ar  bg  bs  cs  da  de  el  en  eo  es  fr  fr-ca  gl  he  hr  hu  id  it  ja  ko  mk  nb  nl  pl  pt  pt-br  ro  ru  sk  sl  sr  sv  tr  uk  vi  zh-cn  zh-tw
ar 4.9 1.8 3.0 4.5 3.8 6.7 20.3 4.1 13.2 12.2 9.0 5.6 3.5 2.2 2.7 9.2 9.9 4.2 5.3 5.5 4.9 4.4 3.0 12.0 12.2 5.6 5.6 1.5 2.7 1.2 4.0 2.4 4.5 12.3 8.2 4.9bg 3.0 4.2 6.7 8.7 8.5 10.0 25.3 7.7 16.3 14.7 11.9 8.4 3.2 6.4 4.7 10.3 12.2 4.0 5.4 16.2 8.1 7.7 6.3 14.7 15.4 9.8 12.4 4.0 6.4 2.7 7.1 2.4 10.4 11.7 6.4 4.7bs 1.2 6.1 4.1 3.7 5.8 4.5 21.7 9.9 6.7 4.6 5.7 1.0 30.9 2.4 5.3 6.5 1.4 12.9 3.8 2.9 9.9 10.9 5.7 4.8 3.1 10.4 10.6 4.4 1.5 5.4 5.8 2.8 1.8cs 1.9 7.8 3.7 7.1 8.3 6.4 20.0 10.4 12.6 11.4 8.6 5.0 2.5 6.5 4.9 7.8 9.6 4.1 5.5 5.0 6.3 7.6 8.1 10.8 12.1 6.3 9.4 28.1 6.7 1.6 7.0 2.6 7.8 9.0 6.6 4.7da 2.0 8.9 4.0 5.2 14.0 9.0 32.9 6.7 16.7 16.7 12.8 7.3 3.5 4.4 4.7 10.8 13.4 4.7 6.0 6.2 33.1 12.4 4.8 14.0 16.2 8.5 7.8 3.1 5.2 1.4 25.8 2.7 6.3 11.2 7.3 4.9de 2.4 9.7 4.9 8.1 16.9 7.8 24.5 15.9 17.4 18.3 14.7 6.8 4.3 5.5 7.2 8.6 13.5 6.4 6.6 5.6 11.5 17.6 6.8 14.2 15.2 8.7 9.2 5.4 8.6 1.5 12.7 3.6 7.8 11.3 9.2 4.3el 4.1 11.2 4.8 5.5 9.7 6.7 27.9 8.0 18.8 16.3 13.5 10.1 4.0 5.6 5.3 13.1 15.3 5.5 6.1 10.2 9.3 8.7 5.1 18.0 18.3 10.4 8.4 3.1 6.4 2.0 7.2 3.4 7.2 14.4 8.7 5.3en 11.9 23.9 14.7 15.5 30.9 20.4 27.1 22.6 35.8 32.6 25.1 24.3 17.3 18.8 13.5 28.8 29.5 10.2 18.6 21.8 31.8 25.1 12.0 31.4 37.0 20.4 17.4 13.8 16.5 5.5 29.1 10.3 17.6 26.9 18.0 10.7eo 1.8 6.4 7.3 7.4 13.5 8.1 23.1 16.1 17.6 12.7 10.5 1.9 3.0 4.8 9.2 13.9 2.5 4.8 6.2 7.2 12.1 6.7 12.4 16.0 6.9 8.6 7.4 5.6 0.6 8.8 1.8 5.4 7.7 5.2 3.6es 6.2 14.3 6.8 8.0 15.7 12.9 16.4 33.2 13.5 25.6 19.9 30.1 8.1 8.7 7.7 16.1 23.8 7.9 9.9 11.9 13.3 14.1 8.1 27.6 27.8 14.7 11.6 6.0 8.1 2.6 13.7 5.2 10.0 17.8 12.3 6.6fr 6.0 12.7 5.0 8.3 15.6 14.5 15.4 31.6 18.6 26.4 17.2 6.8 6.9 7.7 15.0 24.6 7.3 8.7 10.7 13.1 15.9 8.1 24.2 26.0 16.5 12.7 5.8 7.1 2.1 13.8 4.2 9.8 17.0 11.1 6.6fr-ca 4.9 12.5 2.8 7.8 14.3 12.9 15.4 27.8 14.0 23.7 18.1 6.8 6.8 7.9 13.7 23.4 7.5 8.8 10.3 15.1 7.2 18.6 23.2 15.3 11.3 6.1 6.6 3.3 12.5 5.0 9.7 15.4 6.2gl 2.6 7.3 4.7 2.9 7.4 5.3 8.5 23.4 9.2 34.4 16.0 15.2 2.5 3.5 2.8 9.8 19.3 3.6 4.2 5.6 6.9 6.5 4.3 22.4 23.7 7.7 5.9 1.7 3.1 0.2 4.3 2.3 3.9 9.4 5.8 4.2he 3.4 5.7 1.8 3.7 6.1 5.4 6.3 25.7 4.2 15.3 13.6 10.3 5.1 2.5 3.2 8.9 11.2 4.1 4.5 4.6 5.4 6.5 3.7 13.7 15.0 6.3 8.0 1.9 2.6 1.0 5.2 1.8 5.1 10.7 6.9 4.6hr 1.6 8.7 29.9 7.0 6.5 5.8 6.6 24.4 4.4 12.6 9.9 7.8 5.7 1.5 4.3 8.2 10.4 1.8 4.0 14.1 5.3 6.1 5.4 12.1 12.6 7.3 8.3 4.6 12.6 11.7 6.0 1.9 6.9 9.4 4.8 2.9hu 1.6 5.6 2.6 5.9 7.0 5.5 16.7 6.5 10.8 10.9 9.3 3.9 2.1 3.6 6.6 8.2 4.4 6.2 3.9 4.5 6.0 4.3 9.7 9.9 7.1 5.8 3.4 4.2 1.2 5.3 3.0 4.4 8.5 6.5 4.2id 4.1 9.1 4.2 5.2 10.1 6.7 11.1 24.9 8.2 16.4 15.1 11.1 9.9 5.1 5.6 5.2 12.7 5.8 9.1 7.0 10.0 9.4 5.5 14.6 16.7 9.8 8.1 3.6 5.6 1.8 9.3 4.6 7.3 18.5 11.0 6.2it 5.3 11.7 5.0 6.9 13.1 11.5 14.5 30.0 13.9 26.4 24.9 20.0 19.3 6.2 7.0 6.7 14.0 7.3 9.0 9.9 13.3 12.8 7.3 22.8 24.9 13.3 10.2 5.4 7.1 2.3 11.9 4.6 8.8 15.3 10.7 5.8ja 1.4 1.9 0.7 1.8 3.1 2.7 2.5 7.9 2.2 6.0 6.0 4.7 2.3 1.4 1.3 2.0 3.5 4.6 16.9 1.6 2.6 3.1 1.8 5.1 4.9 2.8 2.7 1.2 1.8 0.5 2.6 1.9 2.2 5.9ko 0.9 1.7 1.3 2.0 1.7 1.7 8.7 1.3 4.7 4.4 3.4 1.5 0.9 0.7 1.5 3.1 3.2 9.2 1.2 1.3 1.9 1.4 4.3 4.6 2.0 2.1 1.0 1.5 0.4 1.5 1.4 1.5 4.3mk 2.4 18.2 12.0 5.4 7.2 4.3 10.3 23.4 8.9 15.2 11.5 10.0 5.4 4.0 12.6 3.7 8.9 11.1 3.5 4.7 7.5 6.7 4.5 13.9 15.6 8.3 7.5 3.5 7.1 3.7 5.6 2.0 6.4 10.9 6.3 4.3nb 2.7 8.5 5.4 32.7 9.8 8.9 35.1 9.5 17.0 14.6 8.0 3.1 3.9 4.5 9.9 15.0 5.5 5.7 6.7 4.7 14.2 9.5 7.5 3.2 3.9 0.6 26.3 1.9 6.2 7.8nl 2.4 8.2 2.9 5.9 14.2 16.1 8.4 26.5 13.4 16.8 16.7 13.5 7.3 3.9 4.8 5.3 11.4 13.3 5.2 5.9 5.9 5.3 13.8 15.4 7.8 7.6 4.1 5.1 1.6 11.1 3.2 6.1 10.6 8.0 5.1pl 1.8 
7.4 2.9 8.2 6.6 6.5 5.4 15.1 7.5 11.4 11.3 8.6 5.2 2.3 4.8 4.0 7.5 8.6 4.2 5.7 5.2 4.1 6.2 9.6 9.9 6.2 9.5 6.2 5.3 1.5 5.6 2.4 8.9 7.7 6.3 3.6pt 7.4 15.2 7.0 8.0 15.5 13.2 18.7 35.0 12.0 32.4 26.7 19.0 23.0 8.8 10.3 8.4 17.5 24.4 7.8 10.6 13.1 14.3 14.9 8.9 15.3 11.7 6.6 8.1 2.0 14.3 6.2 9.3 18.0 12.9 5.8pt-br 6.5 14.7 7.4 8.6 16.8 12.9 17.6 37.3 16.0 31.0 26.6 20.3 23.0 8.7 9.8 8.1 18.6 24.8 7.8 10.7 12.5 14.8 8.5 15.1 11.8 6.4 8.9 2.8 14.6 5.3 10.8 18.8 13.2 6.7ro 3.2 9.7 3.7 5.1 9.4 7.5 10.4 25.0 6.7 18.8 19.3 14.6 10.0 4.0 5.7 5.9 11.0 15.5 4.3 6.4 7.3 8.0 8.1 5.2 15.4 17.7 8.0 3.6 5.0 1.9 7.0 3.3 6.6 12.7 7.8 4.9ru 3.3 12.6 4.2 7.7 8.5 8.8 8.3 18.7 9.9 14.3 14.5 11.0 6.0 4.9 6.8 5.6 9.5 11.7 6.1 7.7 7.4 8.0 8.1 8.9 12.4 13.9 8.2 5.8 5.4 2.7 8.2 2.9 22.5 11.5 9.1 5.2sk 0.7 5.1 2.7 27.0 4.3 5.7 3.2 16.9 9.3 9.4 8.5 6.7 2.7 1.0 5.1 3.7 4.9 6.9 2.2 3.9 3.5 4.9 5.0 7.1 7.8 8.5 3.5 6.6 5.0 1.5 4.3 1.6 5.4 5.4 2.3 2.5sl 1.2 6.2 7.6 5.5 4.7 7.6 5.8 17.3 5.9 11.4 8.5 6.4 3.2 1.2 11.2 3.9 6.5 7.8 2.7 4.2 6.3 4.3 5.5 4.8 9.9 4.8 5.9 3.6 1.9 4.1 2.2 4.3 7.4 3.8 2.6sr 1.8 7.6 33.2 5.1 3.7 3.8 5.8 22.8 3.2 11.9 9.1 7.5 3.5 1.1 30.4 2.7 6.1 8.2 1.2 2.9 13.7 3.3 3.3 3.9 10.8 11.3 5.6 7.2 2.8 9.5 3.0 1.1 5.7 6.8 4.3 3.1sv 2.2 7.4 4.8 5.9 26.5 12.6 8.1 31.8 11.0 16.9 15.7 10.7 7.1 3.3 5.5 5.4 11.6 13.3 4.8 6.2 5.2 25.4 11.5 5.8 15.0 17.4 7.9 8.1 3.8 4.7 1.0 3.2 6.9 12.9 7.8 4.8tr 2.2 3.5 2.0 2.6 3.9 4.1 4.7 15.9 2.9 9.4 7.7 6.7 3.6 1.6 2.1 3.4 6.7 6.4 4.3 7.0 3.5 3.1 4.2 2.5 9.0 8.4 4.6 4.0 1.8 2.3 0.8 3.5 3.3 8.2 6.7 4.4uk 2.9 12.3 5.3 7.4 7.5 7.5 8.4 20.7 6.5 14.2 14.1 11.2 5.5 3.5 6.6 4.7 9.5 11.2 4.9 5.8 7.2 6.3 6.9 9.6 12.9 7.2 23.5 4.9 5.7 2.6 6.9 2.6 11.4 7.9 4.9vi 4.2 7.5 4.0 4.7 8.5 6.0 8.8 20.2 7.3 13.7 13.2 9.9 6.5 4.6 4.9 4.7 14.7 10.7 5.6 9.3 6.9 5.7 7.3 4.5 13.0 14.1 8.5 7.2 3.4 4.6 1.7 8.2 4.0 6.7 9.9 6.7zh-cn 2.1 3.2 1.0 2.2 3.8 3.2 4.5 11.8 3.8 8.2 7.6 3.2 1.7 1.9 3.0 6.6 6.0 3.4 3.8 2.2 7.1 7.9 4.1 4.1 1.6 2.4 0.9 3.1 2.3 3.0 10.8zh-tw 2.2 3.1 1.1 2.1 3.7 2.8 3.9 10.7 3.4 7.5 7.2 6.1 2.8 1.8 1.6 3.0 6.2 5.4 2.8 3.5 2.3 6.3 6.9 3.5 3.9 1.4 2.1 0.9 3.0 2.4 2.9 10.0
Table 4: BLEU scores on the TED test set as proposed in (Qi et al., 2018). NMT systems were trained on bitexts mined in Wikipedia only (with at least twenty-five thousand parallel sentences). No other resources were used.
Although the same bot was also used to generate articles in the Swedish Wikipedia, our alignments seem to be better for that language.
5.2 Qualitative evaluation
Aiming to perform a large-scale assessment of the quality of the extracted parallel sentences, we trained NMT systems on them. We identified a publicly available dataset which provides test sets for many language pairs: translations of TED talks as proposed in the context of a study on pretrained word embeddings for NMT15 (Qi et al., 2018). We would like to emphasize that we did not use the training data provided by TED; we only trained on the mined sentences from Wikipedia. The goal of this study is not to build state-of-the-art NMT systems for the TED task, but to get an estimate of the quality of our extracted data, for many language pairs. In particular, there may be a mismatch in topic and language style between Wikipedia texts and the transcribed and translated TED talks.
For training NMT systems, we used a transformer model from fairseq (Ott et al., 2019)
15https://github.com/neulab/word-embeddings-for-nmt
with the parameter settings shown in Figure 3 in the appendix. For preprocessing, the text was tokenized using the Moses tokenizer (without true-casing) and a 5000-subword vocabulary was learnt using SentencePiece (Kudo and Richardson, 2018). Decoding was done with beam size 5 and length normalization 1.2.
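A minimal sketch of this subword step with the SentencePiece Python package is given below; the file names are illustrative, and the training file is assumed to contain one Moses-tokenized sentence per line.

import sentencepiece as spm

# learn a 5000-piece subword vocabulary on the mined training bitext
spm.SentencePieceTrainer.train(
    input="train.tok.txt",            # hypothetical file, one sentence per line
    model_prefix="wikimatrix_5k",
    vocab_size=5000,
)

sp = spm.SentencePieceProcessor(model_file="wikimatrix_5k.model")
print(sp.encode("This is a very big house", out_type=str))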
We evaluate the trained translation systems on the TED dataset (Qi et al., 2018). The TED data consists of parallel TED talk transcripts in multiple languages, and it provides development and test sets for 50 languages. Since the development and test sets were already tokenized, we first detokenize them using Moses. We trained NMT systems for all possible language pairs with more than twenty-five thousand mined sentences. This gives us in total 1886 language pairs in 45 languages. We train L1 → L2 and L2 → L1 with the same mined bitexts L1/L2. Scores on the test sets were computed with SacreBLEU (Post, 2018). Table 4 summarizes all the results. Due to space constraints, we are unable to report BLEU scores for all language combinations in that table; some additional results are reported in Table 6 in the annex. 23 NMT systems achieve BLEU scores over 30, the best one being 37.3 for Brazilian Portuguese to English.
Several results are worth mentioning, like Farsi/English: 16.7, Hebrew/English: 25.7, Indonesian/English: 24.9 or English/Hindi: 25.7. We also achieve interesting results for translation between various non-English language pairs for which it is usually not easy to find parallel data, e.g. Norwegian ↔ Danish ≈33, Norwegian ↔ Swedish ≈25, Indonesian ↔ Vietnamese ≈16 or Japanese ↔ Korean ≈17.
Our results on the TED set give an indication of the quality of the mined parallel sentences. These BLEU scores should of course be appreciated in the context of the sizes of the mined corpora, as given in Table 3. Obviously, we cannot exclude that the provided data contains some wrong alignments even though the margin is large. Finally, we would like to point out that we ran our approach on all available languages in Wikipedia, independently of the quality of LASER's sentence embeddings for each one.
6 Conclusion
We have presented an approach to systematically mine for parallel sentences in the textual content of Wikipedia, for all possible language pairs. We use a recently proposed mining approach based on massively multilingual sentence embeddings (Artetxe and Schwenk, 2018b) and a margin criterion (Artetxe and Schwenk, 2018a). The same approach is used for all language pairs without the need for language-specific optimization. In total, we make available 135M parallel sentences in 85 languages, out of which only 34M sentences are aligned with English. We were able to mine more than ten thousand sentences for 1620 different language pairs. This corpus of parallel sentences is freely available.16 We also performed a large-scale evaluation of the quality of the mined sentences by training 1886 NMT systems and evaluating them on the 45 languages of the TED corpus (Qi et al., 2018).
This work opens several directions for future research. The mined texts could be used to first retrain LASER's multilingual sentence embeddings with the hope of improving the performance on low-resource languages, and then to rerun the mining in Wikipedia. This process could be repeated iteratively. We also plan to apply the same methodology to other large multilingual collections. The monolingual texts made available by ParaCrawl or
16https://github.com/facebookresearch/LASER/tree/master/tasks/WikiMatrix
CommonCrawl17 are good candidates.

We expect that the WikiMatrix corpus has mostly well-formed sentences and should not contain social media language. The mined parallel sentences are not limited to specific topics like many of the currently available resources (parliament proceedings, subtitles, software documentation, ...), but are expected to cover the many topics of Wikipedia. The fraction of unedited machine-translated text is also expected to be low. We hope that this resource will be useful to support research in multilinguality, in particular machine translation.

17 http://commoncrawl.org/
7 Acknowledgments
We would like to thank Edouard Grave for help with handling the Wikipedia corpus and Matthijs Douze for support with the use of FAISS.
References

Sadaf Abdul-Rauf and Holger Schwenk. 2009. On the Use of Comparable Corpora to Improve SMT performance. In EACL, pages 16–23.

Sisay Fissaha Adafre and Maarten de Rijke. 2006. Finding similar sentences across multiple languages in Wikipedia. In Proceedings of the Workshop on NEW TEXT Wikis and blogs and other dynamic text sources.

Ahmad Aghaebrahimian. 2018. Deep neural networks at the service of multilingual parallel sentence extraction. In Coling.

Mikel Artetxe and Holger Schwenk. 2018a. Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings. https://arxiv.org/abs/1811.01136.

Mikel Artetxe and Holger Schwenk. 2018b. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. https://arxiv.org/abs/1812.10464.

Andoni Azpeitia, Thierry Etchegoyhen, and Eva Martinez Garcia. 2017. Weighted Set-Theoretic Alignment of Comparable Sentences. In BUCC, pages 41–45.

Andoni Azpeitia, Thierry Etchegoyhen, and Eva Martinez Garcia. 2018. Extracting Parallel Sentences from Comparable Corpora with STACC Variants. In BUCC.

Houda Bouamor and Hassan Sajjad. 2018. H2@BUCC18: Parallel Sentence Extraction from Comparable Corpora Using Multilingual Sentence Embeddings. In BUCC.

Christian Buck and Philipp Koehn. 2016. Findings of the WMT 2016 bilingual document alignment shared task. In Proceedings of the First Conference on Machine Translation, pages 554–563, Berlin, Germany. Association for Computational Linguistics.

Vishrav Chaudhary, Yuqing Tang, Francisco Guzman, Holger Schwenk, and Philipp Koehn. 2019. Low-resource corpus filtering using multilingual sentence embeddings. In Proceedings of the Fourth Conference on Machine Translation (WMT).

Cristina Espana-Bonet, Adam Csaba Varga, Alberto Barron-Cedeno, and Josef van Genabith. 2017. An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification. IEEE Journal of Selected Topics in Signal Processing, pages 1340–1348.

Thierry Etchegoyhen and Andoni Azpeitia. 2016. Set-Theoretic Alignment for Comparable Corpora. In ACL, pages 2009–2018.

Simon Gottschalk and Elena Demidova. 2017. MultiWiki: Interlingual text passage alignment in Wikipedia. ACM Transactions on the Web (TWEB), 11(1):6.

Francis Gregoire and Philippe Langlais. 2017. BUCC 2017 Shared Task: a First Attempt Toward a Deep Learning Framework for Identifying Parallel Sentences in Comparable Corpora. In BUCC, pages 46–50.

Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Effective Parallel Corpus Mining using Bilingual Sentence Embeddings. arXiv:1807.11906.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving Human Parity on Automatic Chinese to English News Translation. arXiv:1803.05567.

Jeff Johnson, Matthijs Douze, and Herve Jegou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. https://arxiv.org/abs/1607.01759.

H. Jegou, M. Douze, and C. Schmid. 2011. Product quantization for nearest neighbor search. IEEE Trans. PAMI, 33(1):117–128.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit.

Philipp Koehn, Francisco Guzman, Vishrav Chaudhary, and Juan M. Pino. 2019. Findings of the WMT 2019 shared task on parallel corpus filtering for low-resource conditions. In Proceedings of the Fourth Conference on Machine Translation, Volume 2: Shared Task Papers, Florence, Italy. Association for Computational Linguistics.

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739, Belgium, Brussels. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.

P. Lison and J. Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In LREC.

Mehdi Zadeh Mohammadi and Nasser GhasemAghaee. 2010. Building bilingual parallel corpora based on Wikipedia. In 2010 Second International Conference on Computer Engineering and Applications, pages 264–268.

Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving Machine Translation Performance by Exploiting Non-Parallel Corpora. Computational Linguistics, 31(4):477–504.

P. Otero, I. Lopez, S. Cilenis, and Santiago de Compostela. 2011. Measuring comparability of multilingual corpora extracted from Wikipedia. Iberian Cross-Language Natural Language Processing Tasks (ICL), page 8.

Pablo Gamallo Otero and Isaac Gonzalez Lopez. 2010. Wikipedia as multilingual source of comparable corpora. In Proceedings of the 3rd Workshop on Building and Using Comparable Corpora, LREC, pages 21–25.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Alexandre Patry and Philippe Langlais. 2011. Identifying parallel documents from a large bilingual collection of texts: Application to parallel article extraction in Wikipedia. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pages 87–95. Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.

Philip Resnik. 1999. Mining the Web for Bilingual Text. In ACL.

Philip Resnik and Noah A. Smith. 2003. The Web as a Parallel Corpus. Computational Linguistics, 29(3):349–380.

Holger Schwenk. 2018. Filtering and mining parallel data in a joint multilingual space. In ACL, pages 228–234.

Jason R. Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In NAACL, pages 403–411.

J. Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In LREC.

Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598.

Dan Tufis, Radu Ion, Stefan Daniel Dumitrescu, and Dan Stefanescu. 2013. Wikipedia as an SMT training corpus. In RANLP, pages 702–709.

Masao Utiyama and Hitoshi Isahara. 2003. Reliable Measures for Aligning Japanese-English News Articles and Sentences. In ACL.

Yinfei Yang, Gustavo Hernandez Abrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. https://arxiv.org/abs/1902.08564.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In LREC.
A Appendix
Table 5 provides the amounts of mined parallel sentences for languages which have a rather small Wikipedia. Aligning those languages obviously yields a very small amount of parallel sentences. Therefore, we only provide these results for alignment with high-resource languages. It is also likely that several of these alignments are of low quality since the LASER embeddings were not directly trained on most of these languages, but we still hope to achieve reasonable results since other languages of the same family may be covered.
[Table 5 lists, for about 45 languages with a small Wikipedia (from Aragonese, Egyptian Arabic and Assamese through Luxembourgish, Sicilian and Tajik to Wu Chinese and Yiddish), the ISO code, language family, monolingual corpus size, and the number of sentences (in thousands) mined against Catalan, Danish, German, English, Spanish, French, Italian, Dutch, Polish, Portuguese, Swedish, Russian and Chinese, plus the total.]

Table 5: WikiMatrix (part 2): number of extracted sentences (in thousands) for languages with a rather small Wikipedia. Alignments with other languages yield less than 5k sentences and are omitted for clarity.
Figure 3 gives the detailed configuration which was used to train the NMT models on the mined data in Section 5.
--arch transformer
--share-all-embeddings
--encoder-layers 5
--decoder-layers 5
--encoder-embed-dim 512
--decoder-embed-dim 512
--encoder-ffn-embed-dim 2048
--decoder-ffn-embed-dim 2048
--encoder-attention-heads 2
--decoder-attention-heads 2
--encoder-normalize-before
--decoder-normalize-before
--dropout 0.4
--attention-dropout 0.2
--relu-dropout 0.2
--weight-decay 0.0001
--label-smoothing 0.2
--criterion label_smoothed_cross_entropy
--optimizer adam
--adam-betas '(0.9, 0.98)'
--clip-norm 0
--lr-scheduler inverse_sqrt
--warmup-updates 4000
--warmup-init-lr 1e-7
--lr 1e-3
--min-lr 1e-9
--max-tokens 4000
--update-freq 4
--max-epoch 100
--save-interval 10
Figure 3: Model settings for NMT training with fairseq.
Finally, Table 6 gives the BLEU scores on the TED corpus when translating into and from English for some additional languages.
Lang   xx→en   en→xx
et     15.9    14.3
eu     10.1     7.6
fa     16.7     8.8
fi     10.9    10.9
lt     13.7    10.0
hi     17.8    21.9
mr      2.6     3.5
Table 6: BLEU scores on the TED test set as proposed in (Qi et al., 2018). NMT systems were trained on bitexts mined in Wikipedia only. No other resources were used.