
Comparing Languages and Dialects in Central Asia

René van der Ark, MA-Thesis, University of Groningen

27-06-2008


Contents

1 Introduction

2 Data Description

3 Levenshtein Distance Analysis
  3.1 Method Description
      3.1.1 General Description of Approach
      3.1.2 The Levenshtein Algorithm
      3.1.3 Levenshtein Applied to Dialectometric Research
      3.1.4 Interpretation of Edit-Distances using Several Visualisation Techniques
      3.1.5 Hierarchical Agglomerative Clustering
      3.1.6 Multidimensional Scaling
  3.2 Application of Levenshtein Distance Analysis
      3.2.1 Matrix Generation
      3.2.2 Clustering
      3.2.3 Multidimensional Scaling
      3.2.4 Differences in Swadesh Lists

4 Exploring Automatic Loanword-Detection
  4.1 Method Description
      4.1.1 General Description of Approach
      4.1.2 Applying Precision / Recall to Pairwise Levenshtein Results
      4.1.3 Paired Edit-Distances Versus a Generalized Matrix
      4.1.4 Generating Precision / Recall Figures Based on Preclassification
      4.1.5 Conjecture of Borrowing-Threshold Based on Precision / Recall
  4.2 Application of Loanword Detection

5 Mix of Distributions Analysis
  5.1 Method Description
  5.2 Results of Mix of Distributions Analysis

6 Discussion

Appendix


1 Introduction

This thesis is the result of a collaboration between the Paris anthropological research institute Musée de l'Homme and the University of Groningen. The purpose of the larger research project (led by Evelyne Heyer of Musée de l'Homme) is to analyse the common traits between genetic and linguistic markers in the countries Uzbekistan, Tajikistan and Kyrgyzstan. The task of the University of Groningen was to quantify the differences in the linguistic data, acquired and preclassified by Philippe Mennecier, by means of the Levenshtein algorithm, a procedure which measures the distance between two strings of tokens [3]. Unique about the data provided is that they span multiple languages and even two larger language groups.

This thesis has two main research questions. The first is whether the methods developed by Wilbert Heeringa, using Levenshtein distance and aggregative statistics, can be applied to the linguistic data provided by Mennecier; the second is whether Levenshtein distance is suitable for the automatic detection of loanwords in linguistic data. A secondary goal within these two main theses is to test whether reducing the wordlist used in the analysis has a significant impact on the results.

The first thesis can be formulated as follows. It has been shown that the Levenshtein measure is effective in separating dialects within a single language; we aim to show that it will work equally well to separate two different language groups. We assume this to be a straightforward task and will make use of the preclassifications made by Mennecier to reinforce this claim.

Secondly, we aim to discover whether Levenshtein distance can be adopted to automatically detect loanwords in phonetic transcriptions. To this end a novel use of precision/recall analysis is applied to the data, using the preclassification made by Mennecier. Next, the distribution of edit-distances for each pair of respondents is analysed as a mix of distributions by means of the EM algorithm, which is also a novel approach to the analysis of phonetic transcriptions.

The tertiary goal, analysed in both main theses, is to test whether reducing the full wordlist used in this research to only the words in the Swadesh-100 and Swadesh-200 wordlists will significantly impact the results of the analysis. This will be further clarified in the body of the thesis.


Figure 1: Map of the Central Asian region. As can be seen, an attempt was made to distribute the respondents equally. Some of the Kirgiz respondents and the Turkmen respondents are not part of the data for this thesis.

2 Data Description

The special topic of interest for Musée de l'Homme was combining the genetic data with the linguistic data acquired. This is the main topic of research for Franz Manni, who published earlier papers on the same topics [12][13]. For this purpose the Central Asian region is of special interest, precisely because there is a mix of roaming nomads, settled farmers and city folk. Although it is common in computational dialect comparison to have a spatial component with spatially equally distributed respondents, it is omitted for the purpose of this thesis. On the one hand, the purpose of the broader research — combining the genetic data with the linguistic data — does not explicitly require this spatial component, at least not in the early stage of research at which this thesis was formulated. On the other hand, the main focus of the two-fold thesis described in the introduction is not directly on the spatial component.

The data for this thesis were recorded by the linguist Philippe Mennecier, of the Musée de l'Homme (Paris), in the countries Uzbekistan, Kyrgyzstan and Tajikistan.


English   X-SAMPA (1)   IPA (1)   X-SAMPA (2)   IPA (2)
one       bIr           bɪr       i:            iː
two       jIk@          jɪkə      dU:           dʊː
three     ytS           ytʃ       saraj         saraj
four      t8rt          tɵrt      safOr         safɔr
five      b jIS         bʲɪʃ      panZ          panʒ
big       }lk@n         ʉlkən     kalOn         kalɔn
long      uzaq          uzaq      darOz         darɔz
wide      k jeN         kʲeŋ      kuSOd         kuʃɔd
thick     s jemIz       sʲemɪz    Gafs          ɣafs
heavy     awIr          awɪr      vazmin        vazmin

Table 1: Phonetic transcriptions in IPA and X-SAMPA

These three bordering countries are completely landlocked, bounded to the north by Kazakhstan and the Russian Federation, to the east by China and to the south by Afghanistan and Iran. Being a region of diverse harsh climates, ranging from mountainous in the east of Tajikistan and Kyrgyzstan (the Pamir and Tian Shan ranges) to arid in the north of Uzbekistan, it is difficult to acquire linguistic data which are evenly distributed across the region. Most people live in the center of these three countries, where their borders are tightly knotted around a spiral. Here there is a more moderate continental climate. Further complicating the equal distribution of respondents is the fact that some of the residents are nomadic tribes [18]. The map in Figure 1 provides a rough impression of the layout of the region and the distribution of respondents.

The data were originally annotated according to the IPA system. For the purpose of computational analysis, they needed to be transcribed to a more suitable system. X-SAMPA, an extension of the SAMPA notation, fully reflects all aspects of IPA in a way that can be easily parsed by computers, cross-platform, being fully based on ASCII characters [21]. This transcription work was also done by Mennecier. To give more insight into the way X-SAMPA is annotated, please refer to Table 1, which contains a sample of the Central Asia dataset with IPA transcriptions and their corresponding X-SAMPA transcriptions.

The languages spoken in the region are part of two language groups: Indo-Iranian and Turkic. The Indo-Iranian languages analysed are Tajik and Yagnobi, and the Turkic languages are Kazakh, Karakalpak, Kirgiz and Uzbek.


SWADESH   Ka-Gazli 1   Ki-Ordaj 2   T-Agalik
one       bIr          b@r          jak
          A            A            C
two       jek@         EkkI         du # d8
          A            A            B # B
four      t8rt         t8rt         tSahOr
          A            A            C
five      b jIs        b jeS        pandZ
          A            A            C
big       }lkIn        katta        kalOn
          A            C            D

Table 2: A sample of the provided transcriptions and preclassifications, used in this thesis. Two pronunciations by the same respondent are separated by a hash symbol ('#').

The data used for this thesis are transcriptions of pronunciations by 78 respondents, 39 of whom belong to the Indo-Iranian group and another 39 to the Turkic group. The full list of words used for the pronunciations consists of 176 items and contains words from both the Swadesh-100 list (86 items) and the Swadesh-200 list (163 items), as is shown in Appendix A.1.¹

All respondents and pronunciations have been preclassified by Mennecier. The respondents have been coded based on Mennecier's estimation of which of the six languages they belong to: Karakalpak (KK), Kazakh (Ka), Kirgiz (Ki), Uzbek (U), Tajik (T), Yagnobi (Y). Their region of origin has also been recorded, along with an index for cases where there is more than one respondent per site. Each separate pronunciation has also been coded independently with a letter. All pronunciations coded with the same letter are estimated to be either cognates or transfers of one another. As described in the introduction, the dialectometric analysis by the Levenshtein algorithm is tested against these preclassifications. To get an impression of the data provided, refer to Table 2.

¹ Two of these have been omitted, due to incomplete recording.


3 Levenshtein Distance Analysis

3.1 Method Description

3.1.1 General Description of Approach

This section describes the application of the Levenshtein measure to the phonological transcriptions gathered in the field by Mennecier. Whereas the focus shifts to loanword detection in the next section, this section describes the application of methods originally applied by Heeringa, which have become standard in recent computational dialectometric research [3]. The analysis by Heeringa was made by means of the L04 software package developed by Kleiweg [9]. This research adheres to that approach.

The initial intent was to apply standard methods to the data gathered in Central Asia. In a later stage of the analysis it became apparent that our data lend themselves to a further focus on loanword detection. Still, the methods originally applied by Heeringa provide a wide range of analyses which serve as a foundation for the second half of the research.

3.1.2 The Levenshtein Algorithm

The algorithm developed by Levenshtein in the latter half of the 1960s is a means to calculate the edit-distance between two given strings of tokens [11]. As its name implies, the edit-distance measure assumes that the distance between two strings bears a one-to-one relation to the number of edits needed to transform one string of tokens into the other. In fact, the edit-distance is nothing more than the number of edit actions taken in this transformation.

When computing the measure there are three legal edit actions: insertion, deletion and substitution. An insertion is the addition of a new token into a string. A deletion is the removal of one token from a string. In a substitution one token is replaced by another. Each edit action is counted as one point in the distance score, resulting in an absolute number. To illustrate the measure, consider Table 3.
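To make the computation behind Table 3 concrete, the dynamic-programming form of the algorithm can be sketched as follows (a minimal Python sketch, not the L04 implementation used in this thesis; the normalisation by the length of the longer pronunciation is an assumption about how the word-length normalisation mentioned later is applied):

    def levenshtein(source, target):
        """Plain Levenshtein edit-distance between two token strings.
        Cell d[i][j] holds the distance between the first i tokens of
        `source` and the first j tokens of `target`."""
        rows, cols = len(source) + 1, len(target) + 1
        d = [[0] * cols for _ in range(rows)]
        for i in range(rows):
            d[i][0] = i                                # i deletions
        for j in range(cols):
            d[0][j] = j                                # j insertions
        for i in range(1, rows):
            for j in range(1, cols):
                cost = 0 if source[i - 1] == target[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[rows - 1][cols - 1]

    # The pair from Table 3 ('round', X-SAMPA): distance 4.
    print(levenshtein("ajlana", "l0nda"))
    # Normalised by the length of the longer pronunciation (assumption): 4/6.
    print(levenshtein("ajlana", "l0nda") / max(len("ajlana"), len("l0nda")))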

Since the introduction of the edit-distance measure into computational dialectometry it has also taken root in the field of genealogy, as a means to calculate the distance between genetic strains. For a more detailed description of the Levenshtein algorithm the reader is referred to the PhD thesis by Wilbert Heeringa [3].


        l   0   n   d   a
    0   1   2   3   4   5    null
a   1   1   2   3   4   4    insertion
j   2   2   2   3   4   5    insertion
l   3   2   3   3   4   5    null
a   4   3   3   4   4   4    substitution
n   5   4   4   3   4   5    deletion
a   6   5   5   4   4   4    null

Total edit-distance: 4

Table 3: Edit-distance for ‘round’. Pronunciations from respondents: horiz.,site:Agalik, lang:Tajik; vert., site:Zarmanak, lang:Uzbek.

                    1      2      3      4      5      6      7
1 (Ka Gazli 1)             0.14   0.12   0.16   0.17   0.15   0.28
2 (Ka Gazli 2)      0.14          0.28   0.26   0.26   0.17   0.08
3 (Ki Ordaj 1)      0.12   0.28          0.28   0.32   0.28   0.33
4 (KK Halqabad 1)   0.16   0.26   0.28          0.21   0.52   0.47
5 (U Novmetan 3)    0.17   0.26   0.32   0.21          0.49   0.49
6 (T Agalik)        0.15   0.17   0.28   0.52   0.49          0.46
7 (Y Dugova 1)      0.28   0.08   0.33   0.47   0.49   0.46

Table 4: Sample of the Central Asia respondent × respondent edit-distance matrix.

3.1.3 Levenshtein Applied to Dialectometric Research

The aim of this stage of the research is to compute the average edit distance between the pronunciations of all 78 respondents. All transcriptions of the sample words per respondent are used as input in the form of X-SAMPA notation, as explained in the data description. The output generated is a 78×78 matrix of edit-distances between each pair of respondents, where the distance for each individual pronunciation pair is averaged.
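A sketch of how such an aggregated matrix could be assembled, reusing the levenshtein function from the previous sketch (the data layout — a dictionary mapping each respondent to its per-word list of X-SAMPA pronunciations — and the normalisation by the longer pronunciation's length are assumptions, not the L04 format):

    from itertools import combinations

    def mean_distance(prons_a, prons_b):
        """Average normalised edit-distance over all pronunciation pairs for
        one word; a respondent may have more than one pronunciation."""
        dists = [levenshtein(a, b) / max(len(a), len(b))
                 for a in prons_a for b in prons_b]
        return sum(dists) / len(dists)

    def respondent_matrix(data):
        """data: {respondent: {word: [pronunciations]}} (assumed layout).
        Returns a symmetric dict of dicts with one averaged distance per pair."""
        matrix = {r: {} for r in data}
        for r1, r2 in combinations(data, 2):
            shared = [w for w in data[r1] if w in data[r2]]  # skip missing recordings
            per_word = [mean_distance(data[r1][w], data[r2][w]) for w in shared]
            matrix[r1][r2] = matrix[r2][r1] = sum(per_word) / len(per_word)
        return matrix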

Originally the L04 software, developed for this purpose, was designed to analyse dialects of one language. It was also specifically designed with the intent of generating regional dialect maps. That part of the analysis is omitted from this thesis.


In our research the recorded pronunciations hail from respondents who not only speak multiple distinct languages, but moreover speak languages from two language groups: Turkic and Indo-Iranian. Due to this fact and the difficulties the geography of the region presents, the resulting input data consist of few respondents. Because of the aggregative nature of the proposed analysis, the reliability of the results tends not to suffer from having less data as input. The complexity of having such diverse languages in the data might also cause difficulties. Still, as hypothesised, the fact that multiple language groups are analysed also provides the opportunity to test the effectiveness of the approach in recognising different languages and language groups.

Another drawback resulting from the difficulties in finding suitable respondents, however, is the fact that the spread of respondents is limited. Considering that 39 respondents per language group are a small number to begin with, the analysis is destabilised further by the fact that multiple respondents were used per research site. Nevertheless, pronunciation within a single research location did show enough diversity to analyse the edit-distances of each respondent separately. And, as this research does not primarily take into account the relation between topographical situation and dialect difference, the data prove adequate for pure pronunciation analysis.

Two topics of interest which can be viewed as specific to the data from Central Asia relate more to the section on loanwords, but need to be mentioned here as well, to give a clear insight into how the data for the loanword analysis were acquired.

Firstly, due to the fact that two main language groups are analysed, it is a central goal of the research to test whether these language groups are clearly recognised by the software and can be properly grouped. It is hypothesised and expected that this should be a straightforward task which can be successfully executed. Still, for a more detailed view in this first stage of analysis, the two language groups are also analysed separately, using only these subgroups as input.

Secondly, special focus can also be put on the selection of word samples which were chosen by Mennecier. Of the total set of 176 words, 86 can also be found in the Swadesh wordlist of 100 words and 163 words can be found in the extended Swadesh-200 list. These words are, in general, considered to be most resistant to lexical replacement, due to either frequent use or the fact that they have been in use for a comparatively long time [17]. The classic example of a word which seems to be unique to each separate language is the word 'butterfly'. Also, quite a number of strong verbs tend to have a long history of frequent use in all languages. Consider the fact that each language has its own variety of the verbs 'to have' and 'to be', which are used in a large percentage of common speech and therefore have a tendency to evolve in a localised direction. Of course it is also expected that these words are seldom, or never, borrowed. This research aims to put this hypothesis to the test. Therefore the input data have also been split up into three groups: one for all words, one for the Swadesh-100 list and one for the Swadesh-200 list. The results of these separate subgroups can only be compared when the words are grouped before being used as input for the L04 application for Levenshtein distances.

After computation of the edit-distances, consistency measures are computed using Cronbach's alpha [16]. Cronbach's alpha increases with the size of the sample group, so it provides a good indication of whether the sample size for each subset is adequate. Next, Pearson's product-moment correlation is computed for all pairs of subsets of word lists, to test whether there is a significant difference between them. This test calculates the average product of z-scores for each pair of edit-distances, where edit-distance A was computed from wordlist A and edit-distance B was computed from wordlist B.
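The two checks can be sketched as follows, assuming the per-word distances are available as a table with one row per respondent pair and one column per word (this layout, and the use of NumPy rather than the original tooling, are assumptions for illustration):

    import numpy as np

    def cronbach_alpha(item_scores):
        """item_scores: 2-D array, rows = respondent pairs, columns = words,
        each cell a per-word edit-distance. Standard Cronbach's alpha."""
        item_scores = np.asarray(item_scores, dtype=float)
        k = item_scores.shape[1]                       # number of items (words)
        item_var = item_scores.var(axis=0, ddof=1).sum()
        total_var = item_scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_var / total_var)

    def pearson_between_wordlists(dist_a, dist_b):
        """Pearson's r between two aggregated 78x78 distance matrices
        (e.g. Swadesh-100 vs. Swadesh-200), over the upper triangle only."""
        a, b = np.asarray(dist_a), np.asarray(dist_b)
        iu = np.triu_indices_from(a, k=1)              # skip diagonal and duplicates
        return np.corrcoef(a[iu], b[iu])[0, 1]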

3.1.4 Interpretation of Edit-Distances using Several Visualisation Techniques

A matrix with a size of 78×78 units is difficult to interpret, because there is quite a significant number of separate scores to take into account when trying to get an overview. Even though the mass of data has already been reduced by the Levenshtein analysis from a great number of transcriptions to a more manageable set of pairs, it is still a far cry from an overview in which the relative positions of respondents to one another become apparent. With this in mind, several visualisation techniques have been developed by Heeringa and Kleiweg which are also used in this research.

The types of visualisation used are hierarchical agglomerative clustering and multidimensional scaling (MDS). These methods can also be used to generate regional dialect maps, which are especially suitable for regions where an equal spread of respondents speaking varieties of the same language can be managed. Since such a spread is not available here, mapping the Central Asian region is not expected to prove fruitful.

In this stage of the research we expect that the Levenshtein algorithm provides a more than adequate means of separating one language group from another. To test this supposition, the preclassifications made by Mennecier, predicting the dialect and language group of each respondent, are compared to the results of the above-mentioned visualisations.

3.1.5 Hierarchical Agglomerative Clustering

Hierarchical agglomerative clustering was developed in the field of psychometrics for "partitioning objects into optimally homogeneous groups" [6]. It is a means to group pairs in a matrix of a given size according to kinship. It works by selecting and "fusing" the pair of respondents which are minimally distant from one another, where the act of fusion (here) means averaging the distance between the two respondents. The iterative process of pair fusion reduces the matrix in size at each step and is repeated until all the respondents are placed into a dendrogram structure in which fused items are joined under a single node [20]. The length of the branches in the resulting dendrogram is determined by the distance recorded at the time of fusion. The thesis follows the 'unweighted pair group method using arithmetic averages' (UPGMA) approach, adhering to Heeringa's advice. See Table 5 for a visual description of clustering using this weighting method.
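For illustration, the same procedure can be run with SciPy's average-linkage clustering (a generic UPGMA implementation, not the L04 package used in the thesis), here applied to the seven-respondent sample from Table 4; the 0.3 cut-off is arbitrary and only serves to show how flat groups are read off the tree:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    labels = ["Ka Gazli 1", "Ka Gazli 2", "Ki Ordaj 1", "KK Halqabad 1",
              "U Novmetan 3", "T Agalik", "Y Dugova 1"]
    dist = np.array([  # Table 4, with zeros on the diagonal
        [0.00, 0.14, 0.12, 0.16, 0.17, 0.15, 0.28],
        [0.14, 0.00, 0.28, 0.26, 0.26, 0.17, 0.08],
        [0.12, 0.28, 0.00, 0.28, 0.32, 0.28, 0.33],
        [0.16, 0.26, 0.28, 0.00, 0.21, 0.52, 0.47],
        [0.17, 0.26, 0.32, 0.21, 0.00, 0.49, 0.49],
        [0.15, 0.17, 0.28, 0.52, 0.49, 0.00, 0.46],
        [0.28, 0.08, 0.33, 0.47, 0.49, 0.46, 0.00],
    ])

    merges = linkage(squareform(dist), method="average")   # UPGMA fusions
    for label, group in zip(labels, fcluster(merges, t=0.3, criterion="distance")):
        print(label, group)                                 # flat groups below the cut-off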

Usually, to further increase the reliability of the clustering dendrograms, bootstrapping is employed. Bootstrapping is the process of averaging a series of clustering runs over the same matrix, leaving out a random selection of samples at each run. The process of bootstrapping increases the reliability of the resulting dendrogram because in one single clustering execution small irregularities can occur due to the close vicinity of some pairs to others. Bootstrapping will not have a central focus in this research, but will be applied once for the sake of testing the effectiveness of clustering using one run.

For the resulting dendrogram an arbitrary threshold can be decided upon by the researcher, which leads to a set of cognate dialect groups. It is interesting to see whether these groups compare well to the groups preclassified by Mennecier — i.e. whether the software produces the same results as a linguist in the field would expect. In other research the resulting subgroups are often used for mapping cognate dialects, even in combination with results from MDS analysis, but because of the low number of respondents and the fact that multiple respondents from the same area speak dialects of different languages, such maps provide next to no relevant information for this region.


0                         JOIN(A;B) = 1, at 0.5
     A     B     C     D     E
A
B    1
C    4     3
D    3     2     5
E    10    8     7     6

1                         JOIN(AB;D) = 2, at 1.25
     1     C     D     E
1
C    3.5
D    2.5   5
E    9     7     6

2                         JOIN(ABD;C) = 3, at 2.13
     2     C     E
2
C    4.25
E    7.5   7

3                         JOIN(ABDC;E), at 3.63
     3     E
3
E    7.25

Table 5: Example of clustering using UPGMA. At each iteration find the closest pair. The joining point is half their value. For the other cells take the mathematical average of their values for the two joined concepts. Iterate the process until the last join is made.



3.1.6 Multidimensional Scaling

Like hierarchical agglomerative clustering, multidimensional scaling (MDS) hails from the field of psychometrics [19]. It is a way to reduce a matrix of any given size to a matrix of reduced size (in our case of either two or three dimensions) containing a set of coordinates for each respondent. These coordinates relate to each other in such a way that the actual distances in the original matrix are approximated as closely as possible.
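A sketch of this reduction using the metric MDS implementation in scikit-learn (an assumption for illustration; the thesis itself relies on the L04 tooling):

    import numpy as np
    from sklearn.manifold import MDS
    from scipy.spatial.distance import pdist, squareform

    def mds_embedding(dist, n_dims=3, seed=0):
        """Embed a symmetric respondent x respondent distance matrix into
        n_dims coordinates whose mutual distances approximate the originals,
        and report the correlation between original and embedded distances."""
        dist = np.asarray(dist, dtype=float)
        mds = MDS(n_components=n_dims, dissimilarity="precomputed",
                  random_state=seed)
        coords = mds.fit_transform(dist)
        r = np.corrcoef(squareform(dist), pdist(coords))[0, 1]
        return coords, r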

For this thesis both two- and three-dimensional scaling are applied. By means of MDS analysis, scatterplots can be produced, giving visual insight into the way the individual respondents are positioned in relation to one another. Three-dimensional scaling also provides the opportunity to express coordinate positions as RGB values. This can be useful for producing dialect maps for a region with a good respondent distribution. When analysing the different language groups in this thesis separately, mapping the MDS analysis is slightly more relevant than mapping the clustering. In some research it can even be used to correlate the positioning in MDS with real-world coordinates, but due to the imbalance in distribution this approach is not considered useful here.

Besides the standard MDS projection there are other ways of representing averaged edit-distances in a low-dimensional space. Two MDS variants also used in the Heeringa thesis are Kruskal and Sammon mapping [3]. These have also been compared in terms of usefulness and insightfulness for this research, but will not be reported in this thesis because the results cannot be considered relevant to the hypotheses posed.

For greater detail, the language groups are analysed both separately and as a full group, using MDS. Reducing the number of respondents to one single language group allows the MDS analysis to use the available space more fully. Therefore fewer extremes exist and a more nuanced picture emerges in the visualised results.

Besides splitting up the language groups before analysis, we also tried using the subsets of Swadesh lists as separate input. This analysis will receive detailed attention in the next subsection on results and interpretation.

For purposes of presentation and visualisation, in both this thesis and possible future research, interactive software was developed to present three-dimensional scaling in more detail. Figures used in this thesis are produced by means of this software and the software is included on the CD.²

² Located in the root is the installer for this program (mds-vis): mds-vis-installer.exe, which runs on all recent Windows platforms and contains the source code.


3.2 Application of Levenshtein Distance Analysis

This section describes the application and results of the traditional edit-distance analysis as described in Heeringa's thesis. First the generation of respondent×respondent matrices will be described, followed by clustering results and MDS analysis. The results are tested against Mennecier's preclassification. Lastly, comparisons are made between the full wordlist and the reduced wordlists based on the Swadesh-100 and Swadesh-200 lists.

3.2.1 Matrix Generation

Matrices for edit-distance were generated not only for the entire dataset, but also for the following subsets. Per language group: Turkic and Indo-Iranian. There were only three Yagnobi sites totalling six respondents, so analysing this group separately from the Tajik group (the other Indo-Iranian group) would be meaningless. Per wordlist: the subset of the total wordlist containing only Swadesh-100 words and the subset containing only Swadesh-200 words. For a listing of which words are in the full list and the Swadesh subsets, please refer to Appendix A.1.

For all these subsets Cronbach's alpha was calculated to assess consistency within the data. The greater the number of words tested, the greater the internal consistency. Because the languages are more diverse than those in most research, it was expected that the resulting scores would be relatively low. The contrary proved to be the case, as can be seen in Table 6: all scores for Cronbach's alpha show good internal consistency. Scores for the Indo-Iranian (Tajik and Yagnobi) group are lower, as can be expected due to the fact that it is a mix of two languages, with less internal consistency than the Turkic group.

To test whether there is a significant difference between the subsets of wordlists, Pearson's product-moment test was applied. The results can be read in Table 7.

Group             All (176)   Sw200 (163)   Sw100 (86)   N
All Respondents   0.993       0.992         0.988        78
Turkic            0.986       0.984         0.975        39
Indo-Iranian      0.961       0.952         0.921        39

Table 6: Cronbach's alpha scores for all wordlists.


List A        List B        Pearson Score
Swadesh-100   Swadesh-200   0.997
All Words     Swadesh-100   0.996
All Words     Swadesh-200   0.999

Table 7: Pearson product-moment r-scores for word lists.

The foremost topic of interest was whether there would be a significant difference between the Swadesh-100 and the Swadesh-200 list. Based on the figure calculated (r = 0.997) there is an almost perfect linear correlation between the two wordlists. The other two scores were computed with the entire Central Asia wordlist. It becomes evident that the Swadesh-200 subset and the full wordlist have an extremely high correlation (r = 0.999). This is to be expected because there is little difference between these two wordlists, as can be seen in Appendix A.1. It is interesting to note that even though there is no statistically significant difference between the two Swadesh lists, the score is somewhat lower. This difference is most likely due to chance. To inspect whether this is the case would involve testing correlations between a number of random wordlists, but this exceeds the scope of this research. Still, this result does call for some further inspection, and it will be interesting to see whether the difference can also be noted in the MDS analysis of the results.

One remaining factor needs to be noted. All edit-distances for this research were normalized by word length. As can be read in the Heeringa thesis, normalization was experimented with and did not enhance the analysis, but rather made it marginally less accurate. However, it does not damage the reliability of the results much, mainly because of the relative closeness in size of the pronunciation strings and the size of the dataset. Because much of the analysis for this research had been finished before this minor flaw was noted, and because we suspect that word-length normalization does marginally enhance loanword detection, it was decided not to correct it. This will be explained more elaborately in the section on loanword detection.

3.2.2 Clustering

The results of clustering the data are compared with the classifications made by Mennecier. Clustering results become insightful by means of a dendrogram. A first impression of the data is given by the simplified clustering results in Figure 2.


Figure 2: Simplification of UPGMA clustering for all respondents using the full wordlist.

Optimally, clustering results are only given after bootstrapped clustering — a process where all the groupings are calculated multiple times, leaving out a random sample, and averaged — but for a preliminary indication of the effectiveness of the approach a single run through the data is used. Most dendrograms in this thesis result from a single run, but are also compared with one bootstrapped analysis. For detailed inspection of the clustering dendrograms the reader is referred to Appendix B.1–3.

In Figure 2 four groups are separated. The top two are the Turkic language groups (Kazakh, Karakalpak, Kirgiz and Uzbek) and the bottom two are the Indo-Iranian groups (Tajik and Yagnobi). The two clusters separate at the 0.660 normalized edit-distance score, which makes it clear that the edit-distance approach has no problems in separating two totally different language groups. This is no surprise, because there is little in common between the pronunciations in the two language groups. The Uzbek language is separated from the other Turkic languages at the 0.416 score. Yagnobi and Tajik also separate at the 0.416 score. These are the most detailed effective separations which can be made by means of clustering. It is not possible to make a clean cut between the remaining Turkic languages Kazakh, Karakalpak and Kirgiz, as can be seen in Figure 3.

In Figure 3 all the black areas are the cases in which respondents from the same region end up in the same cluster. This is not always the case: for instance the four respondents from Kokdaria seem not to be as closely related as the respondents from the other regions. It will be interesting to test whether this inconsistency persists after bootstrapping. What is even more notable is the fact that the three respondents from Hitoj, which were classified by Mennecier as Uzbek speakers, end up as a clean cluster within the Kazakh, Karakalpak and Kirgiz cluster.


Figure 3: Simplification of UPGMA clustering for Turkic-speaking respondents using the full wordlist. In black areas all respondents from the same region are true neighbours in a cluster. (U: Uzbek; KK: Karakalpak; Ka: Kazakh; Ki: Kirgiz)

We suspect that this is most likely due to Mennecier's misclassification, but at the very least it is worth closer inspection in the MDS analysis. All other Uzbek respondents are clearly separated from the rest of the Turkic speakers. Yet, in this group the proximity between the respondents' pronunciations is probably too close to make clearly separated clusters.

Figure 4 shows simplified results for the Indo-Iranian languages (Tajik and Yagnobi). Here the respondents from the same region show somewhat more internal cohesion. There seems to be an equal mix between respondents from Nimich and Nushor, indicating a degree of similarity between the two sets of respondents. This makes perfect sense, because the two sites are spatially close: both lie at 39.11 decimal latitude and between 70.6 and 70.8 longitude, which amounts to a distance of around 13 kilometres, while most distances between nearest sites range upward from 30 kilometres.

In Appendix B three dendrograms are provided: one for the full set of words (B.1), one for the Swadesh-100 list (B.2) and one for the Swadesh-200 list (B.3). Upon close inspection the clustering results remain mostly similar, especially of course for the Swadesh-200 and full lists.


Figure 4: Simplification of UPGMA clustering for Indo-Iranian-speaking respondents using the full wordlist. In black areas all respondents from the same region are true neighbours in a cluster. (Y: Yagnobi; T: Tajik)

This confirms the high degree of Pearson correlation. High internal cohesion between respondents from the same region remains intact, while the most notable difference is the shift in ordering within the Tajik cluster, where the respondents from Shink and Urmetan swap positions with the respondents from Rishtan and Novmetan. From this it can be deduced that the internal cohesion of these two site-groups is higher, but without detailed inspection at the word level probably not much else.

Appendix B.4 shows the clustering results after bootstrapping for all words.³ The numbers indicate the percentage of times that a node coincides per run.

³ A more recent dataset was used to generate these results.

Cluster                    One Run   Bootstrapped
Top Node                   0.66      0.47
Karakalpak/Kazakh/Kirgiz   0.35      0.25
Uzbek                      0.23      0.17
Tajik                      0.27      0.21
Yagnobi                    0.22      0.17

Table 8: Top node of the four main clusters, before and after bootstrapping.


All main dialect groups (considering Karakalpak/Kazakh as one group and disregarding the Hitoj samples) form the same cluster in 100% of the runs. The results show some small changes in the ordering, which indicates that one run is probably a good indication of the results for this dataset. The example of Kokdaria mentioned previously shows no change. What does change quite dramatically are the edit-distance scores for the top nodes of the clusters. The top nodes of all four main clusters show a much lower joining point than after the single run with all samples, as can be seen in Table 8. Bootstrapping multiple clustering results proves that a single run can be quite misleading, even if the placement of samples and clusters remains mainly the same.

3.2.3 Multidimensional Scaling

When using a two- or three-dimensional representation to approximate an N×N matrix, correlations are calculated to show how well the distances in the reduced representation match the original distances. The closer the distances between samples in the MDS projection are to the distances in the N×N matrix, the higher the R-score. Even though the R-scores for the two-dimensional analysis showed good representability, in this thesis only three-dimensional MDS will be considered, because the approximation of the distances comes closer due to the extra dimension available. Two-dimensional views of the three-dimensional analysis printed in this thesis can hardly be distinguished from views of the two-dimensional analysis, so this provides no obstacle. Table 9 shows that the R-score is very high even for a reduction to one dimension.

Table 10 shows the R-scores for all words and for the Swadesh lists. The first column shows these scores for all respondents; the second and third show the results when considering only the Turkic samples or only the Indo-Iranian samples. Even after reducing the number of samples the MDS analysis proves highly representative.

Dim.   r
1      0.954
2      0.962
3      0.984

Table 9: Correlation coefficients between original distances and distances in 1-3 dimensions of the MDS solutions for all words and all respondents.


Figure 5 shows the scatterplot representation of the MDS analysis for all words and all respondents. The datapoints have been classed according to the preclassification made by Mennecier, as can be read in the legend. In this view it becomes clear that all respondents cluster well in their own group. The exceptions are three respondents classified as Uzbek. These three are the respondents from the Hitoj site mentioned in the previous section. Appendix C.1 shows the scatterplots for only the Turkic respondents and only the Indo-Iranian respondents separately. In the first plot the Hitoj exception can be seen more clearly, as the diamonds on the left-hand side indicate.

Although it is amply clear from Figure 5 that MDS proves an effective means of separating the different subgroups, it can also be proved statistically. Based on the preclassification by Mennecier we can define which pronunciations belong to which language group and dialect. The spatial center of all samples belonging to one group can thus be computed through averaging, based on their placement in the MDS plot.

Consider Table 11. Here the observed Euclidean distance from the datapoints to their cluster center is compared to an expected distance, which is the mean of the distances of all datapoints to each other. The null hypothesis here is that MDS is not effective: the mean distance of all datapoints to their own cluster center is not significantly different from the mean distance of all datapoints to all other datapoints. This test may seem superfluous given the visual representation, but for the sake of statistical proof the p-score is given here: 2×10⁻¹⁶. This clearly shows that edit-distance analysis combined with MDS representation is quite capable of separating language groups and dialects.
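The comparison behind Table 11 can be sketched as follows, assuming coords holds the three-dimensional MDS coordinates and labels holds Mennecier's group per respondent (hypothetical variable names):

    import numpy as np

    def center_distances(coords, labels):
        """Mean Euclidean distance of each group's members to their own
        cluster center, plus the overall mean pairwise distance that serves
        as the 'expected' reference value."""
        coords = np.asarray(coords, dtype=float)
        observed = {}
        for group in sorted(set(labels)):
            members = coords[[i for i, g in enumerate(labels) if g == group]]
            center = members.mean(axis=0)
            observed[group] = np.linalg.norm(members - center, axis=1).mean()
        pairwise = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        expected = pairwise[np.triu_indices_from(pairwise, k=1)].mean()
        return observed, expected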

Word Subset   All Respondents   Turkic   Indo-Iranian
All Words     0.984             0.981    0.967
Swadesh-200   0.982             0.981    0.967
Swadesh-100   0.987             0.981    0.968

Table 10: Correlation coefficients for MDS analysis in 3 dimensions. The Indo-Iranian group is somewhat more difficult to approximate, probably because Tajik and Yagnobi show more complex differences in pronunciation.


Figure 5: MDS analysis for all respondents and all words, represented as a scatterplot.

For i and j in different language families:
H0: D(sample_i, sample_j) ≈ D(sample_i, clusterCenter_i)
Ha: D(sample_i, sample_j) ≫ D(sample_i, clusterCenter_i)

Cluster            Distance   Expected
Karakalpak         0.019      0.389
Kazakh             0.019      0.389
Kirgiz             0.039      0.389
Tajik              0.033      0.389
Yagnobi            0.025      0.389
Uzbek              0.127      0.389
Uzbek (no Hitoj)   0.0226     0.389

Table 11: Observed average Euclidean distances of respondents to the center of their cluster. The expected value is the average distance of all respondents to each other.


3.2.4 Differences in Swadesh Lists

The next topic of interest is the comparison between the Swadesh lists. The way we propose to analyse differences between wordsets using MDS is to compare the cluster centers (based on the preclassification by Mennecier) mentioned above. Again, Euclidean distance in the MDS plot is employed to quantify these differences. In the previous paragraph the distances between individual samples and their respective cluster centers were used to prove the effectiveness of the software in identifying dialects and language groups in the data. Here, two different MDS plots are compared to each other. Three of these comparisons are made: the MDS plot for the full wordlist with the plot for the Swadesh-100 list, the Swadesh-200 plot with the Swadesh-100 plot, and the Swadesh-200 plot with the full wordlist plot.

In van der Ark et al. (2007) we superimposed one MDS plot over another to visualise the differences between word lists as described above. This was incorrect. The present section explains why the procedure is incorrect. Correct comparisons need to be based on distances between respondents within the MDS, and not on the concrete assignment of the coordinates.

Figure 6 shows the scatterplot of the MDS results for all words overlaid on the results for the Swadesh-100 list.⁴ As can be seen clearly, there seems to be a shift in the placement of datapoints, indicating that reducing the number of words to the Swadesh-100 subset has an effect on the results. A cluster's center lies exactly in the middle of a dot-cloud with the same symbol. The top table in Table 12 shows the exact Euclidean distances between these cluster centers. It also shows the distances for the other two comparisons: Swadesh-100 with Swadesh-200 and all words with Swadesh-200. There is almost no shift in placement between all words and the Swadesh-200 group. This means that it is a good reference measure to test the significance of the position shift from Swadesh-200 to Swadesh-100. To compute the significance of this shift, the mean and standard deviation of all its clusters' shifts are compared to the mean and deviation of the shifts between all words and Swadesh-200. The resulting significance score is also shown in the top table in Table 12.

The observations in the above paragraph seem to be a good indication that reducing the number of words to the Swadesh-100 list has a potentially significant effect on the MDS results.

⁴ The difference between all words and the Swadesh-200 list is negligible, so a comparison between all words and the Swadesh-100 group shows nearly the same results.


Figure 6: Scatterplot of all words overlaying the Swadesh-100 scatterplot. The grey datapoints are the scatterplot for all words.

However, as was observed in the first paragraph of this section, the Pearson scores gave almost no indication that there would be any significant changes due to wordlist reduction. Therefore the analysis of differences in spatial placement in MDS plots deserves further scrutiny. For even though the Pearson score based on the edit-distances indicated a weaker linear correlation between the Swadesh-100 group and the other two groups, it does not seem to explain the size of the shift in the datapoints. After all, the correlation, although lower, was still very high.

Upon closer inspection of the distances in the first and second columns of the top table in Table 12, one notices that the Karakalpak group hardly shifts at all. In the subsequent groups the shift has a tendency to rise. This leads us to suspect that the shift in placement might be due to the order in which the MDS analysis computes the placement of samples in space. And when one takes a look at the way the input data are ordered, this suspicion becomes stronger. The input data are ordered as follows: first all Karakalpak respondents, then Kazakh, Kirgiz, Tajik, Uzbek and lastly Yagnobi. The first column in Appendix A.2 shows the ordering of all the samples.


Cluster      Sw-100:Sw-200   All:Sw-100   All:Sw-200
Karakalpak   0.024           0.035        0.020
Kazakh       0.219           0.216        0.022
Kirgiz       0.618           0.590        0.037
Uzbek        0.617           0.605        0.029
Tajik        0.599           0.582        0.017
Yagnobi      0.412           0.396        0.054
Mean         0.415           0.404        0.030
P(S1:S2, All:S2) = 0.01252

             Original     Randomized
Cluster      All:Sw-100   All:Sw-100
Karakalpak   0.035        0.018
Kazakh       0.216        0.017
Kirgiz       0.590        0.028
Uzbek        0.605        0.018
Tajik        0.582        0.015
Yagnobi      0.396        0.038
Mean         0.404        0.022

Table 12: Top table: Euclidean distances of cluster centers for word subsets without randomization of samples. Here the average distance between the subset for all words and Swadesh-200 appears significantly smaller than between Swadesh-100 and either wordlist. Bottom table: distances from all words to Swadesh-100 before and after randomization. Here it becomes clear that the shift that appeared significant disappears.

It could be the case that the increasing size of the shifts is a result of the fact that the datapoints in the MDS algorithm are calculated in this order during its first iteration, before their placement is refined. If this is the case, the cluster centers shift more due to chance than because of an actual effect of word reduction.

To test this suspicion another MDS run has been done for both the full word set and the Swadesh-100 word set. This time the respondents' samples were ordered randomly. Column 2 in Appendix A.2 shows the random ordering of input samples.


Appendix C.3 shows the MDS plots for the full wordlist and the Swadesh-100 list using the randomized input data. Upon visual inspection one can still detect certain shifts between the individual datapoints. However, when we make the same comparison as above — namely the Euclidean distance between the placement of the cluster centers in the full wordlist plot and their placement in the Swadesh-100 plot — the shifts are a lot smaller (bottom table in Table 12). Furthermore, all the shifts are now comparable in size and the Karakalpak group's shift is no longer smaller. The bottom table in Table 12 shows the Euclidean distances between the two word sets with the original ordering in the first column and with the random ordering in the second column. These results strongly indicate that reduction of the wordlist to the Swadesh-100 list does not impact the MDS analysis. It shows instead that the differences in placement are mainly due to chance.

Considering the Pearson tests, the clustering results and the results for MDS, it can be concluded that there is no conclusive indication that reducing the wordlist to only the Swadesh-100 subset has a statistically significant impact on the results of the Levenshtein analysis for the Central Asia dataset.⁵

⁵ Comparing distances in placement between datapoints might be rather meaningless because these should reflect the distances in the original 78×78 matrix.


4 Exploring Automatic Loanword-Detection

4.1 Method Description

In the previous section we described how comparing the different Swadesh lists indicated that a reduced word set differentiated the results more finely, though not quite significantly. Based on this increase in the average distance between the centers of language groups it can be conjectured that there is a certain amount of either borrowing or cognacy among a number of words listed in the extended wordlist. Interestingly, this assumption can be further explored by using the preclassification of the individual pronunciations by Mennecier, as explained in the data description.

Next, the approach to identifying loanwords automatically from the data is described; it is a somewhat novel use of precision/recall analysis.

4.1.1 General Description of Approach

The approach to analysing the degree of pronunciation transfer consists broadly of three steps. The first step is to use the plain Levenshtein algorithm, comparing all pronunciation pairs per respondent at the word level. The second step is to filter out all the pairs from the same language group. The third step is an 11-point precision/recall analysis [1] using the preclassification of cognate or borrowed words. This process will be dealt with in more detail shortly.

Another promising approach to understanding the full set of individual distances between pronunciation pairs is to view their full distribution as a mix of two distributions. One of these distributions is conjectured to be the set of pairs so close in their pronunciation that they can be considered loans, which appear practically identical across languages so that their mutual distances are close to zero. The other is the set of words which are assumed to be unrelated, and this set appears to have the shape of a normal distribution. This approach will be described in detail in the next section.

4.1.2 Applying Precision / Recall to Pairwise Levenshtein Results

As described in the data description, preclassifications were made for each pronunciation separately, allotting it to a group of cognates or loans.


For each word, all the pronunciations receive a letter, which is the same for all the pronunciations which Mennecier deems alike enough to be either cognates or loans.

This allows us to compare the results of a pairwise Levenshtein analysis at word level with the estimate of an experienced linguist in the field. As mentioned in the previous section, the edit-distances were normalized by word length.

4.1.3 Paired Edit-Distances Versus a Generalized Matrix

In the previous section the more traditional use of the L04 software was described to generate a 78×78 matrix for all respondents. The same process is applied here, except for the last step of aggregation. Whereas in the first half of the research all the edit-distances between two respondents are averaged across the entire range of words used, to provide an average dialect distance between two respondents, here the focus is on the edit-distance for each word separately.

Although normalization by word length is of no added value in the highly aggregated statistics employed in the first half of the research, it can be useful when analysing loanwords by considering single respondent pairs separately. Logically, if the pronunciations of a word consist of a longer string of tokens, there is a greater chance that more edit operations are necessary to transform one pronunciation into the other, producing a skew in the data: shorter words have shorter edit-distances. Therefore, weighting the edit-distance score by the inverse of the length of the pronunciations for each pair now has the effect of correcting this skew.
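A tiny worked example of this skew, reusing the levenshtein sketch from Section 3.1.2 and the same normalisation assumption stated there (division by the length of the longer pronunciation); both pairs are taken from the tables above ('one' from Table 2, 'round' from Table 3):

    for a, b in [("bIr", "jak"), ("ajlana", "l0nda")]:
        raw = levenshtein(a, b)
        print(a, b, raw, round(raw / max(len(a), len(b)), 3))
    # bIr/jak:      raw 3, normalised 1.0   (short but completely different)
    # ajlana/l0nda: raw 4, normalised 0.667 (longer, yet proportionally closer)

The raw score would rank the unrelated short pair as closer than the longer pair; the length-weighted score reverses that ranking.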

Leaving out the final step of aggregation generates a list of pronunciation-distance pairs for each word, per respondent pair. For 78 respondents and 176 words this generates a list of over 500,000 pairs, even ignoring all cases of respondents for whom no pronunciation was recorded for a certain word.

For the next step further filtering of the pairs is required. This is, however, best explained with a deeper understanding of how precision/recall analysis applies to this research, which we now describe.


4.1.4 Generating Precision / Recall Figures Based on Preclassification

Precision/recall analysis (P/R) is a technique most often used in information science to determine the effectiveness of the retrieval of documents relevant to a given query. In this case precision is the percentage of retrieved documents that are relevant at a certain percentage of recall, computed by dividing the number of retrieved documents relevant to the query by the total number of documents retrieved at this stage. Recall, in turn, is the fraction of all relevant documents that has been retrieved at this stage. This results in a visual representation of the effectiveness of the retrieval system [1]. Documents which are relevant need to be preclassified manually.

The same concept can be applied when trying to determine which words are loans in the phonetic data for this research, because, as mentioned, the pronunciations have been preclassified by a linguist as belonging to a group. In short, our P/R analysis tests whether Levenshtein analysis effectively 'retrieves' the pronunciation pairs considered 'relevant', where this can be defined as being preclassified by Mennecier as belonging to the same group. So all pairs which were preclassified as belonging to the same group make up the one hundred percent recall point. For every 10th percentile of recall, precision is calculated: the percentage of pairs considered so far which were also preclassified with the same letter (effectively belonging to the same pronunciation group).

The first step in the detection of loanwords is to filter out irrelevant pairs. If the respondents of the two pronunciations were preclassified as speakers of the same language group, it is not reasonable to assume that loans within that language group can be detected by means of calculating edit-distance, because cognate words are so abundant that the scores will only increase gradually from zero upward, whereas for two different language groups we expect a clean break between very low scores (the conjectured loans) and higher scores. Therefore, the remaining pairs of pronunciations are all the pairs consisting of an Indo-Iranian speaker and a Turkic speaker.

For the application of P/R analysis in this approach the list of pairs needs to be sorted by edit-distance score, starting at the lowest. Formally one can then define the terms precision and recall as follows.


One hundred percent recall is reached the moment the total number of pronunciation-pairs which were preclassified as belonging to the same group has been retrieved. Because the list is sorted by edit-distance, this stage is reached after searching only a fraction of the full list. Precision at a certain percentage of recall is the number of pairs preclassified as 'same', divided by the total number of pairs retrieved. For this research the standard 11-point recall analysis is adopted[1].

For further clarification a simple description of the process follows. Put simply, a program iterates through the list twice, from edit-distance zero to the highest. The first loop counts the full number of 'relevant' pairs (those preclassified as belonging to the same pronunciation-group); this determines the stage of total recall. The second loop iterates through the pairs with one counter keeping track of relevant pairs and one counter keeping track of all pairs. Each time a tenth percentile of recall is reached (determined by dividing total recall by 10), precision is computed by dividing the number of relevant pairs retrieved by the full set of pairs retrieved at this stage. The second loop terminates at the point of full recall, where the precision for the remainder is computed. The resulting set of figures is displayed as a dropping curve of precision.
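
The following Python sketch illustrates the same two-pass procedure, also keeping track of the average edit-distance used in the next subsection. The (distance, same_group) pair format and the variable names are assumptions for the example, not the actual format of the thesis software.

```python
def precision_recall_curve(pairs):
    """pairs: iterable of (edit_distance, same_group) tuples, where same_group
    is True when both pronunciations were preclassified as one group.
    Returns 11 (recall, precision, avg_edit_distance) points at recall
    0.0, 0.1, ..., 1.0."""
    pairs = sorted(pairs, key=lambda p: p[0])             # lowest distances first
    total_relevant = sum(1 for _, same in pairs if same)  # pass 1: point of full recall
    marks = [round(total_relevant * k / 10) for k in range(11)]
    points, relevant, retrieved, dist_sum, mark = [], 0, 0, 0.0, 0
    for dist, same in pairs:                              # pass 2
        retrieved += 1
        dist_sum += dist
        relevant += same
        while mark < 11 and relevant >= marks[mark]:
            points.append((mark / 10,                     # recall decile
                           relevant / retrieved,          # precision so far
                           dist_sum / retrieved))         # average edit-distance so far
            mark += 1
        if mark == 11:                                    # full recall reached
            break
    return points
```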

4.1.5 Conjecture of Borrowing-Threshold Based on Precision / Recall

The P/R-figure resulting from the process described above does not tell us anything relevant about loanwords in the data when viewed out of context. The information still missing at this stage is the average edit-distance score at each stage of recall. This is why the analysis also keeps track of this average, which can be overlaid in the figure as a rising curve.

In this way the interpretation of the P/R-analysis becomes meaningful and certain conclusions can be drawn. One can now determine both the average edit-distance and the precision at each percentile of recall. This way the preclassification factor and the edit-distance factor can be combined to estimate the edit-distance threshold below which two pronunciations can be considered each other's loans. Of course, without background knowledge nothing can be said about the directionality of the loan based on this information.
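
Building on the sketch above, the threshold conjecture could then be expressed as follows; the cut-off of 0.97 for 'almost perfect' precision is an assumption chosen purely for illustration.

```python
def conjecture_threshold(points, min_precision=0.97):
    """points: the (recall, precision, avg_distance) list returned by
    precision_recall_curve().  Returns the average edit-distance at the last
    recall decile where precision is still above min_precision; pairs below
    this distance are conjectured to be loans."""
    candidates = [avg for _, precision, avg in points if precision >= min_precision]
    return max(candidates) if candidates else None
```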

Now that a threshold is hypothesised, the data can be considered at word level.


The number of times each semantic concept is loaned — based on the conjectured 'borrowing-threshold' — can be counted. The degree of borrowing between two respondents can also be counted and aggregated as a percentage per region.
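
As a sketch, counting loaned concepts under a given threshold could look like this in Python. The pair_distances mapping (an Indo-Iranian respondent, a Turkic respondent and a word, mapped to a normalized edit-distance) is an assumed data structure for the example.

```python
from collections import Counter

def loans_per_concept(pair_distances, threshold=0.02):
    """pair_distances: dict mapping (indo_iranian_resp, turkic_resp, word) to a
    normalized edit-distance.  Counts, per semantic concept, how many
    cross-group respondent pairs fall below the borrowing threshold."""
    counts = Counter()
    for (_, _, word), dist in pair_distances.items():
        if dist < threshold:
            counts[word] += 1
    return counts.most_common()
```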

Another interesting question is to what degree the concepts being loaned are part of the different word-subsets. It can be hypothesized that the degree of borrowing for the reduced Swadesh-100 list is significantly lower than for the extended Swadesh list and the full list of tested concepts.


4.2 Application of Loanword Detection

In the P/R analysis, about half of the full set of pronunciation-pairs is considered: the roughly 250,000 pairs involving an Indo-Iranian speaker and a Turkic speaker. Of these pairs, one hundred percent recall is reached at about 37,000 records, at which point the last pair preclassified as belonging to the same cognate group is found. The precision and recall figures in Figure 7, considered in the next paragraph, are therefore based on about 15% of the records.

Figure 7 shows the results of the P/R analysis visually and Table 13 shows the corresponding numbers exactly. Initially (up to the third tenth-percentile of recall) precision is almost perfect, meaning that all the pairs up to this point were preclassified by Mennecier as belonging to the same cognate group. This makes sense, because the P/R analysis starts by considering the very lowest (zero-score) edit-distances. Table 13 shows that up to this point the average edit-distances are close to zero. After the fifth tenth-percentile the precision score starts dropping more dramatically, which happens at an average edit-distance of almost 0.06.

Based on these figures we can now conjecture up to which score we consider edit-distances low enough to count a pronunciation pair as a loan.6 Two thresholds were chosen. The first is the 0.06 normalized edit-distance, based on a precision score of 0.977 at the fifth tenth-percentile, just before the precision starts dropping. The second is the 0.02 score, at the third tenth-percentile, up to which point precision is almost perfect. For this second threshold it can safely be stated that all pairs below it can be considered loans.

Having been defined, these thresholds will be used for the further statistical analysis of the data; no more runs of the P/R-algorithm are necessary. It needs to be mentioned that attempts have been made at P/R-analysis within the Turkic and Indo-Iranian language groups themselves but, as explained above, cognates are so abundant there that no clear definition of a loan emerges.

6Bear in mind that for this thesis the word 'loan' can be interpreted in the broadest sense of the word. Formally: two pronunciations which are so close together that they must be either borrowed or recent cognates. It can also be true that both languages borrowed the same word from a third language.


Figure 7: Averaged 11-point recall/precision curve for all pairs, combined with the average edit-distance score per tenth percentile of recall. Vertical lines indicate the thresholds initially chosen based on precision and average edit-distance.

Recall   Precision   Avg. edit-distance
0.0      0.997       0
0.1      0.997       0
0.2      0.998       0
0.3      0.988       0.020 *
0.4      0.981       0.042
0.5      0.977       0.057 *
0.6      0.940       0.075
0.7      0.890       0.102
0.8      0.754       0.147
0.9      0.500       0.203
1.0      0.134       0.349

Table 13: Exact figures for the recall/precision analysis. Rows marked with * correspond to the thresholds initially chosen based on precision and average edit-distance.


Figure 8: Histograms showing the distribution of edit-distances for respondent-pairs of a different language/dialect within Turkic on the left, and for respondent-pairs of a different language group (one Turkic, the other Indo-Iranian) on the right.

To illustrate this fact consider the two histograms in Figure 8, which show the distribution of edit-distances. The left histogram shows edit-distances for all respondent-pairs from a different dialect/language within the Turkic language group; the right histogram shows edit-distances for all respondent-pairs from a different language group (one Indo-Iranian, the other Turkic). There is a clear difference in distribution: whereas the right histogram shows a clean break between low scores and normally distributed higher scores, the left histogram shows a gradual drop. The distribution for different speakers within the Indo-Iranian group is very similar to the distribution for Turkic speakers shown in Figure 8.

Based on the thresholds of 0.06 and 0.02 all the pronunciation-pairs which can be considered loans can now be counted and compared with Mennecier's classification. Table 14 shows the percentage of borrowing between languages for these two thresholds and for the preclassification. On the horizontal axes are the Turkic speakers and on the vertical axes the Indo-Iranian speakers. This representation is a reduction of a matrix showing all the respondents separately with absolute numbers: every pronunciation-pair with an edit-distance below the defined threshold is counted as one loan. The percentage shown is the number of times a word is a loan divided by the full number of words considered.


Preclassification   Karakalpak   Kazakh   Kirgiz   Uzbek
Tajik                   10.32%    6.24%    9.42%   13.98%
Yagnobi                  7.15%    4.24%    7.89%   10.38%

0.06                Karakalpak   Kazakh   Kirgiz   Uzbek
Tajik                    2.62%    1.65%    4.86%    9.30%
Yagnobi                  2.39%    1.60%    3.64%    6.97%

0.02                Karakalpak   Kazakh   Kirgiz   Uzbek
Tajik                    0.78%    0.68%    2.58%    5.71%
Yagnobi                  0.89%    0.75%    1.65%    3.91%

Table 14: Degree of loanwords for the full wordlist, based on Mennecier's preclassification alone and on edit-distance thresholds of 0.06 and 0.02 respectively. As can be seen, not only are the percentages considerably lower in the latter two cases, the proportions between the percentages also change. For a visual interpretation of these figures please refer to Appendix D.

For this representation the full wordlist is used, so the total number of words considered is 176. The percentage is first computed for each respondent-pair separately and then averaged over all speakers of the same language. For an impression of the original matrix — which is far too large to print in full — see Table 15.
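
A sketch of this reduction, reusing the pair_distances mapping assumed earlier, might look as follows; the averaging per language follows the description above.

```python
from collections import defaultdict

def borrowing_percentages(pair_distances, language_of, threshold=0.02, n_words=176):
    """Counts, per respondent pair, how many of the n_words fall below the
    loan threshold, and averages the resulting percentages over all pairs of
    speakers of the same two languages (cf. Table 14)."""
    per_pair = defaultdict(int)
    for (resp_a, resp_b, _), dist in pair_distances.items():
        per_pair[(resp_a, resp_b)] += (dist < threshold)   # also registers pairs with 0 hits
    per_language = defaultdict(list)
    for (resp_a, resp_b), hits in per_pair.items():
        key = (language_of[resp_a], language_of[resp_b])
        per_language[key].append(100.0 * hits / n_words)
    return {key: sum(vals) / len(vals) for key, vals in per_language.items()}
```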

Some quite notable differences become apparent between the percentages based on the preclassification and the percentages based on the edit-distance thresholds. The fact that the percentages based on preclassification are larger is easily explained: they include all the loans which have a larger edit-distance and can therefore only be detected by a trained linguist, not by software using the edit-distance measure. A more disturbing fact, however, is the change in proportions between the different analyses. In the preclassification, Kazakh and Karakalpak borrow considerably more from the Indo-Iranian languages than they do based on the edit-distance thresholds. We of course assume that the preclassification by a specialised linguist is more reliable than an experimental method of loanword detection, so there must be some flaw in the analysis. It is very likely that the change in proportions is due to the fact that edit-distance measures do not take into account typological regularities such as shifts in vowel production.


                 KK Halqabad 1   Ki Ordaj 1   U Novmetan 3
T Agalik                     1            5             13
T Kaptarhona 2               3            8             16
T Nushor 2                   1            2             10
T Nushor 3                   1            3             11
Y Dugova 2                   2            2             13
Y Safedorak 1                1            3             10

Table 15: Small sample of the original borrowing-matrix with absolute counts of loans.

It is reasonable to assume that Karakalpak and Kazakh speakers produce a vowel differently from Kirgiz and Uzbek speakers. This could be a serious bias in the analysis and begs for further inspection, but that is regrettably beyond the scope of this thesis. The chance that the linguist, rather than the software, is biased is much smaller, because irregularities in the preclassification would result in a smaller change in proportions. It would, however, be too early to conclude that automatic loanword detection by means of edit-distance is not meaningful, and at the very least the bias does not affect the comparison between Swadesh lists. We therefore continue the analysis, noting that there is a good chance that the results are biased with respect to their proportions.

As becomes clear from the representation in Table 14, based on the defined thresholds there is a much higher degree of borrowing between Kirgiz/Uzbek and the Indo-Iranian group than between the other Turkic languages and the Indo-Iranian languages. It is quite reasonable to explain this higher degree of borrowing by looking at the spatial distribution of the sites: the Kirgiz and Uzbek sites (except Hitoj) are closer to Tajikistan than the Karakalpak and Kazakh sites, as can be seen in Figure 9. Still, some of the respondents were members of nomadic tribes, so this should not explain the degree of borrowing entirely. Henceforth only the 0.02 threshold will be considered in the analysis, because it can be considered reliable while still showing significant percentages.

As mentioned in previous chapters, the Swadesh lists did not show any significant statistical impact on the analysis in the first half of the research. The difference in results can also be analysed here, by filtering out edit-distance scores for words not belonging to these reduced wordlists.


Figure 9: Map of the Central Asian region. As can be seen, an attempt was made to distribute the respondents equally. Some of the Kirgiz respondents and the Turkmen respondents are not part of the data for this thesis.

Swadesh-200   Karakalpak   Kazakh   Kirgiz   Uzbek
Tajik              0.83%    0.71%    2.75%   5.72%
Yagnobi            0.95%    0.80%    1.74%   3.89%

Swadesh-100   Karakalpak   Kazakh   Kirgiz   Uzbek
Tajik              1.81%    0.81%    2.44%   6.78%
Yagnobi            1.36%    0.43%    1.90%   5.93%

Table 16: Degree of loanwords for the Swadesh wordlists, based on an edit-distance threshold of 0.02.


List A        List B        Pearson score
Swadesh-100   Swadesh-200   0.884
Swadesh-100   All Words     0.887
Swadesh-200   All Words     0.997

Table 17: Pearson correlation between the considered wordlists for borrowing-matrices of Indo-Iranian respondent × Turkic respondent.

Table 16 shows the same matrix — only using the 0.02 threshold — for the Swadesh-100 and Swadesh-200 lists separately. At first glance there seems to be some difference between the two matrices, the largest being the Uzbek-Yagnobi borrowings: 3.89% for the Swadesh-200 list rises to 5.93% for the Swadesh-100 list. This is surprising, because the intent of reducing the Swadesh list was to reduce the number of borrowed words, increasing the percentage of words unique to a language. However, these percentages are based on too little data to prove conclusively that the reduction of the Swadesh list was not useful; proving this would require repeating the analysis on several other datasets.

Here again the Pearson test can be adopted to test the correlation between results, yielding Table 17. Of course, the Pearson test was not done on the averaged percentages shown in the reduced matrices above, but on the absolute numbers per respondent-pair, in order not to reduce the reliability of the test. Correlation between all lists is still quite high, but the Pearson scores in this half of the research do show greater differences than in the first half (see Table 7 in the previous chapter).
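
A hedged sketch of this comparison, assuming two dicts of absolute loan counts per respondent pair (one per wordlist), could use scipy's pearsonr:

```python
from scipy.stats import pearsonr

def wordlist_correlation(counts_a, counts_b):
    """counts_a, counts_b: dicts mapping (indo_iranian_resp, turkic_resp) to
    the absolute number of loans found under two different wordlists.  The
    correlation is computed over the per-respondent-pair counts rather than
    over the percentages averaged per language."""
    keys = sorted(counts_a)
    x = [counts_a[k] for k in keys]
    y = [counts_b.get(k, 0) for k in keys]
    r, p_value = pearsonr(x, y)
    return r, p_value
```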

Still, this does not necessarily mean that the difference in correlation is statistically significant. It is important to bear in mind that reducing the total number of words to less than half of the full wordset has the effect of adding weight to every single word. To put it plainly, whereas the average edit-distance for one word in our subset of the Swadesh-200 list weighs as 1/163rd of the total data, one word in the Swadesh-100 list accounts for 1/86th of the data. A look at Table 18 makes this line of reasoning more insightful.7

7This view does not give insight into the source/destination of the borrowing, but all cases are — of course — between Indo-Iranian and Turkic. As mentioned earlier, it could also be that loans come from a third language, like Russian or even English.


This table lists the 20 words borrowed most often in the data. The first column is the percentage of times the word is borrowed out of all the times it is pronounced by a respondent-pair. The second column is the absolute number of times the word scores below the 0.02 normalized edit-distance (i.e. is borrowed). The fourth and fifth columns show which wordlists the word belongs to. The totals for both wordlists in this top set show roughly the same ratio as their full totals: 86 to 163 makes a ratio of 0.51 and 8 to 19 makes a ratio of 0.42. This is a good indication that about the same number of words is borrowed in both lists. Still, the two words borrowed most often by far, 'fog' and 'breathe', are not in the Swadesh-100 list. Logically, this would lead to the expectation that borrowing is percentually greater in the Swadesh-200 list, even though Table 16 shows the opposite to be true in general. This is easily explained by the fact that the full number of hits for the word 'fog' counts for less in a 1:163 ratio than the added totals of the top Swadesh-100 words 'liver', 'tree' and 'feather' do in a 3:86 ratio: the absolute number of hits in the first case makes up a smaller percentage of the full set of pronunciation-pairs under the Swadesh-200 list than the three less frequently borrowed words do under the Swadesh-100 list. Generally, the more the number of words is reduced, the more sensitive the analysis becomes to one single change. It can be concluded that the Pearson correlation between Swadesh-100 and Swadesh-200 for loanwords is not so low that it cannot be accounted for by these skewing factors.
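
A rough check with the counts in Table 18 illustrates this, assuming the 3042 possible pairs per word mentioned in the table caption and subset sizes of 163 and 86 words:

\[
\frac{1221}{163 \times 3042} \approx 0.25\%
\qquad\text{versus}\qquad
\frac{456 + 407 + 242}{86 \times 3042} \approx 0.42\%,
\]

so the single word 'fog' contributes less to the Swadesh-200 borrowing percentage than 'liver', 'tree' and 'feather' together contribute to the Swadesh-100 percentage.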

One more fact to note in Table 18 is shown in the last column, which is only meant to test the expectation, raised in the previous section, that normalization by word length corrects the skew towards short words. It shows the average string length of the pronunciations by all respondents. These pronunciations are clearly not significantly shorter than the general average of 5.07. Moreover, with an average string length of 6.27 the pronunciations for the word 'liver' (the third most borrowed word) are among the longest in the entire dataset, ranking 24th.8

Table 19 zooms in on a selection of the topmost borrowed words, showing their regional borrowing percentages. The word 'fog' is only a member of the Swadesh-200 wordlist. It is clear that almost all speakers in the region pronounce it the same.

8The longest average string length is 8.58 for 'butterfly', the shortest 3.75 for 'one'.


Percent   Total   Word      Sw-100   Sw-200   Length
40.14%     1221   fog            0        1    4.948
16.24%      494   breathe        0        1    4.807
14.99%      456   liver          1        1    6.273
13.38%      407   tree           1        1    5.935
13.02%      396   left           0        1    4.526
10.65%      324   dust           0        1    4.740
10.16%      309   wind           0        1    5.000
9.37%       285   animal         0        1    5.974
7.96%       242   feather        1        1    3.542
7.63%       232   river          0        1    5.128
7.50%       228   horn           1        0    3.753
6.84%       208   sky            0        1    4.896
5.69%       173   fruit          0        1    5.105
5.03%       153   sand           1        1    3.636
4.60%       140   leaf           1        1    5.595
4.27%       130   flower         0        1    3.038
4.21%       128   short          0        1    5.514
2.89%        88   cloud          1        1    4.230
2.66%        81   big            1        1    5.449

Totals                           8       19    (avg: 4.83; all: 5.07)

Table 18: Concepts most borrowed, based on the 0.02 threshold. The listed words are borrowed most often between the two language groups. One hundred percent would be 3042 times, in the hypothetical case that the word was produced virtually the same by all respondents. As can be seen, the ratio of words from the Swadesh-100 and Swadesh-200 groups in this top set is almost the same as in Appendix A.1. 'Length' is the average string length of the pronunciations of this word.


fog       Karakalpak   Kazakh   Kirgiz    Uzbek
Tajik         54.55%   90.91%  100.00%  100.00%
Yagnobi       32.73%   54.55%   66.67%   60.00%
Total         80.28%

liver     Karakalpak   Kazakh   Kirgiz    Uzbek
Tajik          8.82%    0.00%   24.26%   74.66%
Yagnobi        9.09%    0.00%   20.00%   76.92%
Total         29.98%

horn      Karakalpak   Kazakh   Kirgiz    Uzbek
Tajik          0.00%    0.00%    0.00%   44.80%
Yagnobi        0.00%    0.00%    0.00%   46.15%
Total         14.99%

Table 19: Degree of regional borrowing for ‘fog’, ‘liver’ and ‘horn’

Even for this top word of the list, however, the degree of borrowing is much higher between Kirgiz/Uzbek and Tajik than between Karakalpak/Kazakh and Tajik. It can be speculated that, because this concept describes a natural phenomenon and some of the speakers are members of nomadic tribes, this is not a modern loan but reflects a shared history among the speakers; without proper background information this, of course, remains speculative. The word 'liver' is the top word that is a member of both Swadesh lists. Like 'fog' — and all other borrowed concepts, for that matter — it is mostly shared between Kirgiz/Uzbek and the Indo-Iranian languages; for Kazakh, the percentage of borrowing is zero. 'Horn' is only a member of the Swadesh-100 list and is only shared between Uzbek and the Indo-Iranian languages. When zooming in on all the other words it becomes clear that only a few words are actually shared between the western Turkic (Kazakh and Karakalpak) respondents and the Indo-Iranian respondents, and that almost all borrowing in this region is between the more eastern Turkic languages (especially Uzbek and, to a lesser degree, Kirgiz) and the Indo-Iranian languages (more Tajik than Yagnobi).


5 Mix of Distributions Analysis

5.1 Method Description

The full set of pronunciation-pairs can also be viewed as a mix of two subsets: the pairs with respondents from the same language group and the pairs with respondents from different language groups. The challenge is to approximate the distribution curves for these two subsets and a generalized curve for the entire set, based on the distributions of the two subsets.

To achieve this, the 'mixdist' package for R can be used, which was written by Peter MacDonald (McMaster University, Ontario)[7]. The package approximates normal curves for both subsets based on their general distribution, mean and standard deviation. To this end the expectation maximization (EM) algorithm, often used in machine learning, is employed[5]. Before employing the package, the two subsets need to be tested for their statistical distinctness by means of the t-test.
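
The thesis uses the R package itself; purely as an illustration of the same idea, a Python sketch with a two-sample t-test and a two-component Gaussian mixture (scipy and scikit-learn) might look as follows. The array names are assumptions for the example.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.mixture import GaussianMixture

def mixture_analysis(same_group_dists, diff_group_dists):
    """same_group_dists / diff_group_dists: 1-D arrays of normalized
    edit-distances for pairs within one language group and across groups.
    First tests whether the two sets are statistically distinct, then fits a
    two-component Gaussian mixture (via EM) to the pooled distances."""
    _, p_value = ttest_ind(same_group_dists, diff_group_dists)
    pooled = np.concatenate([same_group_dists, diff_group_dists]).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(pooled)
    means = gmm.means_.ravel()
    stdevs = np.sqrt(gmm.covariances_.ravel())
    return p_value, gmm.weights_, means, stdevs
```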

5.2 Results of Mix of Distributions Analysis

With the two-sample t-test between the 'same' and 'different' sets resulting in a p-value of 2.2 × 10^-16, the two sets prove amply distinct to analyse as a mix of distributions. To reinforce this claim, consider the normal QQ-plot for the full set of pronunciation-pairs in Figure 10. The large chunk of zero scores belongs mostly to the pairs preclassified as belonging to the 'same' group, while the greater part of the 'different' group gathers around the normal line. This indicates that the full set is a mix of a uniform zero distribution and a normal distribution.

The analysis available through the 'mixdist' package does not, however, provide a means to mix these two types of distributions, as it is specifically designed for testing a mix of Gaussian distributions. While this is problematic and will generate results which are not exactly as desired, it still provides some extra insight into the data[20]. The results are presented in Figure 11. The histogram shown in this figure represents the probabilities for the histogram shown in Figure 8 in the previous section. It shows a drop in frequency after the (near-zero) region where we expect to find the words related by loan, which is represented by the left curve. This curve is now shown as a Gaussian, but would preferably have been a uniform-zero curve.


Figure 10: Normal QQ-plot for the full set of edit-distances for pronunciation-pairs from all languages, including both the 'same-family' set and the 'different-family' set.


Figure 11: Mixed-distributions plot for all language groups. There is a great deal of overlap between the set of pairs preclassified as different and the set preclassified as same.

The right region of the histogram represents the data for the 'different-family' set, reflected in the right Gaussian curve.

Represented in this form, the results show a great deal of overlap between the two curves: the region which lies below both of them. This is quite undesirable when the challenge is to make a clear distinction between loanwords and words not related by loan. Whether the representation in the form of two Gaussians is to blame for this large overlap, or the data themselves, cannot be determined at this stage. Also, the summed frequencies for the respondents are much greater than represented in the approximation curves, as can be seen when comparing the frequencies for lower edit-distance scores in the histogram with the curve.

Provided a statistical package is available in which a mix of different types of distributions can be analysed, the test can be redone in a more accurate fashion. This, however, exceeds the scope of this MA-thesis.


6 Discussion

In this section both theses formulated in the introduction will again be treated separately for the sake of consistency. There is, however, quite a degree of similarity between the conclusions drawn for both theses, which is mainly due to the tertiary thesis stated for both main theses: the inspection of wordlist reduction. We now turn to the first main thesis.

This research set out with the novelty of having a dataset which contains two entirely different language groups and six different languages, whereas previously the methods applied in this thesis had only been used for dialects within a single language. We proposed that it would be a straightforward task to make a clean and reliable separation between the two language groups, and it can be concluded that we succeeded in this regard. This becomes amply clear upon visual inspection of both the clustering and the MDS results. In all clustering results the last joining point is the split between all Indo-Iranian languages and all Turkic languages, and the MDS plots show that the Indo-Iranian languages end up on one half of the spectrum and the Turkic languages on the other.9 Even the anomaly concerning the respondents from Hitoj ended up cleanly in the same place in both analyses.

Success in broad terms for such a straightforward task as posed in the first hypothesis does not imply that there were no flaws in the approach, nor that there is nothing to improve upon. One of the first concerns, which we considered a minor flaw when interpreting the results of the first thesis, is the fact that all the edit-distances were normalized by string length, which is nowadays considered an inferior means of applying edit-distance analysis[4]. Even though for the second thesis we argue that this normalization is a minor improvement, no comparisons with results leaving out normalization by string length have been made to reinforce this claim in either half of the research.

Another trail left unexplored in this thesis is the dramatic difference in results between clustering without bootstrapping and with bootstrapping.

9See the MDS results using Kruskal mapping in Appendix C.3 for further reinforcement of this claim.


The bootstrapped clustering figure shown in Appendix B.4 shows that, percentually, joining points occur at a much lower edit distance in most runs of clustering using random samples. It could be that there are some extreme results for a number of respondents which greatly influence the clustering results when considering all samples, but this does not seem likely. For the purpose of obtaining robust results, as opposed to a single run on the full sample, it would have been preferable to consider only the bootstrapped results for this thesis; bootstrapping was, however, performed at too late a stage of the research.

In the comparison of the reduced wordlists, based on the subsets of the entire list containing only words from the Swadesh-100 and Swadesh-200 lists, we had to conclude that no significant differences occur in the results for the first thesis. An early indication of this lack of significant difference was the comparison between the edit-distances using the Pearson test: as the reader will recall, the correlation between the results was almost perfect, indicating no significant difference for these reduced wordlists. We can, however, draw a parallel here with the clustering results mentioned in the previous paragraph. It might still be the case that applying the Pearson test to edit-distances based on multiple runs — leaving out random samples — would show a lower correlation. Nonetheless, we consider the chance of finding a more significant difference too small to initiate this time-consuming process.

Disappointingly, these conclusions contradict statements made in our earlier publication, where we found indications of significant differences based on the placement of samples in MDS plots[20]. This is also the main reason that this faulty interpretation has been given so much attention: when a mistake has been made, it begs detailed motivation. In the section referred to here, we mention that — when comparing results between different MDS plots — exact placement should be disregarded and the actual distances between datapoints should be considered. One can seriously doubt the value of such an approach, because these internal distances are already present in the original edit-distance matrix before applying MDS. This does not mean that the proposition does not merit further exploration.

This thesis briefly mentions that the generation of regional maps has been omitted, but in the early stage of this research attempts were made. The resulting figure is clearly no more than a marginally successful trial run when compared to previous research based on different datasets, as Figure 12 shows.


Figure 12: The left map is a first attempt to visualize MDS results for the Central Asia dataset. The right map is an example of a successful application of mapping dialect variation in the Netherlands[15].

The main reasons were the lack of detailed coordinates per respondent, the fact that there were few — and unequally distributed — respondents, and the presence of multiple respondents per site, sometimes even speaking a different language. This is something that will be done more professionally in the near future, in collaboration with a geographer.

Results for the second main thesis, exploring the potential of Levenshtein analysis for automatically detecting loanwords, cannot at this stage be viewed as conclusively successful. The main reason for this conclusion is the large discrepancy in the proportions of the degrees of borrowing between the preclassifications made by Mennecier and those based on the edit-distance thresholds defined using the P/R analysis. The exact reasons for this discrepancy are not known at this stage, but one probable cause can be speculated. It has already been mentioned briefly in this thesis that there are some typological regularities at work, such as vowel shifts. Whereas an experienced linguist would easily see the relation between two pronunciations in which a vowel has shifted, the Levenshtein algorithm does not take this into account. In short, the discrepancy could be due to older linguistic relations which can be recognised manually, but not by means of our software[10].

To come to a solution of the first problem mentioned in the above paragraph, the data must be re-analysed manually at word level. A possible way to filter out typological regularities automatically is to replace corresponding tokens — such as vowels — by a single token before subjecting the data to the edit-distance calculation. If, when re-evaluating the results, the proportions of the degrees of borrowing become more similar to the preclassification, then this is a good indication that these typological correspondences are to blame.


Furthermore, this kind of approach to analysing typological regularity might be welcomed by researchers applying quantitative typology[2].
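
A minimal sketch of such a replacement step, with a purely illustrative vowel inventory (the actual transcription symbols in the dataset may differ), could be:

```python
VOWELS = set("aeiouy")  # illustrative inventory, not the token set used in the thesis

def collapse_vowels(tokens, placeholder="V"):
    """Replace every vowel token by a single placeholder so that regular
    vowel shifts no longer count as substitutions in the edit distance."""
    return [placeholder if t in VOWELS else t for t in tokens]

# e.g. normalized_distance(collapse_vowels(a), collapse_vowels(b))
```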

To further test the effectiveness of this approach using P/R analysis, the process should ideally be repeated on other, similar datasets (containing pronunciations from two different language groups which have also been preclassified). One can, however, seriously doubt the availability of a set of data as detailed as the one used for this thesis.

In the comparison of the reduced wordlists in this half of the research, the most surprising result was that the degree of borrowing appeared to be highest in the Swadesh-100 list. This finding of course needs more scrutiny to be proved significant and should, as mentioned above, also be tested on other datasets. If, however, it is proved significant, it strongly undermines the long-standing assumption that the reduced Swadesh-100 list contains the words most resistant to lexical replacement[8, 14]. In that hypothetical case, we would propose that other approaches should also be adopted to statistically test the reliability of the reduced Swadesh list in this regard.

One more task posed in the earlier stages of the research needs to be mentioned: producing a regional map which visualises the degree of borrowing in the region. This was omitted not only for the same reasons as those given in the previous paragraph on dialect maps, but also because the discrepancy in proportions should be analysed first. We propose using lines between the sites analysed, with the line width indicating the degree of borrowing. It will also be interesting to see whether there are spatial or genetic correlations with the degree of borrowing, which is the aim of the Musee de l'Homme and the expertise of Franz Manni[13].

The approach of analysing the data as a mix of two distributions proved ineffective at this stage, because we did not have the means of analysing a mix of two different distribution types: uniform zero and Gaussian. The approach might provide a good means of gauging whether a given set of edit-distances contains loanwords: in that case there should be little overlap between the frequency of edit-distances for words classified as belonging to the same cognate group and for words belonging to a different group. We can consider the explorations in this thesis no more than a humble initiation of this kind of analysis.


References

[1] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1st edition, 1999.

[2] B. Bickel. Typology in the 21st Century: Major Current Developments. Linguistic Typology, 11:239–251, 2007.

[3] Wilbert Heeringa. Measuring Dialect Pronunciation Differences using Levenshtein Distance. PhD thesis, University of Groningen, 2004.

[4] Wilbert Heeringa, Peter Kleiweg, Charlotte Gooskens, and John Nerbonne. Evaluation of String Distance Algorithms for Dialectology. Proceedings of the ACL Workshop on Linguistic Distances, 2006.

[5] Robert Hogg, Joseph McKean, and Allen Craig. Introduction to Mathematical Statistics. Pearson Prentice Hall, Upper Saddle River, NJ, 2005.

[6] S. C. Johnson. Hierarchical clustering schemes. Psychometrika, 1967.

[7] J. Ju. Combined Algorithms for Constrained Estimation of Finite Mixture Distributions with Grouped and Conditional Data. MA thesis, McMaster University, Ontario, Canada, 2002.

[8] Brett Kessler. The Significance of Word Lists. CSLI Press, Stanford, 2001.

[9] Peter Kleiweg. RUG L04, software for dialectometrics and cartography. http://www.let.rug.nl/~kleiweg/l04/ (last visited: June 19, 2008).

[10] G. Kondrak and T. Sherif. Evaluation of Several Phonetic Similarity Algorithms on the Task of Cognate Identification. Proceedings of the ACL Workshop on Linguistic Distances, pages 43–50, 2006.

[11] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 10:707–710, 1966.

[12] Franz Manni, Etienne Guerard, and Evelyne Heyer. Geographic Patterns of (Genetic, Morphologic, Linguistic) Variation: How Barriers Can Be Detected by Using Monmonier's Algorithm. Human Biology, 76:173–190, 2004.

[13] Franz Manni, Wilbert Heeringa, and John Nerbonne. To what Extent are Surnames Words? Comparing Geographic Patterns of Surname and Dialect Variation in the Netherlands. Literary and Linguistic Computing, 21:507–527, 2006.

[14] April McMahon and Robert McMahon. Language Classification by Numbers. Oxford University Press, Oxford, 2005.

[15] John Nerbonne, Wilbert Heeringa, and Peter Kleiweg. Edit Distance and Dialect Proximity. In David Sankoff and Joseph Kruskal, editors, Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, pages v–xv.

[16] Reynaldo A. Santos. Cronbach's Alpha: A Tool for Assessing the Reliability of Scales. Journal of Extension, 37, 1999.

[17] M. Swadesh. Salish International Relationships. IJAL, 16:161, 1950.

[18] Times Books. The Times Atlas of the World. Times Books Group, Collins Bartholomew, 9th edition, 2004.

[19] W. S. Torgerson. Multidimensional Scaling: Theory and Method. Psychometrika, 17:401–419, 1952.

[20] Rene van der Ark, Philippe Mennecier, John Nerbonne, and Franz Manni. Preliminary Identification of Language Groups and Loan Words in Central Asia. Proceedings of the International Workshop on Computational Phonology 2007, Borovets, Bulgaria, pages 13–20, 2007.

[21] John C. Wells. Computer-coding the IPA: a proposed extension of SAMPA. University College London, 1995.


Appendix A.1: Word lists

Words Full Sw-200 Sw-100 Sb-200 Sb-100all 0 1 1 0 0and 0 1 0 0 0animal 1 1 0 1 0ash 1 1 1 1 1at 0 1 0 0 0autumn 1 0 0 0 0back 1 1 0 1 0bad 1 1 0 1 0bark 1 1 1 1 1because 0 1 0 0 0belly 1 1 1 1 1bice 1 0 0 0 0big 1 1 1 1 1bird 1 1 1 1 1bite 0 1 1 0 0black 1 1 1 1 1blood 1 1 1 1 1blow 1 1 0 1 0bone 1 1 1 1 1breast 1 0 1 0 1breathe 1 1 0 1 0burn 1 1 1 1 1butterfly 1 0 0 0 0child 1 1 0 1 0claw 0 0 1 0 0cloud 1 1 1 1 1cold 1 1 1 1 1come 1 1 1 1 1correct 1 0 0 0 0count 0 1 0 0 0cut 1 1 0 1 0day 1 1 0 1 0die 1 1 1 1 1dig 1 1 1 1 1dirty 1 1 0 1 0dog 1 1 0 1 0

Continued on next page

51

Page 52: Comparing Languages and Dialects in Central Asia - Over … · Comparing Languages and Dialects in Central Asia ... This thesis is the result of a collaboration between ... each pair

Words Full Sw-200 Sw-100 Sb-200 Sb-100drink 1 1 1 1 1dry 1 1 1 1 1dull 1 1 0 1 0dust 1 1 0 1 0ear 1 1 1 1 1earth 1 1 1 1 1eat 1 1 1 1 1egg 1 1 1 1 1eye 1 1 1 1 1fall 0 1 0 0 0far 1 1 0 1 0fat 1 1 0 1 0father 1 1 0 1 0fear 1 1 0 1 0feather 1 1 1 1 1few 0 1 0 0 0fight 0 1 0 0 0fingernail 1 0 0 0 0fire 1 1 1 1 1fish 1 1 1 1 1five 1 1 0 1 0flesh 0 0 1 0 0float 0 1 0 0 0flow 0 1 0 0 0flower 1 1 0 1 0fly 1 1 1 1 1fog 1 1 0 1 0foot 0 1 1 0 0forest 1 0 0 0 0four 1 1 0 1 0freeze 1 1 0 1 0fruit 1 1 0 1 0full 1 0 1 0 1give 1 1 1 1 1go 1 0 0 0 0good 1 1 1 1 1grass 1 1 0 1 0grease 0 0 1 0 0

Continued on next page

52

Page 53: Comparing Languages and Dialects in Central Asia - Over … · Comparing Languages and Dialects in Central Asia ... This thesis is the result of a collaboration between ... each pair

Words Full Sw-200 Sw-100 Sb-200 Sb-100green 1 1 1 1 1guts 0 1 0 0 0hair 1 1 1 1 1hand 1 1 1 1 1he 1 1 0 1 0head 1 1 1 1 1hear 1 1 1 1 1heart 1 1 1 1 1heavy 1 1 0 1 0here 0 1 0 0 0hit 1 1 0 1 0hold 1 1 0 1 0horn 1 0 1 0 1hot 0 0 1 0 0how 1 1 0 1 0hunt 1 1 0 1 0husband 1 1 0 1 0i 1 1 1 1 1ice 1 1 0 1 0if 0 1 0 0 0in 0 1 0 0 0kill 1 1 1 1 1knee 1 0 1 0 1know 1 1 1 1 1lake 1 1 0 1 0laugh 1 1 0 1 0leaf 1 1 1 1 1left 1 1 0 1 0leg 1 1 0 1 0lie 1 1 1 1 1live 1 1 0 1 0liver 1 1 1 1 1long 1 1 1 1 1louse 1 1 1 1 1man 1 1 1 1 1man(person) 1 1 1 1 1many 0 1 1 0 0meat 1 1 0 1 0

Continued on next page

53

Page 54: Comparing Languages and Dialects in Central Asia - Over … · Comparing Languages and Dialects in Central Asia ... This thesis is the result of a collaboration between ... each pair

Words Full Sw-200 Sw-100 Sb-200 Sb-100moon 1 1 1 1 1mother 1 1 0 1 0mountain 1 1 1 1 1mouth 1 1 1 1 1name 1 1 1 1 1narrow 1 1 0 1 0near 1 1 0 1 0neck 1 1 1 1 1new 1 1 1 1 1night 1 1 1 1 1nose 1 1 1 1 1not 1 1 1 1 1old 1 1 0 1 0one 1 1 1 1 1other 1 1 0 1 0path 0 0 1 0 0play 1 1 0 1 0pull 0 1 0 0 0push 1 1 0 1 0rain 1 1 1 1 1red 1 1 1 1 1right 1 1 0 1 0river 1 1 0 1 0road 1 1 0 1 0root 1 1 1 1 1rope 1 1 0 1 0rotten 1 1 0 1 0round 1 0 1 0 1rub 0 1 0 0 0salt 1 1 0 1 0sand 1 1 1 1 1say 1 1 1 1 1scratch 1 1 0 1 0sea 1 1 0 1 0see 1 1 1 1 1seed 0 1 1 0 0sew 1 1 0 1 0sharp 1 1 0 1 0

Continued on next page

54

Page 55: Comparing Languages and Dialects in Central Asia - Over … · Comparing Languages and Dialects in Central Asia ... This thesis is the result of a collaboration between ... each pair

Words Full Sw-200 Sw-100 Sb-200 Sb-100short 1 1 0 1 0sing 0 1 0 0 0sit 1 1 1 1 1skin 1 1 1 1 1sky 1 1 0 1 0sleep 1 1 1 1 1small 1 1 1 1 1smell 1 1 0 1 0smoke 1 1 1 1 1smooth 1 1 0 1 0snake 1 1 0 1 0snow 1 1 0 1 0some 0 1 0 0 0spit 1 1 0 1 0split 0 1 0 0 0squeeze 1 1 0 1 0stab 0 1 0 0 0stand 1 1 1 1 1star 1 1 1 1 1stick 1 1 0 1 0stone 1 1 1 1 1straight 1 1 0 1 0suck 1 1 0 1 0sun 1 1 1 1 1swell 0 1 0 0 0swim 1 1 1 1 1tail 1 1 1 1 1that 0 1 1 0 0there 0 1 0 0 0they 1 1 0 1 0thick 1 1 0 1 0thin 1 1 0 1 0think 1 1 0 1 0this 0 1 1 0 0three 1 1 0 1 0throw 1 1 0 1 0tie 1 1 0 1 0tongue 1 1 1 1 1

Continued on next page

55

Page 56: Comparing Languages and Dialects in Central Asia - Over … · Comparing Languages and Dialects in Central Asia ... This thesis is the result of a collaboration between ... each pair

Words Full Sw-200 Sw-100 Sb-200 Sb-100tooth 1 1 1 1 1tree 1 1 1 1 1turn 0 1 0 0 0two 1 1 1 1 1vomit 0 1 0 0 0walk 0 1 1 0 0warm 1 1 0 1 0wash 1 1 0 1 0water 1 1 1 1 1we 0 1 1 0 0wet 1 1 0 1 0what 1 1 1 1 1when 1 1 0 1 0where 1 1 0 1 0white 1 1 1 1 1who 1 1 1 1 1wide 1 1 0 1 0wife 0 1 0 0 0wind 1 1 0 1 0wing 0 1 0 0 0wipe 1 1 0 1 0with 0 1 0 0 0woman 1 1 1 1 1woods 0 1 0 0 0worm 1 1 0 1 0ye 0 1 0 0 0year 1 1 0 1 0yellow 1 1 1 1 1you 1 1 1 1 1you (pl) 1 0 0 0 0Sum (tot: 218) 176 200 100 163 86

Word lists. Columns: ’Full’: all words in the Central Asia dataset;’Sw-200’: full Swadesh-200 list; ’Sw-100’: full Swadesh-100 list; ’Sb-200’:subset with all Swadesh-200 words in Central Asia dataset; ’Sb-100’: all

Swadesh-100 words in Central Asia set.


Appendix A.2: Original Order of Samples and Randomized Order of Samples

Index Original Randomized
1 KK-Halqabad-1 U-Urtoqqishloq-3
2 KK-Halqabad-2 T-Kaptarhona-2
3 KK-Halqabad-3 U-Soj-Mahalla-1
4 KK-Kokdaria-1 U-Soj-Mahalla-2
5 KK-Kokdaria-2 Y-Safedorak-1
6 KK-Kokdaria-3 Ka-Gazli-5
7 KK-Kokdaria-4 T-Kamangaron-1
8 KK-Shege-1 U-Novmetan-3
9 KK-Shege-2 Ki-Ordaj-1
10 KK-Shege-3 T-Nushor-1
11 KK-Shege-4 T-Navdi-3
12 Ka-Gazli-1 T-Shink-1
13 Ka-Gazli-2 T-Kamangaron-3
14 Ka-Gazli-3 T-Nimich-3
15 Ka-Gazli-4 T-Shink-2
16 Ka-Gazli-5 T-Zarmanak-2
17 Ka-Gazli-6 KK-Shege-4
18 Ka-Gazli-7 T-Rishtan-2
19 Ka-Raushan-1 KK-Halqabad-1
20 Ka-Raushan-2 Ka-Raushan-2
21 Ka-Raushan-3 U-Soj-Mahalla-3
22 Ka-Raushan-4 Ka-Gazli-2
23 Ki-Ordaj-1 KK-Shege-3
24 Ki-Ordaj-2 KK-Shege-2
25 Ki-Ordaj-3 Ka-Gazli-7
26 Ki-Ordaj-4 T-Rishtan-3
27 T-Agalik Ki-Ordaj-2
28 T-Kamangaron-1 T-Nimich-4
29 T-Kamangaron-2 T-Nimich-2
30 T-Kamangaron-3 T-Novmetan-1
31 T-Kamangaron-4 Y-Dugova-1
32 T-Kaptarhona-1 T-Kamangaron-2
33 T-Kaptarhona-2 KK-Kokdaria-2
34 T-Kaptarhona-3 KK-Kokdaria-4
35 T-Kaptarhona-4 T-Urmetan-1
36 T-Kaptarhona-5 T-Navdi-2
37 T-Navdi-1 Ka-Raushan-1
38 T-Navdi-2 Y-Dugova-2
39 T-Navdi-3 U-Urtoqqishloq-4
40 T-Nimich-1 Ka-Raushan-4
41 T-Nimich-2 T-Novmetan-2
42 T-Nimich-3 T-Kamangaron-4
43 T-Nimich-4 U-Hitoj-3
44 T-Novmetan-1 KK-Halqabad-3
45 T-Novmetan-2 T-Kaptarhona-5
46 T-Nushor-1 Y-Safedorak-3
47 T-Nushor-2 Y-Safedorak-2
48 T-Nushor-3 U-Hitoj-1
49 T-Rishtan-1 T-Urmetan-3
50 T-Rishtan-2 U-Zarmanak-4
51 T-Rishtan-3 U-Urtoqqishloq-2
52 T-Shink-1 T-Agalik
53 T-Shink-2 T-Rishtan-1
54 T-Shink-3 U-Urtoqqishloq-1
55 T-Urmetan-1 T-Nimich-1
56 T-Urmetan-2 Ka-Gazli-1
57 T-Urmetan-3 Ki-Ordaj-3
58 T-Urmetan-4 T-Navdi-1
59 T-Zarmanak-1 Ka-Gazli-4
60 T-Zarmanak-2 T-Kaptarhona-3
61 U-Hitoj-1 Ka-Gazli-3
62 U-Hitoj-2 KK-Shege-1
63 U-Hitoj-3 T-Kaptarhona-1
64 U-Novmetan-3 U-Hitoj-2
65 U-Soj-Mahalla-1 T-Nushor-3
66 U-Soj-Mahalla-2 KK-Kokdaria-3
67 U-Soj-Mahalla-3 KK-Kokdaria-1
68 U-Urtoqqishloq-1 T-Shink-3
69 U-Urtoqqishloq-2 Ka-Gazli-6
70 U-Urtoqqishloq-3 U-Zarmanak-3
71 U-Urtoqqishloq-4 KK-Halqabad-2
72 U-Zarmanak-3 T-Urmetan-4
73 U-Zarmanak-4 Ki-Ordaj-4
74 Y-Dugova-1 T-Urmetan-2
75 Y-Dugova-2 T-Zarmanak-1
76 Y-Safedorak-1 Ka-Raushan-3
77 Y-Safedorak-2 T-Kaptarhona-4
78 Y-Safedorak-3 T-Nushor-2

The ordering of respondents has been randomized in order to test whether the shift in placement in multidimensional scaling from all words to only Swadesh-100 words would persist. After using the randomized samples as input to the MDS analysis, the seemingly significant shift almost disappeared entirely. MDS starts approximating the distances in the 78x78 matrix with the first sample, after which the next samples are placed. This also explains why, in the original list, only the Karakalpak group stayed mostly in place and the placement shift rose in distance for the following groups — except for Yagnobi, which has fewer respondents to consider.


Appendix B.1: Clustering Dendrogram (UPGMA), Full Word List, Without Bootstrapping


Appendix B.2: Clustering Dendrogram (UPGMA), Swadesh-100 List, Without Bootstrapping


Appendix B.3: Clustering Dendrogram (UPGMA), Swadesh-200 List, Without Bootstrapping


Appendix B.4: Clustering Dendrogram, Full List, After Bootstrapping


Appendix C.1: MDS-Scatterplots of the Two Language Groups Separately


Appendix C.2: MDS-Scatterplots of All Words and Swadesh-100 List After Randomization of Input


Appendix C.3: MDS-Scatterplots of All Words using Sammon- and Kruskal Mapping


Appendix D: Graphical Interpretation of Proportionality between Degrees of Borrowing

The top graphic shows the degree of borrowing between the Turkic languages and Tajik, based on the preclassification by Mennecier and the two thresholds. The bottom graphic shows the same between the Turkic languages and Yagnobi.


Appendix E: Degree of Borrowing Based on 0.02 Threshold

Word KK-T Ka-T Ki-T U-T KK-Y Ka-Y Ki-Y U-Y
fog 54.50% 90.90% 100.00% 100.00% 32.70% 54.50% 60.00% 60.00%
breathe 0.00% 0.00% 72.80% 74.70% 0.00% 0.00% 75.00% 76.90%
liver 8.80% 0.00% 24.30% 74.70% 9.10% 0.00% 25.00% 76.90%
tree 8.60% 0.00% 0.00% 72.40% 9.10% 0.00% 0.00% 76.90%
left 0.00% 0.00% 48.50% 74.70% 0.00% 0.00% 0.00% 0.00%
wind 0.00% 0.30% 0.70% 61.30% 0.00% 0.00% 0.00% 55.40%
back 0.00% 0.00% 0.00% 0.00% 47.30% 54.50% 0.00% 13.80%
dust 5.30% 0.00% 58.80% 50.70% 0.00% 0.00% 0.00% 0.00%
feather 26.70% 0.00% 20.60% 20.40% 27.30% 5.50% 0.00% 9.20%
short 0.00% 0.00% 20.60% 16.30% 0.00% 0.00% 40.00% 30.80%
animal 1.10% 7.80% 0.00% 50.50% 0.00% 9.10% 0.00% 36.90%
horn 0.00% 0.00% 0.00% 44.80% 0.00% 0.00% 0.00% 46.20%
big 1.30% 0.00% 3.70% 7.90% 7.30% 0.00% 20.00% 43.10%
sand 14.70% 7.20% 2.90% 10.00% 20.00% 7.30% 0.00% 12.30%
river 0.00% 0.00% 0.00% 48.90% 0.00% 0.00% 0.00% 24.60%
fruit 0.30% 0.00% 11.80% 31.70% 0.00% 0.00% 0.00% 24.60%
leaf 0.00% 0.00% 11.80% 23.50% 0.00% 0.00% 0.00% 30.80%
sky 0.00% 0.00% 0.00% 45.00% 0.00% 0.00% 0.00% 13.80%
flower 7.80% 4.00% 11.80% 13.60% 0.00% 0.00% 0.00% 15.40%
cloud 0.00% 0.00% 23.50% 12.70% 0.00% 0.00% 0.00% 0.00%
tail 0.00% 0.00% 1.50% 9.50% 1.80% 0.00% 0.00% 13.80%
bark 0.00% 0.00% 8.10% 8.60% 0.00% 0.00% 0.00% 7.70%
bird 2.10% 0.00% 5.90% 12.20% 0.00% 0.00% 0.00% 0.00%
egg 0.50% 0.00% 5.90% 13.30% 0.00% 0.00% 0.00% 0.00%
lake 0.00% 0.00% 5.10% 2.70% 0.00% 0.00% 10.00% 0.00%
root 0.30% 0.00% 0.00% 6.60% 0.00% 0.00% 0.00% 6.20%
mother 0.00% 0.00% 0.00% 11.50% 0.00% 0.00% 0.00% 0.00%
fire 0.00% 0.00% 0.00% 10.40% 0.00% 0.00% 0.00% 0.00%
name 0.00% 0.00% 0.00% 6.10% 0.00% 0.00% 0.00% 1.50%
ice 0.00% 0.00% 0.00% 7.20% 0.00% 0.00% 0.00% 0.00%
sea 0.00% 0.00% 0.00% 6.80% 0.00% 0.00% 0.00% 0.00%
stick 0.00% 0.00% 0.00% 6.60% 0.00% 0.00% 0.00% 0.00%
louse 2.90% 2.40% 0.00% 1.10% 0.00% 0.00% 0.00% 0.00%
meat 0.00% 0.00% 0.00% 6.30% 0.00% 0.00% 0.00% 0.00%
smoke 0.00% 0.00% 0.00% 2.50% 0.00% 0.00% 0.00% 3.10%
forest 0.00% 0.00% 0.00% 5.00% 0.00% 0.00% 0.00% 0.00%
correct 0.00% 2.40% 0.00% 2.30% 0.00% 0.00% 0.00% 0.00%
fingernail 0.00% 0.00% 0.00% 4.30% 0.00% 0.00% 0.00% 0.00%
father 0.00% 0.00% 0.00% 4.10% 0.00% 0.00% 0.00% 0.00%
smooth 0.00% 0.00% 0.00% 3.60% 0.00% 0.00% 0.00% 0.00%
mountain 0.00% 0.00% 0.00% 3.40% 0.00% 0.00% 0.00% 0.00%
narrow 0.00% 0.00% 1.50% 1.80% 0.00% 0.00% 0.00% 0.00%
straight 0.00% 2.40% 0.00% 0.50% 0.00% 0.00% 0.00% 0.00%
butterfly 0.30% 0.00% 0.00% 2.50% 0.00% 0.00% 0.00% 0.00%
bone 0.00% 0.00% 0.00% 2.70% 0.00% 0.00% 0.00% 0.00%
head 0.00% 0.00% 0.00% 2.50% 0.00% 0.00% 0.00% 0.00%
old 0.00% 0.00% 0.00% 2.50% 0.00% 0.00% 0.00% 0.00%
grass 0.00% 0.00% 2.20% 0.20% 0.00% 0.00% 0.00% 0.00%
dull 0.00% 0.00% 0.00% 1.80% 0.00% 0.00% 0.00% 0.00%
dirty 0.00% 0.00% 0.00% 1.60% 0.00% 0.00% 0.00% 0.00%
i (self) 0.30% 0.30% 0.00% 0.90% 0.00% 0.00% 0.00% 0.00%
skin 0.00% 0.00% 0.00% 1.40% 0.00% 0.00% 0.00% 0.00%
woman 0.00% 0.00% 0.00% 1.10% 0.00% 0.00% 0.00% 0.00%
knee 0.00% 0.00% 0.00% 0.90% 0.00% 0.00% 0.00% 0.00%
ashes 0.00% 0.00% 0.00% 0.70% 0.00% 0.00% 0.00% 0.00%
dog 0.00% 0.00% 0.00% 0.70% 0.00% 0.00% 0.00% 0.00%
worm 0.00% 0.00% 0.00% 0.50% 0.00% 0.00% 0.00% 0.00%
sharp 0.00% 0.00% 0.00% 0.50% 0.00% 0.00% 0.00% 0.00%
breast 0.00% 0.00% 0.00% 0.50% 0.00% 0.00% 0.00% 0.00%

Degree of borrowing between respondents from different language groups: KK = Karakalpak, Ka = Kazakh, Ki = Kirgiz, U = Uzbek, T = Tajik, Y = Yagnobi.
