Dialect o Metrics


NLP Seminar, January 2010

Yonatan Belinkov

Outline

• Definitions

• Urban dialectology

• Dialect geography

• Dialectometry

• Measuring the Diffusion of Linguistic Change, John Nerbonne.

What is a dialect

• “A dialect is a subdivision of a particular language.”

• E.g. the Parisian dialect of French, the Bavarian dialect of German, etc.

• “A language is a collection of mutually intelligible dialects.”

• But Scandinavian languages are mutually intelligible, while (some) German dialects are mutually unintelligible.

• Mutual intelligibility may not be bidirectional (Danes understand Norwegians better than the other way around).

What is a dialect (cont.)

• Thus, “language” is not a purely linguistic term.

• It is influenced by other factors: political, geographical, historical, sociological and cultural.

• “A language is a dialect with an army and navy”

  • Max Weinreich: אַ שפּראַך איז אַ דיאַלעקט מיט אַן אַרמיי און פֿלאָט.

What is a dialect (cont.)

• Difficulty in distinguishing between dialect and language calls for more technical definitions.

• “A variety is any particular kind of language which is considered as a single entity.”

• “A dialect is a variety of language which is grammatically and lexically different from a similar variety.”

Urban dialectology

• Traditional dialectology concentrated on regional or geographical dialects and dialect continua.

• However, other factors also play an important role in the way one speaks.

• In the 1960s scholars began describing linguistic varieties by other criteria: social status, education, ethnic/religious affiliation, age, gender, etc.

• Example: Communal dialects in Baghdad (Blanc 1964)

Urban dialectology (cont.)

• Socio-dialects are not discrete; they form a social dialect continuum.

• Jamaican Creole, “It’s my book”: its mai buk / iz mai buk / iz mi buk / a mi buk dat / a fi mi buk dat

Dialect geography

• Geographically, dialects form continua.

• West Romance dialect continuum:

  • While standard varieties of French, Spanish, Catalan and Portuguese are not mutually intelligible,

  • The rural dialects form a continuum with neighboring speakers easily understanding each other.

• Arabic dialect continuum:

  • Arabic dialects share the same standard language.

  • Neighboring speakers communicate easily,

  • But remote dialects are mutually unintelligible.

Dialect geography - history

• First significant dialect survey: Wenker 1877-1887

• Sent a list of (short) sentences in standard German to schoolmasters.

• Example: Im Winter fliegen die trocknen Blätter durch die Luft herum.

• Received transcriptions of the sentences into local dialects.

• 45,000 questionnaires from all over Germany.

• Published the first linguistic atlases (Sprachatlas)

Dialect geography - history

• Field work gradually replaced postal questionnaires.

• In 1896-1900, Edmond Edmont interviewed 700 informants around the French countryside.

• His data were incorporated in Gilliéron’s French survey, which was published between 1902 and 1910.

• Subsequent atlases published: Italy and southern Switzerland (1931-1940), US and Canada (1939-1943, 1949, 1953, 1961, 1973-1976, 1981-1992, 1994), England (1962-1978), Ḥōrān (1940-1946), Egypt (1985), Syria (1997)…

Dialect geography - methodology

• Much of the methodology is shared with other branches of linguistics:

  • Recording data (phonetics)

  • Analyzing data (theoretical linguistics, sociolinguistics, historical linguistics)

• Some methods are unique or especially important in dialect geography:

  • Devising questionnaires

  • Building linguistic maps

  • Selecting informants

Questionnaires

• Using questionnaires ensures comparability of data gathered by different fieldworkers in varying conditions.

• Questions can be direct (“what do you call a cup”) or, preferably, indirect (“what is this?”).

• Questionnaires are organized according to semantic fields (weather, social activities, etc.) so that the informant will focus on the subject matter and not on the form of his answer.

• Since tape-recording became available, it has become easier to engage in casual, informal conversation.

Linguistic maps

• Linguistic maps can be display maps, simply showing the data on a map, or interpretive maps, showing distribution of predominant variants from region to region.

• Example: “What do you call that small, four-legged, long-tailed creature, blackish on top, it darts about in ponds?”

• Contrast the display map with the interpretive map (in the following slides).


Newt Display Map

Newt Interpretive Map


Informants

• Historically, most surveys focused on nonmobile, old, rural males.

• The motivation for this homogeneous background is that the informants’ speech should reflect the authentic speech of the area in which they live.

• Fewer studies recorded more heterogeneous speakers (young, educated, female, etc.).

Dialectometry

• The variable as a structural unit.

• Dialects may differ quantitatively with regard to variables.

• Ex.: simplification of final consonant clusters (pos’card for postcard, han’ful for handful).

  • Subject to linguistic constraints such as environment (before consonant/vowel/pause).

  • But also to non- or extra-linguistic factors such as style or class.

  • Varying frequencies in different dialects.

Dialectometry (cont.)

• The term was coined by Séguy (1973).

• Séguy published the linguistic atlas of Gascony in the 1950s and 1960s.

• The first 5 volumes were within the framework of the Gilliéron tradition.

• But Séguy looked for a more objective way to reveal the dialect regions of Gascony.

• He managed to do so in the 6th volume, published in 1973.

Dialectometry (cont.)

• Basic idea: devise a dissimilarity measure based on the survey data.

• Algorithm:

  • Compare responses from every pair of neighboring sites.

  • Count the number of items on which the neighbors disagreed.

  • Calculate the percentage of disagreement.

• This gives the linguistic distance between two dialects.
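The percentage-of-disagreement measure can be sketched in a few lines of Python. This is only an illustration of the counting scheme described on the slide, not Séguy's actual procedure; the site names, survey items and responses are invented, and `seguy_distance` is a hypothetical helper name.

```python
def seguy_distance(responses_a, responses_b):
    """Percentage of shared survey items on which two sites disagree."""
    shared = [item for item in responses_a if item in responses_b]
    if not shared:
        raise ValueError("the two sites share no survey items")
    disagreements = sum(responses_a[item] != responses_b[item] for item in shared)
    return 100.0 * disagreements / len(shared)


# Hypothetical survey data: item -> variant recorded at each site.
survey = {
    "site_A": {"newt": "ask", "cup": "cup", "postcard": "pos'card", "handful": "handful"},
    "site_B": {"newt": "evet", "cup": "cup", "postcard": "pos'card", "handful": "han'ful"},
    "site_C": {"newt": "evet", "cup": "mug", "postcard": "postcard", "handful": "handful"},
}

# Only neighboring sites are compared (the neighbor pairs are assumed here).
neighbors = [("site_A", "site_B"), ("site_B", "site_C")]
distances = {pair: seguy_distance(survey[pair[0]], survey[pair[1]]) for pair in neighbors}

for pair, dist in distances.items():
    print(pair, f"{dist:.0f}%")
```

Restricting the comparison to neighboring pairs is what lets the resulting numbers be written directly onto the map, between the sites they connect.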

Dialectometry (cont.)

• Refinements:

  • Calculate the respective percentage of disagreement for different types of items (lexical, phonological, syntactic, etc.).

  • Linguistic distance is the mean percentage over all types.

• Map of linguistic distances in southwest Gascony.

• What can be inferred from the map?

  • Northwestern group with low linguistic distance (10-15%).

  • Site 693 is connected to similar neighbors on 3 sides (11-19%) and less-similar neighbors to the east (22-28%).

  • Possible explanation: departmental boundary of Hautes-Pyrénées

Southwest Gascony


Multidimensional scaling

• Séguy’s maps retain geographic distance and represent linguistic distance as a number.

• In multidimensional scaling (MDS) linguistic distance is displayed spatially.

• We place data in a dissimilarity matrix:

  • Rows are variables, columns are informants.

  • Entries are binary.

• We need to assign a vector to each informant.

MDS (cont.)

• In Generalized MDS, given:

  • $k$ objects.

  • A dissimilarity measure $d$.

  • A natural number $N$.

• Calculate $d_{ij}$, the distance between items $i$ and $j$.

• Build a $k \times k$ matrix $A$ where $A_{ij} = d_{ij}$.

• Find $k$ vectors $x_1, \dots, x_k \in \mathbb{R}^N$ such that $\|x_i - x_j\| \approx d_{ij}$ for all $i, j$.

• If $N = 2$ or $N = 3$ we can plot the vectors.
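One way to find such vectors is classical (Torgerson) MDS, sketched below with NumPy. The slides do not say which MDS variant is meant, so this is an assumption chosen for brevity; the function name `classical_mds` and the small distance matrix are illustrative only.

```python
import numpy as np


def classical_mds(dist, n_dims=2):
    """Embed k objects in R^n_dims so that Euclidean distances approximate dist."""
    dist = np.asarray(dist, dtype=float)
    k = dist.shape[0]
    # Double-center the squared distances to obtain a Gram (inner product) matrix.
    centering = np.eye(k) - np.ones((k, k)) / k
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the n_dims largest.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_dims]
    # Clip tiny negative eigenvalues caused by rounding or non-Euclidean input.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Hypothetical dissimilarities d_ij between k = 4 objects
# (they happen to be the corner-to-corner distances of a 3 x 4 rectangle).
A = np.array([
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
])

coords = classical_mds(A, n_dims=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.round(recovered, 3))
```

With N = 2 the rows of `coords` can be plotted directly; for real dialect data the match between plotted and original distances is only approximate.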

MDS - example

• Davis & McDavid (1950) described the transition zone in Northwestern Ohio.

  • 5 towns: Perrysburg, Defiance, Ottawa, Van Wert and Upper Sandusky (see map).

  • 10 informants, 2 from each town.

  • 56 variables; most have variants from two adjacent dialect regions, Northern and Midland, from which immigrants arrived in the area (see table).

• Davis & McDavid could not “give convincing reasons for the restriction of some items and the spreading of others”.

Northwestern Ohio map


Northwestern Ohio table


MDS – example (cont.)

• Two years later, Reed & Spicer (1952) did a statistical analysis of covariance on the same data.

• They showed that the speech of informants who lived closer to each other was more similar than the speech of informants who lived far from each other.

• Reed & Spicer were ahead of their time in the quantitative approach they took.

MDS – example (cont.)

• Chambers (in Chambers & Trudgill, 1998) used correspondence analysis with the same data, and arrived at the following figure.

• Interpretation:

  • 3 clusters in different quadrants: P1 and P2; V1, V2, US1, US2 and O2; D1, D2 and O1.

  • The 1st cluster tends to choose Northern variants.

  • The 2nd cluster tends to choose Midland variants.

  • The 3rd cluster has a mixed pattern of choices.

• These observations correlate with the geographic map.

Northwestern Ohio MDS


Goebl

• After Séguy’s breakthrough in dialectometry, Goebl (1982, 1984; taken from Nerbonne & Kretzschmar 2003) extended and developed new methods for measuring dialect differences.

• Recall that Séguy’s measure counted differences in responses to questionnaires in pairs of sites.

• Goebl explored measures that give more weight to less frequent words.

• He also studied the level of coherence between a certain site and other sites to discover whether it is an island or a transition area.

More recent work

• Kessler (1995) first used (weighted) Levenshtein distance as a linguistic distance.

  • Calculated Levenshtein distances between phonetic strings of Irish Gaelic words.

  • Used 12 phonetic features (nasality, rounding, length, etc.), with values between 0 and 1, to describe phones; the distance between two phones is the average difference between feature values.

  • Applied clustering techniques to the calculated distances.

  • Obtained dialect boundaries which correspond to provincial boundaries.
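A feature-weighted Levenshtein distance of this kind might look as follows. This is a toy sketch, not Kessler's implementation: it uses only 3 invented features instead of 12, the feature values are made up, and the insertion/deletion cost (`indel_cost`) is an assumption, since the slide does not specify it.

```python
def phone_distance(p, q, features):
    """Average absolute difference between the feature values of two phones."""
    fp, fq = features[p], features[q]
    return sum(abs(a - b) for a, b in zip(fp, fq)) / len(fp)


def weighted_levenshtein(s, t, features, indel_cost=1.0):
    """Edit distance where substituting a phone costs its feature distance."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * indel_cost
    for j in range(1, cols):
        table[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = min(
                table[i - 1][j] + indel_cost,   # delete s[i-1]
                table[i][j - 1] + indel_cost,   # insert t[j-1]
                table[i - 1][j - 1] + phone_distance(s[i - 1], t[j - 1], features),
            )
    return table[-1][-1]


# Hypothetical feature values (nasality, rounding, length), each in [0, 1].
FEATURES = {
    "m": (1.0, 0.0, 0.0),
    "b": (0.0, 0.0, 0.0),
    "a": (0.0, 0.0, 0.5),
    "o": (0.0, 1.0, 0.5),
}

print(weighted_levenshtein("ma", "bo", FEATURES))
```

Because similar phones are cheap to substitute, two pronunciations that differ only in, say, rounding end up closer than two that differ by a whole inserted segment.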

More recent work (cont.)

• Heeringa & Nerbonne (2002) studied dialect areas and dialect continua using Levenshtein distance.

• They calculated Levenshtein distances between all pairs of 27 Dutch dialects which lie on a straight line.

• On the one hand, they used regression to account for linguistic distance by geographic distance, thus validating the continuum concept.

• On the other, they used clustering to detect dialect areas.

• Finally, MDS showed interrelations between dialects.

More recent work (cont.)

• A number of refinements and alternatives to Levenshtein distance have been suggested (surveyed in Nerbonne & Kretzschmar 2003).

• Kondrak notes that prefixes and suffixes tend to get deleted and explores local alignments (or distances) between strings.

• Heeringa & Gooskens attempt to measure pronunciation differences in acoustic recordings instead of phonetic transcriptions.

• Nerbonne & Kleiweg deal with related but non-identical question responses (clears up, clears, clearing up).

Visualization

• Early dialectologists already presented their findings visually (in various linguistic maps).

• Computers enable us to visualize data in more vivid, telling ways.

• Nerbonne (2005) presents Dutch dialects, their distances and inner groups, using several visualizations (pp. 18, 20-23, 25, 26).

Measuring the Diffusion of Linguistic Change

John Nerbonne, 2009

Models of diffusion

• What is linguistic diffusion?

• Sociolinguistic vs. spatial diffusion

• The wave model: innovations spreading outwards in waves.

• The skipping stone model: innovations leaping discontinuously between centers of influence.

  • Innovations spread locally in waves around each center.

  • Centers of influence are usually larger cities or towns.

The gravity model

• Developed by Peter Trudgill (1974).

• Geographic distance and population size predict the chance of communication and thus the degree of diffusion.

• As in physical gravity, the most influential site (=body) is the nearest largest (=most massive) one.

• Influence is inversely proportional to the square of the distance between sites, and proportional to the product of their population sizes:

  • $I_{ij} = s \cdot \dfrac{P_i P_j}{d_{ij}^2}$
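As a quick numeric illustration of the gravity formula, the sketch below evaluates it for invented populations and distances. The function name `gravity_influence` is hypothetical, and `s` is kept as the unexplained factor from the slide with a default of 1.

```python
def gravity_influence(pop_i, pop_j, distance, s=1.0):
    """Mutual influence of sites i and j: s * P_i * P_j / d_ij^2."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return s * pop_i * pop_j / distance ** 2


# Hypothetical example: a town of 100 (thousand) speakers and two candidate
# centers of influence, one nearby and small, one distant and large.
near_small = gravity_influence(100, 200, distance=10)    # 200.0
far_large = gravity_influence(100, 2000, distance=50)    # 80.0

print(near_small, far_large)
```

In this made-up case the nearby small center outweighs the distant one even though it is ten times smaller, because influence falls off with the square of the distance.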

Séguy’s Curve

• Séguy measured lexical, or linguistic, distance and compared it to geographic distance.

• He found that lexical distance is a sub-linear function of geographic distance (square root of logarithm).

Dialectometric view of gravity

• Why use dialectometric methods in this case?

  • Avoid arbitrary choice of which features to focus on.

  • Quantify influence to arrive at a more general perspective.

• Several studies attempted to test the validity of the gravity model using dialectometry (references in Nerbonne 2009).

Dialectometric view of gravity (cont.)

• Nerbonne & Heeringa (2007) derived linguistic (Levenshtein) distances from 52 towns in the Netherlands.

  • Interestingly, Levenshtein operation costs were derived from comparing spectrograms.

• Since in the gravity model influence correlates inversely with geographic distance, they stipulated that, according to the gravity model, linguistic distance should correlate directly with geographic distance.

• Indeed, they found a direct correlation, but sub-linear and not quadratic (as predicted by the gravity model).

• They also found no effect of population size on linguistic distance.

Dialectometric view of gravity (cont.)

• Heeringa (2007) included more Dutch data.

• He too found a sub-linear connection between linguistic distance and geographic distance.

• However, he also found that population size contributes to the linguistic distance, as in the gravity model.

• Alewijnse et al. (2007) found a sub-linear, logarithmic correlation between linguistic and geographic distance in Bantu data collected in Gabon.

• Prokić (2007) and Nerbonne & Siedle (2005) arrived at similar results with Bulgarian and German data, respectively.

Dialectometric view of gravity (cont.)

• The same correlation was found in other studies, in the US, the Netherlands (again) and Norway.

• In the above studies geography accounted for 16-37% of the linguistic variation.

• Note that in all of the above, linguistic distance is narrowed down to phonetic distance.

• Spruit (2006) measured syntactic distance and found a linear correlation with geographic distance.

Individual vs. Aggregate Differences

• Dialectometry measures the influence of geography on aggregate, cumulative variation.

• Sociolinguistics, on the other hand, focuses on the diffusion of single items (words, sounds).

• What is the relation between diffusion of individual items and aggregate diffusion?

• Simulating the diffusion of individual items could save the time and effort it would take a researcher to examine distributions of many individual items.

Simulating Diffusion

• Create several thousand sites.

• Sites are at different distances from a single reference site.

• Each site is represented by a 100-dimensional binary vector.

• Each dimension symbolizes a linguistic feature.

• Each dimension is a binary variable: “0” means that the site is the same as the reference site with respect to that feature; “1” means that it is different.

Simulating Diffusion (cont.)

• The simulation comprises two views: linear and quadratic, corresponding to Séguy’s curve and the gravity model.

• In both cases, a random change is created n times in each site, where n depends on the site’s distance from the reference site.

• In the linear view, n depends linearly on the distance.

• In the quadratic view, n depends on the square of the distance.

Simulating Diffusion (cont.)

• Creating the random change:

  • Randomly select dimension i in the 100-dimensional vector.

  • Create a random number x between 0 and 1.

  • If x > 0.5, set the value of dimension i to 1; otherwise, set it to 0.

• The aggregate distance of a vector from the reference site is the sum of all its elements.
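The simulation steps above can be sketched as follows. This is a reading of the slide, not Nerbonne's code: the proportionality constant `rate`, the integer distances, the number of sites and the random seed are all assumptions introduced for the illustration.

```python
import random

N_FEATURES = 100  # each site is a 100-dimensional binary vector


def simulate_site(distance, view, rng, rate=1.0):
    """Return the aggregate distance of one simulated site from the reference site."""
    if view == "linear":
        n_changes = int(rate * distance)
    elif view == "quadratic":
        n_changes = int(rate * distance ** 2)
    else:
        raise ValueError("view must be 'linear' or 'quadratic'")
    site = [0] * N_FEATURES            # starts identical to the reference site
    for _ in range(n_changes):
        i = rng.randrange(N_FEATURES)  # randomly select a dimension
        x = rng.random()               # random number between 0 and 1
        site[i] = 1 if x > 0.5 else 0
    return sum(site)                   # aggregate distance = sum of elements


def simulate(distances, view, seed=0):
    """Simulate one site per given distance; returns (distance, aggregate) pairs."""
    rng = random.Random(seed)
    return [(d, simulate_site(d, view, rng)) for d in distances]


linear_results = simulate(range(0, 500), "linear")
quadratic_results = simulate(range(0, 100), "quadratic")
print(linear_results[:5], quadratic_results[-5:])
```

In this sketch a feature that is changed repeatedly ends up at 1 only about half the time, so the aggregate distance levels off near 50 as the number of changes grows, which is one reason the aggregate curve flattens with distance.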

Results

• The following figure shows the results of two simulations: one when the chance of change depends linearly on distance and one when it depends quadratically on distance.

• In both cases a logarithmic regression line is shown; this is the typical sub-linear Séguy curve.

• The results imply that geography has a linear effect on the likelihood of the diffusion of an individual item.

Results (cont.)

Results (cont.)

• However, applying local regression gives a similar logarithmic curve in the linear case, but reveals a different curve in the quadratic case.

• This suggests that quadratic influence also contributes to aggregate diffusion, as predicted by the gravity model.

• The following figure shows the results after applying local regression.

Results (cont.)

Conclusions

• Several points need to be investigated in further simulations, e.g.:

  • Restriction of changes to binary choices.

  • Limiting influence to only one center.

• Further studies are required to test diffusion with individual items.

• Still, it was shown that models of diffusion can be effectively tested quantitatively.

• There is a (sub-)linear correlation between linguistic distance and geographic distance.

References

• Chambers, J. & Trudgill, P. (1998). Dialectology. 2nd ed. Cambridge: Cambridge University Press.

• Goebl, H. (1982). Dialektometrie: Prinzipien und Methoden des Einsatzes der Numerischen Taxonomie im Bereich der Dialektgeographie. Wien: Österreichischen Akademie der Wissenschaften.

• Goebl, H. (1984). Dialektometrische Studien: Anhand italoromanischer, rätoromanischer und galloromanischer Sprachmaterialien aus AIS und ALF. 3 Vol. Tübingen: Max Niemeyer.

• Heeringa, W. & Nerbonne, J. (2002). Dialect Areas and Dialect Continua. Language Variation and Change 13, 375-398.

References (cont.)

• Kessler, B. (1995). Computational dialectology in Irish Gaelic. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics (pp. 60–66). San Francisco, CA: Morgan Kaufmann Publishers.

• Nerbonne, J. (2005). Dialectology: Aggregate Dialectal Variation. Presentation in LSA Linguistic Institute, Harvard and MIT. http://www.let.rug.nl/nerbonne/teach/dialectology/

• Nerbonne, J. & Kretzschmar, W. (2003). Introducing Computational Methods in Dialectometry. In Computational Methods in Dialectometry. Special issue of Computers and the Humanities, 37(3), 245-255.

References (cont.)

• Nerbonne, J. (2009). Measuring the Diffusion of Linguistic Change. To appear in Philosophical Transactions of the Royal Society B: Biological Sciences, ca. 2010, special issue with selection of papers from "Cultural and Linguistic Diversity", conference held at AHRC Centre for Evolution of Cultural Diversity, London, Dec. 9-13, 2008.
