Linked Data in Language Typology Digital Humanities Research Seminar University of Helsinki February 1, 2018 Kaius Sinnemäki General Linguistics, University of Helsinki



Today’s talk

• My background

• What is language typology?

• Rich data in typology

• Linked data possibilities in typology

• A case study


My background


MA 2004, general linguistics

• Corpus linguistics
  – Unix-based semi-automatic detection of deep embedding
• Syntactic analysis of different genres
  – Incl. stream-of-consciousness


PhD 2011, general linguistics

• Language complexity, a typological viewpoint.
• Testing domain:
  – Case marking, agreement, linear order.
• Statistics and R.

[Figure labels: Case marking, Agreement, Word order]


Recent & current projects

• Post doc, 2013- (Collegium & Academy of Finland)
  – How grammatical categories of the noun (e.g. case, definiteness, gender) interact with each other.
  – How linguistic structures adapt to sociolinguistic context.
  – Combining typological and experimental evidence.
• Digital Humanities at UH (w/ Mikko Tolonen)
  – Conferences in 2014-2015; Fennica-project 2015.
• Sacred in Secular Societies (w/ Janne Saarikivi) 2018
  – How religious concepts have been transformed into secular context and preserved there.


What is language typology?


1950s: Chomsky and Greenberg

• Linguistics dominated by structuralism from late 19th century to mid-20th century. Emphasis on variation: “Languages can differ from each other without limit and in unpredictable ways.” (Martin Joos 1957: 96)
• Chomsky: language is a separate module in the brain.
  → Universal grammar: all languages fundamentally the same.
  → Data from English. (Rationalism, cognitive science).
• Greenberg: is cross-linguistic diversity constrained?
  → Language universals (esp. on word order).
  → Data from many languages. (Empirical, anthropology).


• Language typology = worldwide comparison of languages to describe and explain differences and similarities across languages.
  – Major research questions: to what extent do different linguistic structures interact 1) among themselves or 2) with cognitive and cultural patterns (Bickel 2007)?
• Well-known for word order correlations:
  – The order of noun (N) and genitive (Gen) correlates with the order of object (O) and verb (V):
    • NGen + VO (book of John + eat an apple)
    • GenN + OV (John’s book + an apple eat)
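To make the idea of a word order correlation concrete, here is a minimal sketch in Python. It cross-tabulates two order features and computes a simple phi coefficient. The sample is invented toy data, not real typological data, and the function names are hypothetical. Actual typological work would also control for genealogical and areal relatedness, which this sketch ignores.

```python
from collections import Counter

# Invented toy sample (NOT real typological data): each tuple is
# (noun-genitive order, object-verb order) for one hypothetical language.
toy_sample = [
    ("NGen", "VO"), ("NGen", "VO"), ("NGen", "VO"), ("NGen", "OV"),
    ("GenN", "OV"), ("GenN", "OV"), ("GenN", "OV"), ("GenN", "VO"),
]


def cross_tabulate(pairs):
    """Count how often each combination of the two orders occurs."""
    return Counter(pairs)


def phi_coefficient(table):
    """Phi association for a 2x2 table: 0 means none, +1 or -1 perfect."""
    a = table[("NGen", "VO")]
    b = table[("NGen", "OV")]
    c = table[("GenN", "VO")]
    d = table[("GenN", "OV")]
    denom = ((a + b) * (c + d) * (a + c) * (b + d)) ** 0.5
    return (a * d - b * c) / denom if denom else 0.0


table = cross_tabulate(toy_sample)
phi = phi_coefficient(table)
print(dict(table))
print(round(phi, 2))
```

In this toy sample NGen mostly co-occurs with VO and GenN with OV, so phi is positive.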


Cross-linguistic comparison

• How are linguistic structures compared typologically?
• Key question: are there universal categories?
  – If yes → universal ontological system feasible.
    • Cf. General Ontology for Linguistic Description (GOLD).
  – If no → comparison has to be based on researchers’ tools
    • No right/wrong, just better/worse definitions for comparison
• Pooling/appending data from different sources
  – Are the definitions for a particular structure comparable?
  – If not, the only possibility is to analyse new data.


Rich data / big data in typology


“Big data” in cross-linguistic research

• Big data in linguistics: text corpora (e.g., Language Bank).

• Language typology is about language comparison.
  – How much is there to compare in languages?
  – A finite description of the grammar of even one language is impossible (Moscoso 2010).
• Currently about 7000 languages spoken (Hammarström et al. 2017).

→ There should be many reasons for “big data” research in language typology.


Computer-assisted linguistic and cultural research

• Open access databases since 2005; linked data possibilities since 2008 (World Atlas of Language Structures).

• Availability of new databases that contain linguistic and cultural data on languages and societies all over the world.

• Enable new research questions to be approached by new computer-assisted methods.


CLLD

• CLLD = Cross-Linguistic Linked Data (clld.org)
  – Hosts several large cross-linguistic datasets.
  – All openly available, repositories on GitHub.
  – Including (visit https://github.com/clld)
    • The World Atlas of Language Structures.
    • The Atlas of Pidgin and Creole Language Structures.
    • The World Loanword Database.
    • The South American Indigenous Language Structures.
    • Glottolog: catalogue of all languages, families and dialects (including bibliographic information).
  – Data largely in database format, not many texts.


D-PLACE

• Cultural, linguistic, environmental and geographic data on 1400+ societies (https://d-place.org).
  – Society = “represents a group of people in a particular locality, who often share a language and cultural identity.”
• Cultural descriptions tagged with date and ethnographic sources. Ethnographies based largely on pre-1950s work.
  – Ethnographic Atlas (Murdock 1962-1971). Human Relations Area Files.
  – Data from pre-1950s ethnographies.
• Also phylogenetic trees → phylogenetic comparative methods applicable
• Clone from https://github.com/D-PLACE/dplace-data


A silly exercise using D-PLACE

• Is there any relationship between:
  – Cultural trait “presence of trance”
  – Ecological factor “rain constancy”
  – (coursework w/ Hilde Schneemann, Andrea Bender and Mary Walworth)
• Phylogenetic computational methods:
  – Ancestral state reconstruction
  – Correlated changes.
• Result?
  – Negative coefficient, but non-significant (p = .063)


Data sources

• Typologists’ data sources are reference grammars.
  – Analysis “by hand”, seldom computer-assisted.
  – Time-consuming.
  – Are observations “languages” or “constructions”?
• Samples around 200-300 languages.
  – WALS: data on 2600+ languages, 190 structures. Gaps (Dryer & Haspelmath 2013).
  – Not big from statistical perspective, but “big” in comparison to the history of language typology.
• Compare with corpus linguistics:
  – Datapoints counted in tens of thousands or more.
  – About 125 000 hits for the verb oleskella ‘stay, dwell’ in the Finnish Korp corpus.


Linked data in typology


What to link?

• Usually languages. Problem: many alternative names.
  – Tenharim (Tupian) has 35 names (AUTOTYP).
  – Different databases name languages differently.
  – See e.g. discussion on http://dlc.hypotheses.org/623
• Solution: standard language identifiers
  – Problem:
    • different databases/catalogues use different identifier systems
    • Ethnologue (ISO 639-3), Glottolog, WALS, AUTOTYP.
  – Not a real problem, but there are still many-to-many mappings.


Tenharim, alternative names

• abahyba, Caripuna, Cauaiua, Cauhib, Cawahib
• Diahoi, Diahói, Diahui, Diarroi, Diarrui, Djahui
• Jahoi, Jahui, Jauareta-Tapiia, Jiahui, Juma, Yuma
• Kagwahiv, Kagwahív, Kagwahiva
• Karipuna, Karipuná, Kawahib, Kawaib
• Paranawat, Parintintim, Parintintin, Parintintín
• Pawaté-Wirafed
• Tenharem, Tenharim, Tenharím, Tenharin
• Tenharín, Tukumanfed, Uru-eu-uau-uau.
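The naming problem can be illustrated with a small Python sketch that maps spelling variants to one identifier. The identifier `lang-0001` is hypothetical (it is not a real ISO 639-3 code or Glottocode), and `normalize_name` and `lookup` are illustrative helpers, not part of any database's tooling. Normalization only collapses variants that differ in case or diacritics. Variants such as Tenharim and Tenharin still have to be listed explicitly.

```python
import unicodedata


def normalize_name(name):
    """Fold case and strip diacritics so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold().strip()


# Hypothetical identifier; not a real ISO 639-3 code or Glottocode.
TENHARIM_ID = "lang-0001"

# A few of the alternative names listed on the slide.
alternative_names = [
    "Tenharim", "Tenharím", "Tenharin", "Tenharín",
    "Kagwahiv", "Kagwahív", "Parintintin", "Parintintín",
]

# Every normalized variant points to the same identifier.
name_index = {normalize_name(n): TENHARIM_ID for n in alternative_names}


def lookup(name):
    """Return the identifier for a name, or None if the name is unknown."""
    return name_index.get(normalize_name(name))


print(lookup("TENHARÍM"))
print(lookup("Finnish"))
```

Eight listed spellings collapse into four index keys here, which shows how much of the variation is purely orthographic.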


• Conceptual work to be done:
  – What is a language in a database/catalogue?
  – Doculect, languoid, glossonym
  – Need to formalize the notion language
    • See Good & Cysouw (2013).
    • Also discussion on Diversity Linguistics Comment


A case study


Creoles vs. regular languages

• Languages are transmitted in different social conditions. Usually faithful transmission, with some restructuring.

• Some languages under heavy language contact.
  – Break in normal transmission.
  – Restructuring, structural simplification / complexification.
  → Pidgins, jargons, creoles, mixed languages.


• Creoles share many features → A creole typological profile?
  – Used for arguing about language evolution.
• Questions:
  – Do creoles differ from regular lgs? (Bakker et al. 2011)
  – Do the contributing languages (mostly Indo-European) differ from other languages of the world? (Cysouw 2009; Blasi et al. 2017)

[Three figure slides; source: Cysouw 2009]


• Some of the often cited examples of the creole profile deal with word order and argument marking: creoles have SVO and very little morphological marking (e.g. no case).

    A    boi   lobi   a    umapikin.
    DET  boy   love   DET  girl
         S     V           O
    ‘The boy loves the girl.’ (Sranan; Winford & Plag 2013)

• But: this correlation between SVO and no case marking (or morphological marking) also occurs in regular languages.
  – Is the correlation stronger in creoles than in regular languages?
  – If yes → evidence for a creole profile. If not, then not.


• Data on the case marking of 687 regular languages available in AUTOTYP (Bickel et al. 2017).

• Data on the word order of 1377 regular languages available in WALS (Dryer 2013).

• Data on the word order and case marking of 55 creoles available in APiCS (Michaelis et al. 2013).


• AUTOTYP metadata files:
  – AUTOTYP’s own language identifier (integer)
  – ISO 639-3 code for each language
  – Glottocode (Glottolog) for each language
• WALS and APiCS metadata files:
  – WALS code for each language (for WALS lgs)
  – ISO 639-3 code for each language
  – Glottocode (Glottolog) for each language

→ Should be straightforward to merge or join in R.
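The talk mentions merging in R. As a language-neutral illustration, here is a minimal Python sketch of an inner join on a shared identifier. The rows, the Glottocode-like values and the column names (`glottocode`, `case_marking`, `word_order`) are all invented for illustration and do not reflect the actual AUTOTYP or WALS file layouts.

```python
# Hypothetical rows; identifiers and values are made up for illustration.
autotyp_rows = [
    {"glottocode": "aaaa1111", "case_marking": "yes"},
    {"glottocode": "bbbb2222", "case_marking": "no"},
    {"glottocode": "cccc3333", "case_marking": "no"},
]
wals_rows = [
    {"glottocode": "aaaa1111", "word_order": "SOV"},
    {"glottocode": "bbbb2222", "word_order": "SVO"},
    {"glottocode": "dddd4444", "word_order": "VSO"},
]


def inner_join(left, right, key):
    """Return merged rows for every pair of left/right rows sharing `key`."""
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)
    merged = []
    for left_row in left:
        for right_row in right_by_key.get(left_row[key], []):
            merged.append({**left_row, **right_row})
    return merged


joined = inner_join(autotyp_rows, wals_rows, "glottocode")
for row in joined:
    print(row)
```

Only languages present in both tables survive an inner join, so the merged sample is smaller than either source.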


• But no: several many-to-many mappings.
  – Not all language identifiers match one-to-one.

[Figure labels: AUTOTYP, WALS]
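One way to find such problem cases before merging is to list every identifier that maps to more than one identifier in the other system. The Python sketch below assumes the mapping is available as a list of (AUTOTYP id, WALS code) pairs. All values are invented, and `ambiguous_mappings` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical identifier pairs (AUTOTYP id, WALS code); values are invented.
id_pairs = [
    (101, "abc"),
    (102, "abd"),
    (102, "abe"),  # one AUTOTYP id mapped to two WALS codes
    (103, "abf"),
    (104, "abf"),  # two AUTOTYP ids mapped to one WALS code
]


def ambiguous_mappings(pairs):
    """Return the identifiers on each side that do not map one-to-one."""
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in pairs:
        left_to_right[left].add(right)
        right_to_left[right].add(left)
    one_to_many = {k: sorted(v) for k, v in left_to_right.items() if len(v) > 1}
    many_to_one = {k: sorted(v) for k, v in right_to_left.items() if len(v) > 1}
    return one_to_many, many_to_one


one_to_many, many_to_one = ambiguous_mappings(id_pairs)
print(one_to_many)
print(many_to_one)
```

The flagged identifiers are the ones that need a manual decision, for example which of several varieties to treat as the same language.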


• Observations (lines) in AUTOTYP are constructions in languages
• Observations (lines) in WALS are languages

→ Many researcher-based choices and lots of cleaning necessary before merging is possible.
→ BUT: once the cleaning script is ready, linking should work automatically. Currently work in progress but almost there.
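The mismatch in units of observation can be sketched as follows: construction-level rows are collapsed to one value per language before joining with language-level data. This Python sketch uses invented rows and an "any construction has case" rule. That rule is only one possible researcher choice, picked for illustration, and is not necessarily the one used in the talk's cleaning script.

```python
from collections import defaultdict

# Hypothetical construction-level rows: several rows per language.
construction_rows = [
    {"glottocode": "aaaa1111", "construction": "noun", "has_case": True},
    {"glottocode": "aaaa1111", "construction": "pronoun", "has_case": True},
    {"glottocode": "bbbb2222", "construction": "noun", "has_case": False},
    {"glottocode": "bbbb2222", "construction": "pronoun", "has_case": True},
    {"glottocode": "cccc3333", "construction": "noun", "has_case": False},
]


def collapse_to_languages(rows):
    """Return one value per language.

    Assumption for illustration: a language counts as having case marking
    if any of its constructions has it.
    """
    by_language = defaultdict(list)
    for row in rows:
        by_language[row["glottocode"]].append(row["has_case"])
    return {code: any(values) for code, values in by_language.items()}


language_level = collapse_to_languages(construction_rows)
print(language_level)
```

A different rule (for example "all constructions" or "nouns only") would classify the second language differently, which is why such choices need to be documented.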


And the preliminary result… (Sinnemäki 2017)


Generalized mixed effects modeling

- Data: 55 creoles
- logit estimate: -4.6 ± 1.7; p < .001 ***
→ Correlation between word order and case marking.

- Data: 333 non-creoles
- logit estimate: -8.6 ± 3.8; p < .0001 ***
→ Correlation between word order and case marking.

case x word_order x lg_type
- estimate: -2.1 ± 1.6; p = .24
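Logit estimates are easier to read as odds ratios. The Python sketch below exponentiates the three estimates reported above, assuming they are coefficients on the log-odds scale, as "logit estimate" suggests. The slides do not state the reference levels or what the ± values denote, so the sketch makes no claim about either. The dictionary labels are only shorthand.

```python
import math


def odds_ratio(logit_estimate):
    """Convert a log-odds coefficient to an odds ratio."""
    return math.exp(logit_estimate)


# Estimates as reported on the slide; the labels are shorthand.
estimates = {
    "creoles": -4.6,
    "non-creoles": -8.6,
    "case x word_order x lg_type": -2.1,
}

odds_ratios = {label: odds_ratio(value) for label, value in estimates.items()}
for label, ratio in odds_ratios.items():
    print(f"{label}: {ratio:.5f}")
```

All three odds ratios are below 1, which matches the negative sign of the estimates. The non-significant three-way term is consistent with the slide's point that the correlation is not clearly stronger in creoles.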


Conclusions

• More typological data openly accessible

• Universal ontological systems not necessarily feasible for a typologist

• Linking data between datasets possible but requires time-consuming cleaning

• The available datasets enable old questions to be answered in new ways with computational methods.


Thank you!


References

Bakker, P. et al. 2011. Creoles are typologically distinct from non-creoles. Journal of Pidgin and Creole Languages 26(1): 5-42.
Bickel, B. 2007. Typology in the 21st century: Major current developments. Linguistic Typology 11(1): 239-251.
Bickel, B. et al. 2017. The AUTOTYP typological databases. Version 0.1.0. https://github.com/autotyp/autotyp-data.
Blasi, D. et al. 2017. Grammars are robustly transmitted even during the emergence of creole languages. Nature Human Behaviour 1(10): 723-729.
Cysouw, M. 2009. APiCS, WALS, and the creole typological profile (if any). Presentation at the 1st APiCS Conference, 5-8 November 2009, Leipzig.
Dryer, M. 2013. Order of subject, object and verb. In M. Dryer & M. Haspelmath (eds.).
Dryer, M. & M. Haspelmath (eds.) 2013. The World Atlas of Language Structures Online. Leipzig: MPI for Evolutionary Anthropology. http://wals.info.
Good, J. & M. Cysouw 2013. Languoid, Doculect, and Glossonym: Formalizing the Notion ‘Language’. Language Documentation and Conservation 7: 331-359.
Hammarström, H. et al. 2017. Glottolog 3.0. Jena: MPI for the Science of Human History.
Joos, M. (ed.) 1957. Readings in Linguistics: The Development of Descriptive Linguistics in America Since 1925. Washington: American Council of Learned Societies.
Michaelis, S. et al. (eds.) 2013. Atlas of Pidgin and Creole Language Structures Online. Leipzig: MPI for Evolutionary Anthropology.
Moscoso del Prado Martín, F. 2010. The effective complexity of language: English requires at least an infinite grammar. Ms. http://www.moscosodelprado.net.
Simons, G. F. & C. D. Fennig (eds.) 2017. Ethnologue: Languages of the World, 20th edn. Dallas, TX: SIL International. http://www.ethnologue.com.
Sinnemäki, K. 2017. How useful are creoles in language evolution research? Evaluating cross-linguistic universals of word order and argument marking. Invited talk at the 30th Annual CUNY Conference on Human Sentence Processing, April 1, 2017, Massachusetts Institute of Technology.
Winford, D. & I. Plag 2013. Sranan structure dataset. In S. Michaelis et al. (eds.).