Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Computational Humanities
John NerbonneRijksuniversiteit Groningen
Netherlands Institute for Advanced StudyWassenaar
Feb. 11, 2010
Computational Humanities Workshop Task
• ... with respect to the 10 KNAW institutes active in researchand documentation in the Humanities
• ... and especially with respect to the non-linguistic institutes
• ... to advise the KNAW how to structure a program to allowthese institutes to function as a “technical vanguard,” and inparticular what some “focal points” might be.
1
A View from Computational Linguistics
John Nerbonne
• Computational Linguist (CL), former ACL president
• Chair of alfa-informatica ’humanities computing’ (HC),Groningen
– ... including trade history and architectural history
• author of several articles on humanities computing, incl.authorship recognition, philosophy of science (via CLontology induction), HC as a discipline
• member of ALLC executive board, 2006-pres.
• program chair, Digital Humanities 2010 (London)
...sympathetic to the cause!
2
Committee’s Recommendations
• Formalization
• Representation, Frames and Associations
• Modeling
... all great stuff, but quite generic. I also miss detection andmetrics (measurement).
... Most of what follows sketches what I think has been asuccessful research line in dialectology with some reflection.
3
Computational Humanities
Theses:
• The goal of Humanities Computing is to further Humanitiesthrough computing.
• The primary advantage of computing in the Humanities is itsability to confront lots of data scientifically.
We contribute mainly to Humanities scholarship (Linguistics,History, Archaeology, Literature, . . .) and not to computerscience, digital culture, applications of humanities, the conceptof science, ...
4
Lest we Forget: We’re Humanists
Humanities Computing is “...die Fortsetzung derGeisteswissenschaften mit anderen Mitteln.” (with apologies toClausewitz, Brandt-Cortsius).
• Unsurprising, but significant
– HC must engage traditional Humanities– HC’s primary value is within traditional Humanities– First proof, then (perhaps) radical novelty
—Needing to “bootstrap” to new questions
Which (traditional) Humanities problems are we solving?
5
Good Humanities Computing Projects
• What humanities question is posed? Which humanitiesscholars want to know the answer?
• Is the computer essential? Is lots of data involved?
• Do we really have an answer, or just an analysis?
• Suppose you succeed, what will you be in a position to do?
6
Dialect Geography
Basic idea: apply string-distance measure to dialect atlasmaterial.
Results: new perspectives on language variation.
—but success creates lots of new questions
Collaboration with Wilbert Heeringa, Charlotte Gooskens, PeterKleiweg, Therese Leinonen, Jelena Prokic, Bob Shackleton,Christine Siedle, Marco Spruit and Martijn Wieling
7
What humanities question is posed?
Long-standing problem: How to characterize effect ofgeography on language variation?
• Detailed studies of single features ([aI] vs. [a], [æ] vs. [æ@],[u] vs. [y] vs [øy]) are at best inconclusive, at worstmisleading.
• Coseriu (1956) warned against “atomism” in dialectology
• Bloomfield (1933) “Every word has its history”
• Chambers & Trudgill (1999:97) “no one has succeeded indevising [...] principles to determine which isoglosses [...]should outrank others.”
Leading idea: characterize aggregates, exploiting computationalpower.
8
Bloomfield (1933)
9
Is the Computer Essential?
Levenshtein Distance: Cost of least costly set of operationsmapping one string into another.
Operation Costæ @ f t @ n 0næ f t @ n 0n delete @ d(@,[])=0.3æ f t @ r n 0n insert r d([],r)=0.2æ f t @ r n u n replace [0] with u d([0],[u])=0.1
Total 0.6
Operation costs derived from phonetic distance measures ofindividual sounds (ignored here, available in publications)
Used in genetics, software engineering, ethnology, ...
10
Computing Levenshtein Distance
æ f t @ r n u n0 1 2 3 4 5 6 7 8
æ 1 0 1 2 3 4 5 6 7@ 2 1 2 3 2 3 4 5 6f 3 2 1 2 3 4 5 6 7t 4 3 2 1 2 3 4 5 6@ 5 4 3 2 1 2 3 4 5n 6 5 4 3 2 3 2 3 40 7 6 5 4 3 4 3 4 5n 8 7 6 5 4 5 4 5 4
We simplify costs (everything = 1) for illustration.
11
Is lots of data involved?
• Reeks Nederlandse Dialectatlanten (RND)
– 2,000 sites, 141 sentences/site, undigitized– selection 350 sites, 150 wd/site– maps below involve ≈ 2.3× 108 sound comparisons
• Linguistic Atlas of the Middle and South Atlantic States(LAMSAS)
– 1,162 interviews, 150 wd/site, digitally available thanks toBill Kretzschmar
• Norwegian, German, Sardinian, Catalan, Tuscan (Italian),Bulgarian, Louisiana French, Forest Bantu, Turkic,non-Chinese Sinitic, ... . . .
12
Early Results
We obtain the sum of 150 edit distances on pairs ofcorresponding words.
Imagine an automobile club distance chart.
Assen Delft Kollum Nes SoestAssen 0 73 64 67 79Delft 73 0 81 74 68Kollum 64 81 0 43 91Nes 67 74 43 0 86Soest 79 68 91 86 0
13
Visualizing Early Results
Aalten
Almelo
Almkerk
Alveringem
Assen
Baelen
Bedum
Beek
Bekegem
Bergum
Bierum
BornBottelare
Boutersem
Breskens
Damme
Delft
Den Burg
Doetinchem
Dokkum
DrongelenDussen
Emmen
Eupen
Ferwerd
Geel
Gemert
Geraardsbergen
Gits
Goirle
Grimbergen
Groningen
Grouw
Haarlem
HardenbergHeemskerk
Heerenveen
Heerhugowaard
Hekelgem
Helmond
Hengelo
Hollum
Holwerd
Hoogstede
Houthalen
Huizen
Ingooigem
Kapelle
Kapelle−BroekKerkrade
Kieldrecht
Kinrooi
Klaaswaal
Kollum
Lamswaarde
Laren
Lebbeke
Leeuwarden
Lippelo
Makkum
Mechelen
Middelburg
Nes
Nieuwveen
Nordhorn
Norg
Ommen
Oosterhout
Oost−Vlieland
Polsbroek
Putten
RavensteinRenesse
Reninge
Riethoven
Roodeschool
Roosendaal
Roswinkel
RuinenSchagen
Schiermonnikoog
Sneek
Soest
Spankeren
Stadskanaal
Steenbeek
Steenwijk
Stiens
Surhuisterveen
Tienen
Vaassen
Veenendaal
Venray
Vianen
Vreren
Werchter
West−Terschelling
Wijhe
Wijnegem
Winschoten
Zelzate
Zevendonk
Zierikzee
Zoetermeer
14
Comparison to Traditional Scholarship
Aalten
Almelo
Almkerk
Alveringem
Assen
Baelen
Bedum
Beek
Bekegem
Bergum
Bierum
Born
Bottelare
Boutersem
Breskens
Damme
Delft
Den Burg
Doetinchem
Dokkum
DrongelenDussen
Emmen
Eupen
Ferwerd
Geel
Gemert
Geraardsbergen
Gits
Goirle
Grimbergen
Groningen
Grouw
Haarlem
HardenbergHeemskerk
Heerenveen
Heerhugowaard
Hekelgem
Helmond
Hengelo
Hollum
Holwerd
Hoogstede
Houthalen
Huizen
Ingooigem
Kapelle
Kapelle−BroekKerkrade
Kieldrecht
Kinrooi
Klaaswaal
Kollum
Lamswaarde
Laren
Lebbeke
Leeuwarden
Lippelo
Makkum
Mechelen
Middelburg
Nes
Nieuwveen
Nordhorn
Norg
Ommen
Oosterhout
Oost−Vlieland
Polsbroek
Putten
RavensteinRenesse
Reninge
Riethoven
Roodeschool
Roosendaal
Roswinkel
RuinenSchagen
Schiermonnikoog
Sneek
Soest
Spankeren
Stadskanaal
Steenbeek
Steenwijk
Stiens
Surhuisterveen
Tienen
Vaassen
Veenendaal
Venray
Vianen
Vreren
Werchter
West−Terschelling
Wijhe
Wijnegem
Winschoten
Zelzate
Zevendonk
Zierikzee
Zoetermeer
Obtained via clustering on distance table.
Jibes well with traditional scholarship, but without the subjectiveelement of emphasizing only a small selection of data
15
Piloting
• What humanities question is posed? Who wants to know?
• Is the computer essential? Is lots of data involved?
• Do we really have an answer, or just an analysis?
– You’ve just modified your algorithm, sifted through yourdata, tweaked your implementation, so . . .
– How can you show you’re right, and not just assiduous?
• Suppose you succeed, what will you be in a position to do?
16
Calibrating the instrument
• Consistency: Cronbach α ≥ 0.7 at 30 wd., ≥ 0.98 at 150 wds.
• Validity
– Application to unseen data, languages [successful]– Comparison to expert consensus– Comparison to lay dialect speakers’ judgments
Note that the validity question was latent in non-computationaltreatments
17
Consensus View: Van Ginneken ∩ Daan
Aalten
Almelo
Almkerk
Alveringem
Assen
Baelen
Bedum
Beek
Bekegem
Bergum
Bierum
Born
Bottelare
Boutersem
Breskens
Damme
Delft
Den Burg
Doetinchem
Dokkum
DrongelenDussen
Emmen
Eupen
Ferwerd
Geel
Gemert
Geraardsbergen
Gits
Goirle
Grimbergen
Groningen
Grouw
Haarlem
HardenbergHeemskerk
Heerenveen
Heerhugowaard
Hekelgem
Helmond
Hengelo
Hollum
Holwerd
Hoogstede
Houthalen
Huizen
Ingooigem
Kapelle
Kapelle−BroekKerkrade
Kieldrecht
Kinrooi
Klaaswaal
Kollum
Lamswaarde
Laren
Lebbeke
Leeuwarden
Lippelo
Makkum
Mechelen
Middelburg
Nes
Nieuwveen
Nordhorn
Norg
Ommen
Oosterhout
Oost−Vlieland
Polsbroek
Putten
RavensteinRenesse
Reninge
Riethoven
Roodeschool
Roosendaal
Roswinkel
RuinenSchagen
Schiermonnikoog
Sneek
Soest
Spankeren
Stadskanaal
Steenbeek
Steenwijk
Stiens
Surhuisterveen
Tienen
Vaassen
Veenendaal
Venray
Vianen
Vreren
Werchter
West−Terschelling
Wijhe
Wijnegem
Winschoten
Zelzate
Zevendonk
Zierikzee
Zoetermeer
Van Ginneken + Daan
18
Matrix Comparison
(Ward’s, UPGMA) Clusterings vs. Consensus Expert Opinion
See Proc. Gesellschaft für Klassifikation, 2002 application ofmodified Rand index of classification agreement.
Problem: Are we just reinforcing encrusted tradition here?Answer: Not if we don’t insist on 100% agreement.
The humanities feels a stronger need than other scholarly areasto compare current analyses to older one.
19
Lay Dialect Speakers’ Distance Judgments
Heeringa (Diss., 2004) and Gooskens and Heeringa (to appear,Lg. Var. & Change) demonstrate r > 0.7 between perceptualdistances and distances determined via Levenshtein algorithm.
Reflection:
• Old debate: subjectively vs. objectively based dialectology.
• Computationally based methods are extremely flexible
• Is it not imperative to examine correspondences withsubjective judgments (lay speakers’ judgments)?
—Else little is firm.
20
Piloting
• What humanities question is posed? Who wants to know?
• Is the computer essential? Is lots of data involved?
• Do really have an answer, or just an analysis?
• Suppose you succeed, what will you be in a position to do?
21
A New Perspective
Aalten
Almelo
Almkerk
Alveringem
Assen
Baelen
Bedum
Beek
Bekegem
Bergum
Bierum
BornBottelare
Boutersem
Breskens
Damme
Delft
Den Burg
Doetinchem
Dokkum
DrongelenDussen
Emmen
Eupen
Ferwerd
Geel
Gemert
Geraardsbergen
Gits
Goirle
Grimbergen
Groningen
Grouw
Haarlem
HardenbergHeemskerk
Heerenveen
Heerhugowaard
Hekelgem
Helmond
Hengelo
Hollum
Holwerd
Hoogstede
Houthalen
Huizen
Ingooigem
Kapelle
Kapelle−BroekKerkrade
Kieldrecht
Kinrooi
Klaaswaal
Kollum
Lamswaarde
Laren
Lebbeke
Leeuwarden
Lippelo
Makkum
Mechelen
Middelburg
Nes
Nieuwveen
Nordhorn
Norg
Ommen
Oosterhout
Oost−Vlieland
Polsbroek
Putten
RavensteinRenesse
Reninge
Riethoven
Roodeschool
Roosendaal
Roswinkel
RuinenSchagen
Schiermonnikoog
Sneek
Soest
Spankeren
Stadskanaal
Steenbeek
Steenwijk
Stiens
Surhuisterveen
Tienen
Vaassen
Veenendaal
Venray
Vianen
Vreren
Werchter
West−Terschelling
Wijhe
Wijnegem
Winschoten
Zelzate
Zevendonk
Zierikzee
Zoetermeer
Foundation for dialect continuum via MDS on distance table.
22
New Deployments
sd sg
schoonebeek
gramsbergen
coevorden
emlichheim
hoogstede
nieuw schoonebeek
bergentheim
radewijk
itterbeck
wilsum
langeveen
vasse
uelsen neuenhaus
lage
lattropnordhorn
• Effects of borders (Heeringa et al., 2000)
• Historical convergence, etc. (Heeringa & Nerbonne, 2000)
• Minority language status (Heeringa & Gooskens, 2004)
• Variation in lexicon, syntax (Nerbonne & Kleiweg, 2003)
23
Technical Innovation
Composite Clustering, repeated clustering w. noise “projects”distance table onto map, countering instability in clustering.
24
New Questions
Now that we can measure linguistic distances reliably, can weask the fundamental question more satisfactorily?
How does geography influence linguistic variation?
25
Trudgill’s Gravitational View
Peter Trudgill suggests that language varieties may be subjectto a “gravity law”, being attracted to one another in a way likethe way planets are attracted to the sun.
F = Gm1m2
r2
F Force due to gravity,
m1,m2 Masses of the two objects attracting each other,
r (Physical) Distance between them, and
G “Universal Gravitational constant.”
26
Linguistic Cohesion via Gravity
F = Gm1m2
r2∝ p1p2
r2
m1,m2 Populations (p1, p2) of settlements, and
r (Physical) Distance between them
D ∝ 1/F ∝ r2
p1p2
D Linguistic Distance
Idea: contact promotes ling. accommodation, similarity.
27
Motivating Linguistic Cohesion via Gravity
Chance of social contact should be
• Proportional to the product of settlement size and
• (if travel is random) Inversely proportional to squared distance
Notate bene: we measure linguistic dissimilarity, which wepostulate stands in inverse relation to the attractive force ofsocial contact.
28
Quadratic? or Logarithmic?
0 50000 100000 150000 200000
510
1520
Linguistic Distance vs. Geographic Distance
Gravity predicts a positive quadratic effect!Geographic Distance (m)
Dia
lect
Dis
tanc
e
Shape? Zero? (r2 = 0.50 vs. 0.58)
29
Interpreting Results
Trudgill’s gravity model
• Attraction is relatively stronger over short distances
• Therefore linguistic distances should be relatively smallerover these short geographic distances
Observations
• Linguistic distance increases positively with geographicdistance, but
• Effect is proportionately greater over short distances ratherthan proportionately smaller
30
Speculation on Cultural Dynamics
Not attraction, as Trudgill postulates, but rather differentiationis the fundamental cultural dynamic.
It is natural to see this grow relatively weaker over longdistances.
In spite of enormous linguistic pressures towardaccommodation.
Philosophical Transactions of the Royal Society, to appear
31
Further Results
• Very weak (inverse) correlation ling. distance with populationsize
• Similar curves in German, American English, Bantu,Bulgarian, Norwegian, French
• Gooskens (2004) improves curve fit using 19th cent. traveltime instead of geography
—No improvement in (flat) Netherlands (van Gemert),substantial improvement in rugged Norway (Gooskens)
32
Good Humanities Computing
Project guidelines:
• What humanities question is posed? Who wants to know?
• Is the computer essential? Is lots of data involved?
• Do really have an answer, or just an analysis?
• Suppose you succeed, what will you be in a position to do?
33
Conclusions and Prospects
Thesis: The challenge of Humanities Computing is to furtherHumanities scholarship by confronting lots of data scientifically.
Subthemes:
• Seek solid answers to Humanities questions.
• Let paradigm shifts take care of themselves.
• Our competitive edge is in confronting lots of data.
www.rug.nl/let/informatiekunde
34
Opportunity
Scholarly areas that focus on text are poised to take advantageof all the work in CL, text-based IR and stylometry.
• detect and identify differences in political, philosophicalvocabularies, phrasings
– synchronic differences: schools of thought?– diachronic differences: changes in thinking?
• detect and trace the diffusion of new concepts throughvocabulary
• detect and identify the expression of attitudes (sentiment)toward specific events, issues
35