
A multivariate approach to linguistic variation and distribution

Stefan Evert
Corpus Linguistics Group
FAU Erlangen-Nürnberg

Saarbrücken, 13 March 2015
www.linguistik.fau.de | www.stefan-evert.de


Linguistic variation

Variation of a quantitative linguistic feature
  – frequency of passive, past perfect, split infinitive, …
  – frequency of expression, semantic field, topic, …
  – etc.

across
  – languages and language varieties
  – regions
  – social strata
  – time
  – individual speakers
  – etc.


The traditional approach

[Figure: relative frequency of passives (%) in AmE and BrE, one panel per text category: press reportage, press editorial, skills / hobbies, miscellaneous, learned, general fiction, science fiction, adventure, romance]

§ Select a linguistic feature (e.g. passive voice)
§ Compare its frequency across different categories (genres, language varieties, speakers, …)


Passives in Brown corpus (1960s AmE)

[Figure: histogram of the proportion of passives (%) per text against the number of texts; observed distribution vs. binomial distribution]

texts as random samples from a body of language → binomial distribution

Language variation as a nuisance parameter in corpus linguistics
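
The binomial assumption can be checked against data. The following is a minimal Python sketch with made-up counts (not the Brown corpus figures): if texts were random samples from one body of language, the variance of the per-text proportion of passives should be close to the binomial variance p(1 − p)/n, whereas genuine variation between texts shows up as a much larger observed variance. The function name `dispersion_ratio` and the toy data are illustrative assumptions.

```python
from statistics import mean, pvariance

def dispersion_ratio(passives, totals):
    """Ratio of observed to binomial variance of per-text proportions.

    passives[i] -- number of passive verb phrases in text i
    totals[i]   -- number of verb phrases (trials) in text i
    Assumes that all texts have roughly the same size.
    """
    props = [k / n for k, n in zip(passives, totals)]
    p = sum(passives) / sum(totals)     # pooled proportion over all texts
    n = mean(totals)                    # typical number of trials per text
    expected = p * (1 - p) / n          # binomial variance of a proportion
    observed = pvariance(props)         # variance actually seen across texts
    return observed / expected

# Toy data (hypothetical): four texts with 100 verb phrases each
ratio = dispersion_ratio([5, 10, 30, 35], [100, 100, 100, 100])
print(round(ratio, 2))
```

A ratio close to 1 would be compatible with the random-sample view; a ratio far above 1 indicates variation between texts that the binomial model treats as noise.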


The multivariate approach

§ Different linguistic features often show similar patterns of variation
§ E.g. passives and nominalizations

[Figure: scatterplot of passives / 1000 words against nominalizations / 1000 words]


§ Such correlations can be exploited to determine major dimensions of variation
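
How strongly two features vary together can be quantified with the Pearson correlation coefficient. The sketch below is a plain-Python illustration; the frequencies are invented for the example and are not taken from the scatterplot.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical frequencies per 1000 words in five texts
passives = [5.0, 12.0, 20.0, 28.0, 35.0]
nominalizations = [8.0, 15.0, 31.0, 38.0, 52.0]
r = pearson(passives, nominalizations)
print(round(r, 3))
```

A value of r near +1 or −1 means that one feature carries much of the information in the other, which is what makes a lower-dimensional description possible.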


The multivariate approach

§ Multivariate analysis exploits correlations between features in order to determine latent dimensions
  – interpreted as underlying “causes” of variation
§ An inductive, data-driven approach
  – no theoretical assumptions about linguistic variation and categories / sub-corpora to be compared
§ Pioneering work by Doug Biber (1988, 1993, 1995, …)
  – “multidimensional analysis” of register variation
§ Related approaches: correspondence analysis, distributional semantics, topic modelling, …


Biber's multidimensional analysis


Table 5.7 Linguistic features used in the analysis of English

A. Tense and aspect markers
1 Past tense
2 Perfect aspect
3 Present tense

B. Place and time adverbials
4 Place adverbials (e.g., above, beside, outdoors)
5 Time adverbials (e.g., early, instantly, soon)

C. Pronouns and pro-verbs
6 First-person pronouns
7 Second-person pronouns
8 Third-person personal pronouns (excluding it)
9 Pronoun it
10 Demonstrative pronouns (that, this, these, those as pronouns)
11 Indefinite pronouns (e.g., anybody, nothing, someone)
12 Pro-verb do

D. Questions
13 Direct WH questions

E. Nominal forms
14 Nominalizations (ending in -tion, -ment, -ness, -ity)
15 Gerunds (participial forms functioning as nouns)
16 Total other nouns

F. Passives
17 Agentless passives
18 by-passives

G. Stative forms
19 be as main verb
20 Existential there

H. Subordination features
21 that verb complements (e.g., I said that he went)
22 that adjective complements (e.g., I'm glad that you like it)
23 WH-clauses (e.g., I believed what he told me)
24 Infinitives
25 Present participial adverbial clauses (e.g., Stuffing his mouth with cookies, Joe ran out the door)
26 Past participial adverbial clauses (e.g., Built in a single week, the house would stand for fifty years)
27 Past participial postnominal (reduced relative) clauses (e.g., the solution produced by this process)
28 Present participial postnominal (reduced relative) clauses (e.g., The event causing this decline was ...)
29 that relative clauses on subject position (e.g., the dog that bit me)
30 that relative clauses on object position (e.g., the dog that I saw)
31 WH relatives on subject position (e.g., the man who likes popcorn)
32 WH relatives on object position (e.g., the man who Sally likes)
33 Pied-piping relative clauses (e.g., the manner in which he was told)
34 Sentence relatives (e.g., Bob likes fried mangoes, which is the most disgusting thing I've ever heard of)
35 Causative adverbial subordinator (because)
36 Concessive adverbial subordinators (although, though)
37 Conditional adverbial subordinators (if, unless)
38 Other adverbial subordinators (e.g., since, while, whereas)

I. Prepositional phrases, adjectives, and adverbs
39 Total prepositional phrases
40 Attributive adjectives (e.g., the big horse)
41 Predicative adjectives (e.g., The horse is big.)
42 Total adverbs

J. Lexical specificity
43 Type-token ratio
44 Mean word length

K. Lexical classes
45 Conjuncts (e.g., consequently, furthermore, however)
46 Downtoners (e.g., barely, nearly, slightly)
47 Hedges (e.g., at about, something like, almost)
48 Amplifiers (e.g., absolutely, extremely, perfectly)
49 Emphatics (e.g., a lot, for sure, really)
50 Discourse particles (e.g., sentence-initial well, now, anyway)
51 Demonstratives

L. Modals
52 Possibility modals (can, may, might, could)
53 Necessity modals (ought, should, must)
54 Predictive modals (will, would, shall)

M. Specialized verb classes
55 Public verbs (e.g., assert, declare, mention)
56 Private verbs (e.g., assume, believe, doubt, know)
57 Suasive verbs (e.g., command, insist, propose)
58 seem and appear

N. Reduced forms and dispreferred structures
59 Contractions
60 Subordinator that deletion (e.g., I think [that] he went)
61 Stranded prepositions (e.g., the candidate that I was thinking of)
62 Split infinitives (e.g., He wants to convincingly prove that ...)
63 Split auxiliaries (e.g., They were apparently shown to ...)

O. Co-ordination
64 Phrasal co-ordination (NOUN and NOUN; ADJ and ADJ; VERB and VERB; ADV and ADV)
65 Independent clause co-ordination (clause-initial and)

P. Negation
66 Synthetic negation (e.g., No answer is good enough for Jones)
67 Analytic negation (e.g., That's not likely)

factor analysis (FA)

The Multi-Dimensional Approach to Linguistic Analyses of Genre Variation, p. 335

co-occur frequently in texts because they serve some shared, underlying communicative functions associated with the situational contexts of the texts.

Table 2 summarizes the co-occurring features associated with each of the five dimensions. The decimal numbers on this table represent the factor "loadings" for each linguistic feature. Loadings can run from -1.0 to +1.0; the further from 0.0 a loading is, the more one can generalize from the factor in question to the particular linguistic feature. Features with higher loadings are thus better representatives of the dimension underlying a factor. In Table 2, only features with loadings larger than 0.35 (plus or minus) are included.

Most of the dimensions consist of two groupings of features, having positive and negative loadings. Positive or negative sign does not indicate a more-or-less relationship; rather, these two groups represent sets of features that occur in a complementary pattern. That is, when the features in one group occur together frequently in a text, the features in the other group are markedly less frequent in that text, and vice versa. To interpret the dimensions, it is important to consider likely reasons for the complementary distribution of these two groups of features as well as the reasons for the co-occurrence pattern within each group.

For example, consider Dimension 2. The features in the top group (the positive loadings above the dashed line on Table 2) are past tense verbs, perfect aspect verbs, third person pronouns and public verbs (primarily speech act verbs), while the features in the bottom group (the negative loadings) are present tense verbs and adjectives. Considering all of the features on Dimension 2, this dimension is interpreted as distinguishing narrative discourse from other types of discourse,

TABLE 2 Summary of the co-occurrence patterns underlying five major dimensions of English.

DIMENSION 1 (Informational vs. Involved)
nouns 0.80
word length 0.58
prepositional phrases 0.54
type / token ratio 0.54
attributive adjs. 0.47
- - - - -
private verbs -0.96
that deletions -0.91
contractions -0.90
present tense verbs -0.86
2nd person pronouns -0.86
do as pro-verb -0.82
analytic negation -0.78
demonstrative pronouns -0.76
general emphatics -0.74
first person pronouns -0.74
pronoun it -0.71
be as main verb -0.71
causative subordination -0.66
discourse particles -0.66
indefinite pronouns -0.62
general hedges -0.58
amplifiers -0.56
sentence relatives -0.55
WH questions -0.52
possibility modals -0.50
non-phrasal coordination -0.48
WH clauses -0.47
final prepositions -0.43

DIMENSION 2 (Narrative versus Non-Narrative)
past tense verbs 0.90
third person pronouns 0.73
perfect aspect verbs 0.48
public verbs 0.43
synthetic negation 0.40
present participial clauses 0.39
- - - - -
present tense verbs -0.47
attributive adjs. -0.41

DIMENSION 3 (Elaborated vs. Situated Reference)
WH relative clauses on object positions 0.63
pied piping constructions 0.61
WH relative clauses on subject position 0.45
phrasal coordination 0.36
nominalizations 0.36
- - - - -
time adverbials -0.60
place adverbials -0.49
other adverbs -0.46

DIMENSION 4 (Overt Expression of Persuasion)
infinitives 0.76
prediction modals 0.54
suasive verbs 0.49
conditional subordination 0.47
necessity modals 0.46
split auxiliaries 0.44
possibility modals 0.37
- - - - -
[No complementary features]

DIMENSION 5 (Abstract versus Non-Abstract Style)
conjuncts 0.48
agentless passives 0.43
past participial clauses 0.42
BY-passives 0.41
past participial WHIZ deletions 0.40
other adverbial subordinators 0.39
- - - - -
[No complementary features]

Computational Linguistics Volume 19, Number 2, p. 230

[Figure: registers plotted on Dimension 1 (vertical axis, INFORMATIONAL at the top, INVOLVED at the bottom) against Dimension 3 (horizontal axis, SITUATED to ELABORATED): newspaper reportage, academic prose, newspaper editorials, broadcasts, fiction, professional letters, personal letters, spontaneous speeches, conversations]

Figure 1 Linguistic characterization of nine spoken and written registers with respect to Dimension 1 ('Informational versus Involved Production') and Dimension 3 ('Elaborated versus Situation-Dependent Reference').


Problems

§ Design bias
  – choice of features
  – selection of text samples
§ Involves a miracle
  – and it isn't even a very robust one
§ Interpretation bias
  – arbitrary cutoff for feature weights (“loadings”)
  – risk of reading one's own expectations into features
§ More subtle patterns of variation invisible


Reproducing Biber's dimensions

§ Sample of 923 medium-length published texts from written part of British National Corpus (BNC)
§ Covers 4 different text types + male/female authors
  – academic writing, non-academic prose, fiction, misc.
§ Biber features extracted automatically with Python script (Gasthaus 2007)
§ Factor analysis with 4 latent dimensions + varimax
  – seems to yield the most clearly structured analysis


Design bias: choice of features

[Figure: feature-by-feature matrix plot; rows and columns are labelled with the Biber features (f01 past tense, f02 perfect aspect, f03 present tense, …); two groups of features are annotated as "correlated with verb frequency" and "correlated with noun frequency"]

Design bias: choice of texts

[Figure: scatterplot of latent dimension 2 (overt persuasion + other) against latent dimension 1 (narrative/involved vs. non-narrative/informational); texts marked by text type: academic, fiction, misc_published, prose]

[Figure: the scatterplot of latent dimension 2 against latent dimension 1 (text types: academic, fiction, misc_published, prose), shown alongside Figure 1 from Computational Linguistics Volume 19, Number 2, p. 230 (registers on Dimension 1 vs. Dimension 3)]

Interpretation bias

[Figure: the scatterplot of latent dimension 2 against latent dimension 1 (text types: academic, fiction, misc_published, prose), shown alongside Figure 1 from Computational Linguistics Volume 19, Number 2, p. 230 (registers on Dimension 1 vs. Dimension 3) and marked with a question mark]


Blindness to subtle patterns

§ But research shows that author gender can be identified with high accuracy
  – Koppel et al. (2003): 77.3% with function words + POS n-grams
  – Gasthaus (2007): 82.9% with SVM on Biber features
§ This dataset: 82.3% accuracy
  – baseline: 73.1%

[Figure: scatterplot of latent dimension 2 (overt persuasion + other) against latent dimension 1 (narrative/involved vs. non-narrative/informational); texts marked by author gender: female, male]

Our approach (Diwersy, Evert & Neumann 2014)

§ Assumption: (Euclidean) distances meaningful
  – as a measure of linguistic similarity of texts
  – depends crucially on choice of features
§ Visualization to interpret geometric configuration
§ Orthogonal projection = perspective on data
  – (squared) distances decompose into preserved structure + orthogonal (hidden) component
  – optimal projection: principal component analysis (PCA)
§ Minimally supervised intervention
  – based on externally observable, theory-neutral information
  – method: linear discriminant analysis (LDA)
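
The decomposition of squared distances under an orthogonal projection can be sketched in a few lines of plain Python. The example projects onto a single axis and uses made-up points; the function names are illustrative and not part of the cited work. PCA can then be read as choosing the axes for which the preserved part is as large as possible over all texts.

```python
from math import sqrt

def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))

def split_squared_distance(x, y, axis):
    """Split the squared Euclidean distance between x and y into the part
    preserved by orthogonal projection onto `axis` and the hidden remainder."""
    norm = sqrt(dot(axis, axis))
    unit = [a / norm for a in axis]       # unit vector spanning the projection
    diff = [a - b for a, b in zip(x, y)]
    total = dot(diff, diff)               # squared distance in feature space
    preserved = dot(diff, unit) ** 2      # squared distance after projection
    hidden = total - preserved            # orthogonal (hidden) component
    return total, preserved, hidden

# Two hypothetical texts in a 3-dimensional feature space, projected onto the first axis
total, preserved, hidden = split_squared_distance(
    [3.0, 4.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0])
print(total, preserved, hidden)
```

Because the two parts always add up to the full squared distance, a projection can hide structure but never invent it.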


Case studies

§ Translation effects and register variation in German and English (Evert & Neumann in prep.)
§ Regional varieties of French, based on colligational frequencies in newspaper texts (Diwersy et al. 2014)
§ Work in progress: Authorship attribution with Burrows Delta (Evert et al. 2015)


Case study 1: CroCo

§ CroCo: parallel corpus English/German
  – English-German and German-English translation pairs
  – 454 texts from 8 different genres
§ 28 lexico-grammatical features (Neumann 2013)
  – comparable between languages, try to reduce correlations
  – inspired by SFL and translation studies
§ Text = point in 28-dimensional feature space
§ PCA identifies latent dimensions of variation
  – FA results are very similar → comparable to Biber approach
§ Focus on English texts here (originals and translations)

Diwersy, Evert & Neumann (2014); Evert & Neumann (in prep.)


Methodological issues

§ Feature scaling
§ Choice of features
§ Choice of texts
§ Delicate effects are obscured

[Figure: distributions of six example features (nn_T, infinitive_S, modals_V, pronouns_T, verb.theme_TH, titles_T) on their original scale, axis from 0 to 80]


[Figure: distributions of the same six features (nn_T, infinitive_S, modals_V, pronouns_T, verb.theme_TH, titles_T) after rescaling, axis from −2 to 8]
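
Feature scaling matters because Euclidean distances are otherwise dominated by the features with the largest numeric range. The slides do not name the scaling method; a common choice, assumed here purely for illustration, is z-score standardization, which gives every feature mean 0 and standard deviation 1. The data below are made up.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, divide by the standard deviation."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return [0.0 for _ in values]      # a constant feature carries no information
    return [(v - m) / s for v in values]

# Hypothetical values of one feature in five texts
z = standardize([10.0, 20.0, 30.0, 40.0, 50.0])
print([round(v, 3) for v in z])
```

After this step all features contribute on a comparable scale, so that no single feature decides the distances on account of its units alone.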


Case study 1: CroCo

[Figure: feature-by-feature matrix plot for the 28 CroCo features: lexical density, nn / T, adja / T, nominal / T, verb theme / TH, imperatives / S, lexical TTR, passive / V, infinitives / F, modals / V, subj theme / TH, coordination / T, titles / T, obj theme / TH, prep / T, modal adv / T, finites / S, adv theme / TH, token / S, text theme / TH, subordination / T, time adv / T, place adv / T, interrogatives / S, colloquialism / T, past / V, contractions / T, pronouns / T]

[Figures: CroCo texts plotted in the latent dimensions (axes from −5 to 5), marked by genre: essay, fiction, instruction, popsci, share, speech, tourism, web]


§ Focus on first two latent dimensions (→ Biber's map)
§ Describe genre by centroid and confidence ellipse
§ Comparison with Hotelling's T² test
  – essay vs. speech
  – T² = 2.512, df = 2/80, p = .0875 (n.s.)
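
For two genres described by their positions in two latent dimensions, the two-sample Hotelling T² statistic can be computed directly. The sketch below is a plain-Python illustration with made-up points; it assumes equal covariance in both groups and converts T² into an F statistic with (p, n1 + n2 − p − 1) degrees of freedom for p = 2. It does not compute a p-value, which would require the F distribution, and it is not the code used for the result reported above.

```python
def mean_vec(points):
    """Centroid of a list of 2-dimensional points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """Entries (xx, xy, yy) of the 2x2 matrix of summed deviations from centroid m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    return sxx, sxy, syy

def hotelling_t2(a, b):
    """Two-sample Hotelling T^2 for 2-dimensional points.

    Returns (T^2, F, (df1, df2)).
    """
    n1, n2, p = len(a), len(b), 2
    ma, mb = mean_vec(a), mean_vec(b)
    axx, axy, ayy = scatter(a, ma)
    bxx, bxy, byy = scatter(b, mb)
    dof = n1 + n2 - 2
    # pooled covariance matrix
    sxx, sxy, syy = (axx + bxx) / dof, (axy + bxy) / dof, (ayy + byy) / dof
    det = sxx * syy - sxy ** 2
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # d' S^-1 d, using the explicit inverse of a 2x2 matrix
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    t2 = n1 * n2 / (n1 + n2) * maha
    f = (n1 + n2 - p - 1) / (p * dof) * t2
    return t2, f, (p, n1 + n2 - p - 1)

# Two hypothetical genres with four texts each
genre_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
genre_b = [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0), (3.0, 3.0)]
t2, f, df = hotelling_t2(genre_a, genre_b)
print(t2, f, df)
```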

[Figure: genre centroids with confidence ellipses, latent dimension 2 (horizontal axis) against latent dimension 1 (vertical axis); genres: essay, fiction, instruction, popsci, share, speech, tourism, web]

How about subtle patterns?

§ PCA dimensions fail to distinguish translations from original texts

§ But an SVM classifier can do this with 85% accuracy

§ Replace one PCA dimension with LDA discriminant for orig vs. trans
– external & theory-neutral information

[Figure: scatterplot panels; legends: genre (essay, fiction, instruction, popsci, share, speech, tourism, web) and translation status (orig, trans)]
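One way to combine an LDA discriminant with PCA dimensions is sketched below. It is an illustration under stated assumptions, not the procedure used for the slides: the function names are hypothetical, the two classes are coded 0/1, a small ridge term is added so the within-class scatter can be inverted, and the remaining dimensions are obtained by PCA on the data after the discriminant direction has been projected out.

```python
import numpy as np

def lda_direction(X, y):
    """Unit-length Fisher discriminant for two classes (labels 0 and 1)."""
    X0, X1 = X[y == 0], X[y == 1]
    # pooled within-class scatter matrix
    Sw = (np.cov(X0, rowvar=False) * (len(X0) - 1)
          + np.cov(X1, rowvar=False) * (len(X1) - 1))
    # small ridge term (an assumption of this sketch) keeps Sw invertible
    Sw += 1e-6 * np.trace(Sw) / X.shape[1] * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))
    return w / np.linalg.norm(w)

def lda_plus_pca(X, y, n_pca=1):
    """Coordinates on [LDA discriminant, first n_pca orthogonal PCA dimensions]."""
    Xc = X - X.mean(axis=0)
    w = lda_direction(Xc, y)
    scores = Xc @ w
    residual = Xc - np.outer(scores, w)    # remove the discriminant direction
    # principal axes of what is left; they are orthogonal to w
    _, _, Vt = np.linalg.svd(residual, full_matrices=False)
    basis = np.vstack([w, Vt[:n_pca]])
    return Xc @ basis.T, basis
```

Because the basis is orthonormal, distances in the resulting plot are still distances in a projection of the original feature space, so the picture remains comparable to a plain PCA view.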


Finding the right perspective

[Figure: texts by translation status; legend: orig, trans]

Interpreting discriminant features

[Figure: LDA for translated vs. original texts (English); density plot over discriminant score; legend: orig (originals), trans (translations)]

Interpreting discriminant features

[Figure: bar chart of LDA feature weights (vertical axis: weight, from −1.5 to 0.5) for the features nn_T, adja_T, nominal_T, finites_S, past_V, passive_V, modals_V, imperatives_S, interrogatives_S, coordination_T, subordination_T, pronouns_T, place.adv_T, time.adv_T, adv.theme_TH, text.theme_TH, obj.theme_TH, verb.theme_TH, subj.theme_TH, prep_T, modal.adv_T, contractions_T, colloquialism_T, titles_T, lexical.density, lexical.TTR, token_S, infinitives_F]

Interpreting discriminant features

[Figure: 28 panels, one per feature, comparing orig and trans (vertical axis: raw value; legend: status orig, trans). Panel labels: (−) nn_T, adja_T, nominal_T, (−) finites_S, past_V, passive_V, modals_V, imperatives_S, interrogatives_S, (−) coordination_T, (−) subordination_T, (−) pronouns_T, (−) place.adv_T, time.adv_T, (−) adv.theme_TH, (−) text.theme_TH, obj.theme_TH, verb.theme_TH, (−) subj.theme_TH, (−) prep_T, modal.adv_T, contractions_T, (−) colloquialism_T, (−) titles_T, (−) lexical.density, (−) lexical.TTR, token_S, (−) infinitives_F]

Interpreting discriminant features

[Figure: 28 panels, one per feature, comparing orig and trans (vertical axis: contribution, from −5.0 to 2.5; legend: status orig, trans). Panel labels: (−) nn_T, adja_T, nominal_T, (−) finites_S, past_V, passive_V, modals_V, imperatives_S, interrogatives_S, (−) coordination_T, (−) subordination_T, (−) pronouns_T, (−) place.adv_T, time.adv_T, (−) adv.theme_TH, (−) text.theme_TH, obj.theme_TH, verb.theme_TH, (−) subj.theme_TH, (−) prep_T, modal.adv_T, contractions_T, (−) colloquialism_T, (−) titles_T, (−) lexical.density, (−) lexical.TTR, token_S, (−) infinitives_F]
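The step from feature weights to per-feature contributions can be illustrated with a small sketch. It assumes, without confirmation from the slides, that a feature's contribution to a text's discriminant score is its weight multiplied by the text's standardised feature value, so that the contributions sum to the score. The weights and values below are hypothetical and chosen only for illustration.

```python
def contributions(weights, values):
    """Per-feature contribution to a linear discriminant score: weight * value."""
    return {name: weights[name] * values[name] for name in weights}

# Hypothetical weights and standardised values for one text (not from the slides)
weights = {"passive_V": 0.4, "pronouns_T": -0.8, "lexical.density": -1.2}
text_z = {"passive_V": 1.0, "pronouns_T": -0.5, "lexical.density": 0.25}

contrib = contributions(weights, text_z)
score = sum(contrib.values())   # the discriminant score is the sum of contributions

for name, value in sorted(contrib.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:16s} {value:+.2f}")
print(f"{'score':16s} {score:+.2f}")
```

A feature with a negative weight and a below-average value contributes positively, which is why weights and raw values alone can be misleading when read separately.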

Interpreting geometric configurations: German vs. English

[Figure: legends: language (DE, EN) and translation status (orig, trans)]

Discriminant for DE/EN: Evidence for shining through & prestige?

[Figure: density plot over discriminant score; legend: DE: orig, DE: trans, EN: orig, EN: trans]

Case study 2: French regional varieties

§ Lexical differences in regional varieties of French

§ Two nation-wide newspapers each from 6 countries
– Cameroon, France, Ivory Coast, Morocco, Senegal, Tunisia
– two consecutive volumes from each newspaper
– total size approx. 14.5 million tokens

§ Text samples = one week each

§ Features: frequencies of shared colligations
– lemma-function pairs
– must occur in all subcorpora with f ≥ 100

Diwersy, Evert & Neumann (2014)
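The selection of shared features (lemma-function pairs occurring in all subcorpora with f ≥ 100) can be sketched as a simple filter. The data structure, the function name, the example lemmas, and the function labels such as "subj" are hypothetical and only serve to show the thresholding logic.

```python
def shared_features(freq_by_subcorpus, min_freq=100):
    """Return features whose frequency is >= min_freq in every subcorpus."""
    subcorpora = list(freq_by_subcorpus.values())
    candidates = set(subcorpora[0])
    for freqs in subcorpora:
        candidates = {f for f in candidates if freqs.get(f, 0) >= min_freq}
    return sorted(candidates)

# Toy frequency tables keyed by (lemma, function); all values are invented
toy = {
    "CAM": {("président", "subj"): 250, ("ministre", "obj"): 120, ("cfa", "mod"): 300},
    "FRA": {("président", "subj"): 400, ("ministre", "obj"): 90},
    "TUN": {("président", "subj"): 150, ("ministre", "obj"): 130, ("dinar", "mod"): 200},
}

shared = shared_features(toy)
print(shared)
```

In this toy example the country-specific items ("cfa", "dinar") drop out because they are missing from at least one subcorpus, and ("ministre", "obj") drops out because it falls below the threshold in one of them.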

Case study 2: French regional varieties

[Figure: scatterplot panels of PCA dimensions; legends: newspaper codes and country codes (CAM, CIV, FRA, MAR, SEN, TUN)]

§ PCA including country-specific words as features: perfect separation

§ Design bias results in a completely uninteresting model

§ FA not applicable: features >> texts
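The remark that there are far more features than texts can be made concrete with a small sketch of PCA via singular value decomposition. The sizes and the random data are toy assumptions. The point it shows is that n centred texts span at most n − 1 dimensions, so the feature covariance matrix is singular; this is one likely reason why methods that need to invert it, such as maximum-likelihood factor analysis, cannot be fitted, while PCA still works.

```python
import numpy as np

rng = np.random.default_rng(0)
n_texts, n_features = 6, 50                 # toy sizes with features >> texts
X = rng.normal(size=(n_texts, n_features))  # invented data, rows = texts

Xc = X - X.mean(axis=0)                     # centre each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                              # PCA coordinates of the texts
explained = s**2 / np.sum(s**2)             # share of variance per dimension
n_components = int(np.sum(s > 1e-10 * s[0]))

print(n_components)                         # at most n_texts - 1
```

The SVD route avoids forming the 50 × 50 covariance matrix at all, which is convenient when the number of features runs into the thousands.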

Case study 2: French regional varieties

[Figure: scatterplot panels of PCA dimensions; legends: newspaper codes and country codes (CAM, CIV, FRA, MAR, SEN, TUN)]

§ Using only shared words as features, PCA no longer reveals any patterns (just a few outliers)

§ Use LDA to find a meaningful perspective, based on newspaper source

§ Country would presume regional varieties exist!

Case study 2: French regional varieties

[Figure: scatterplot panels (axes from −10 to 10); legends: newspaper codes and country codes (CAM, CIV, FRA, MAR, SEN, TUN)]

LDA dimensions (newspapers)

[Figure: legends: newspaper codes and country codes (CAM, CIV, FRA, MAR, SEN, TUN)]

Discriminant axes (newspapers)

[Figure: six density panels titled CAM, CIV, FRA, MAR, SEN, TUN; horizontal axis: discriminant score, vertical axis: density; legend: CAM, CIV, FRA, MAR, SEN, TUN]

THANK YOU!


References

Biber, Douglas (1988). Variation Across Speech and Writing. Cambridge University Press, Cambridge.

Diwersy, Sascha; Evert, Stefan; Neumann, Stella (2014). A weakly supervised multivariate approach to the study of language variation. In B. Szmrecsanyi & B. Wälchli (eds.), Aggregating Dialectology, Typology, and Register Analysis. Linguistic Variation in Text and Speech. De Gruyter, Berlin.

Evert, Stefan & Neumann, Stella (in prep.). The impact of translation direction on the characteristics of translated texts: a multivariate analysis for English and German.

Evert, Stefan; Proisl, Thomas; Schöch, Christof; Jannidis, Fotis; Pielström, Steffen; Vitt, Thorsten (2015). Explaining Delta, or: How do distance measures for authorship attribution work? Presentation at Corpus Linguistics 2015, Lancaster, UK.

Gasthaus, Jan (2007). Prototype-Based Relevance Learning for Genre Classification. B.Sc. thesis, Universität Osnabrück, Institute of Cognitive Science.

Koppel, Moshe; Argamon, Shlomo; Shimoni, Anat Rachel (2003). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4), 401–412.

Neumann, Stella (2013). Contrastive Register Variation. A Quantitative Approach to the Comparison of English and German. de Gruyter Mouton, Berlin.