Some statistical methods on syntactic variables in L1 writing Report from an ongoing study Bård Uri Jensen PhD student UiB / Hedmark University College (Hamar) Solstrand 2010-03-26



Contents

• Introducing the project
• The ELEV corpus vs the ASK corpus
• Extracting data
• Analysing data


My doctoral project

• Research question
– Do people tend to make different grammatical choices when they type on a keyboard rather than write by hand?
• Hypotheses
– Higher production speed affects the choices in a "spontaneous" direction
– Skilled writers may utilise the enhanced functionality and shift features in the opposite direction
– Other psychological factors may affect the choices
  • motivational factors
  • social media norms


The ELEV corpus

• A "parallel" corpus of hand-written and keyboarded texts
– Two texts by each pupil
• The ASK corpus system
• Manual syntactic segmentation
– t-units
– clauses
– fragments
• No error tags


<t-unit>Alle mennesker er forskjellige,</t-unit>

<t-unit>Kvinnfolk driver på data</t-unit>
<t-unit>og gutter leser bøker</t-unit>

<t-unit>Jeg liker å få på ski. Fordi det gir meg bedre kondisjon.</t-unit>

<t-unit>All humans are different,</t-unit>

<t-unit>Women use computers</t-unit>
<t-unit>and boys read books</t-unit>

<t-unit>I like cross-country skiing. Because it gives me better stamina.</t-unit>


<t-unit type="imp">drikk deg full.</t-unit>

<t-unit type="spm">Er dette en sunn utvikling?</t-unit>

<t-unit type="imp">get (yourself) drunk.</t-unit>

<t-unit type="spm">Is this a healthy development?</t-unit>


<t-unit>Politiet vet <clause type="nominal">det er folk under 18 <clause type="relativ">som drikker der,</clause></clause></t-unit>

<t-unit>The police know <clause type="nominal">there are people under 18 <clause type="relativ">who drink there,</clause></clause></t-unit>


<frag>Men hva med andre bøker?</frag>

<t-unit type="frag">men veit da om flere jenter <clause type="relativ">som ikke gjør det også!</clause></t-unit>

<frag>But what about other books?</frag>

<t-unit type="frag">but [I] know about several girls <clause type="relativ">who don’t do it also!</clause></t-unit>


<t-unit type="spm">Er dette en <corr sic="sund">sunn</corr> utvikling?</t-unit>

<t-unit type="spm">Is this a <corr sic="helthy">healthy</corr> development?</t-unit>


Corpus searches

[features='.* subst .*'];

<t-unit>[]*</t-unit>;

<t-unit_type="imp">[]*</t-unit>;

<t-unit>[]{5,10}</t-unit>;

<t-unit>([lemma='\$.']*[!lemma='\$.']){5,10}[lemma='\$.']*</t-unit>;
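The last query above appears to select t-units that contain between 5 and 10 tokens that are not punctuation, with punctuation tokens allowed anywhere in between. A minimal Python sketch of the same counting logic follows. It assumes that punctuation tokens carry a lemma of the form `$` plus one character, as the pattern `'\$.'` suggests; the function names and the example lemma list are hypothetical and not taken from the ELEV corpus.

```python
import re

# Same pattern as lemma='\$.' in the queries: a literal "$" plus any one character.
# The corpus tool matches the pattern against the whole lemma, hence fullmatch().
PUNCT_LEMMA = re.compile(r"\$.")


def word_count(lemmas):
    """Count the tokens whose lemma is not a punctuation lemma."""
    return sum(1 for lemma in lemmas if not PUNCT_LEMMA.fullmatch(lemma))


def in_length_range(lemmas, low=5, high=10):
    """True if the t-unit has between low and high non-punctuation tokens."""
    return low <= word_count(lemmas) <= high


# Hypothetical lemma sequence for one t-unit, ending in a comma token.
example = ["politi", "vite", "det", "være", "folk",
           "under", "18", "som", "drikke", "der", "$,"]

print(word_count(example), in_length_range(example))
```

Counting outside the query language like this is only useful for checking results on a small sample; the corpus query remains the place where the selection is done.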

Corpus searches: frontal subclauses

<t-unit> [features='.* konj .*']?(<clause_type="nominal"> | <clause_type="relativ"> | <clause_type="adverbial">) [];

Corpus searches: embedding

<t-unit>[!clause]+<clause>[]*</clause>[!clause]+</t-unit>;

<t-unit>[!clause]+<clause_type!="relativ">[]*</clause>[!clause]+</t-unit>;

Corpus searches: lexical distribution

[lemma!='\$.'];

[features=".* verb .*"];
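The two queries above give the counts needed for a per-word frequency such as "verbs per word": the number of verb tokens divided by the number of non-punctuation tokens. The sketch below shows that arithmetic in Python under stated assumptions: the feature strings are space-delimited with a leading and trailing space (so that `.* verb .*` can match), and the token list and tag names are made up for illustration.

```python
import re

PUNCT_LEMMA = re.compile(r"\$.")       # mirrors [lemma!='\$.']
VERB_FEATURES = re.compile(r".* verb .*")  # mirrors [features=".* verb .*"]


def verbs_per_word(tokens):
    """tokens: list of (lemma, features) pairs for one text.

    Returns verb tokens divided by non-punctuation tokens.
    """
    words = [(lemma, feats) for lemma, feats in tokens
             if not PUNCT_LEMMA.fullmatch(lemma)]
    verbs = [1 for _, feats in words if VERB_FEATURES.fullmatch(feats)]
    return len(verbs) / len(words)


# Hypothetical annotation of "Kvinnfolk driver på data ."
tokens = [
    ("kvinnfolk", " subst appell "),
    ("drive", " verb pres "),
    ("på", " prep "),
    ("data", " subst appell "),
    ("$.", " clb "),
]

print(verbs_per_word(tokens))
```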

Statistics: Three examples

• Some simple analyses
– differences of mean
– correlations

• Classification analysis

• Clustering


Mean & correlation

[Figure: Verb frequency by mode. Box plot of verbs per word (about 0.18 to 0.26) for hand (H) and keyboard (T) texts. A second build adds a density plot of the same data for hand and keyboard (N = 60, bandwidth = 0.006501).]

[Figure: Noun frequency by mode. Box plot of nouns per word (about 0.16 to 0.30) for hand (H) and keyboard (T) texts. A second build adds a density plot of the same data for hand and keyboard (N = 60, bandwidth = 0.01227).]

[Figure: Nouns by pupil. Noun frequency (about 0.16 to 0.30) plotted against pupils (x-axis 200 to 320). A second build marks two means.]

[Figure: Nouns sorted. Noun frequency (about 0.18 to 0.30) plotted against pupils (x-axis 0 to 60). Later builds add a legend distinguishing hand and keyboard texts.]

[Figure: Noun frequency by text length. Noun frequency (about 0.16 to 0.30) plotted against text length in words (200 to 1400), shown in several builds. The final build is titled "Nouns by length" and adds a legend distinguishing hand and keyboard texts.]

[Figure: Nouns by text length. Noun frequency (y-axis ticks -1.8 to -1.2) plotted against text length in words (x-axis ticks 5.5 to 7.0), apparently with both axes on a logarithmic scale, shown in several builds with a legend distinguishing hand and keyboard texts.]

[Figure: Noun frequency by mode. Density plot for hand and keyboard texts (N = 60, bandwidth = 0.01227).]

Kolmogorov-Smirnov test: D = 0.283, p = 0.016
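The D reported by a two-sample Kolmogorov-Smirnov test is the largest vertical distance between the two empirical cumulative distribution functions. The sketch below computes that statistic in plain Python. The data values are invented for illustration and are not the ELEV measurements, and the p-value calculation is left out; a statistics package would normally supply it.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the maximum absolute difference between the empirical CDFs
    of a and b. Both CDFs are step functions that only change at
    observed values, so checking every observed value is sufficient.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Invented noun frequencies, for illustration only.
hand = [0.20, 0.22, 0.25, 0.27]
keyboard = [0.18, 0.21, 0.23, 0.24]

print(ks_statistic(hand, keyboard))
```

Unlike a t-test, this statistic reacts to any difference in the shape of the two distributions, not only to a shift in the mean.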

[Figure: Noun frequency by mode. Density plot for hand and keyboard texts (N = 60, bandwidth = 0.01227), with the two means marked.]

Welch two-sample t-test: p = 0.007
Student's paired t-test: p = 0.007
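The two tests differ in what they treat as the unit of observation: Welch's test compares two independent groups with possibly unequal variances, while the paired test works on the within-pupil differences. The following sketch computes both test statistics and their degrees of freedom in plain Python. The numbers are invented, the function names are hypothetical, and turning t into a p-value requires a t distribution from a statistics package, which is omitted here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's two-sample t statistic and Welch-Satterthwaite df."""
    vx = variance(x) / len(x)   # squared standard error of mean(x)
    vy = variance(y) / len(y)   # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def paired_t(x, y):
    """Paired t statistic and df; x[i] and y[i] belong to the same pupil."""
    diffs = [a - b for a, b in zip(x, y)]
    t = mean(diffs) / sqrt(variance(diffs) / len(diffs))
    return t, len(diffs) - 1


# Invented scores for four writers, one value per mode.
hand = [1.0, 2.0, 3.0, 4.0]
keyboard = [2.0, 4.0, 5.0, 7.0]

print(welch_t(hand, keyboard))
print(paired_t(hand, keyboard))
```

Pairing pays off mainly when the two measurements from the same person are strongly correlated; when they are not, the two tests tend to give similar results.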

[Figure: Nouns sorted. Noun frequency (about 0.18 to 0.30) plotted against pupils (x-axis 0 to 60), with a legend distinguishing hand and keyboard texts, shown in several builds.]

Pearson's correlation = 0.107, p = 0.416
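Pearson's r and the t statistic used to test it can be computed with a few lines of Python. This is a sketch with hypothetical function names, not the analysis script behind the slides. If n = 60 pairs is assumed, r = 0.107 gives t of roughly 0.82 on 58 degrees of freedom, which is in line with the non-significant p-value shown.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_for_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


# n = 60 is an assumption about the number of pupil pairs.
print(t_for_r(0.107, 60))
```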


Classification analysis

• Independent variables (parameters)
– writing mode
  • hand ~ keyboard
– writing skills
  • medium ~ high
– gender
– essay question
• Dependent variable
– freq of attributive adjectives
– subclause freq

Attributive adjectives per noun

[Figure: tree diagram (TRUE/YES = left branch). Root split: Mode=H. Further splits: gender=J (on both branches), Text=A1 and skills=M. Leaf means: 0.1269 (n=30), 0.1429 (n=30), 0.1324 (n=14), 0.1547 (n=16), 0.149 (n=15), 0.1802 (n=15).]
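A tree with a numeric response, like the one for attributive adjectives per noun, is typically grown by repeatedly choosing the split that most reduces the within-group sum of squared deviations. The sketch below shows one such split search over categorical predictors. It is an illustration of the general technique, not the software or settings used for the slides; the rows, the values and the gender level "G" are invented.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, predictors, response):
    """Return (gain, predictor, level) for the best 'predictor == level' split.

    gain is the reduction in SSE obtained by splitting the rows into
    those where the predictor equals the level and all the others.
    """
    total = sse([r[response] for r in rows])
    best = None
    for name in predictors:
        for level in sorted({r[name] for r in rows}):
            left = [r[response] for r in rows if r[name] == level]
            right = [r[response] for r in rows if r[name] != level]
            if not left or not right:
                continue  # not a real split
            gain = total - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, name, level)
    return best


# Invented data: four texts with mode, gender and adjectives per noun.
rows = [
    {"Mode": "H", "gender": "J", "adj": 0.12},
    {"Mode": "H", "gender": "G", "adj": 0.13},
    {"Mode": "T", "gender": "J", "adj": 0.17},
    {"Mode": "T", "gender": "G", "adj": 0.18},
]

print(best_split(rows, ["Mode", "gender"], "adj"))
```

Applying the same search again inside each resulting group yields the nested splits seen in a tree diagram; real implementations add stopping rules and pruning, which are not shown here.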

[Figure: Attributive adjectives per noun. Box plot (about 0.05 to 0.25) for hand (H) and keyboard (T) texts. A second build adds a density plot for hand and keyboard (N = 60, bandwidth = 0.01271) with the two means marked.]

Subclauses per t-unit

[Figure: tree diagram (TRUE/YES = left branch). Root split: Text=A1. Further splits: Mode=H and skills=M. Leaf means: 0.8547 (n=30), 1.135 (n=30), 1.072 (n=30), 1.217 (n=30).]

[Figure: Subclause freq by mode. Box plot (about 0.5 to 1.5) for hand (H) and keyboard (T) texts. Later builds add a box plot of subclause frequency by text and mode (H.A1, T.A1, H.A2, T.A2) and a density plot of subclauses per t-unit for T1 Hand, T1 Keyb, T2 Hand and T2 Keyb (N = 30, bandwidth = 0.112).]

Welch two-sample: p = 0.002


Cluster analysis

• About 50 dependent variables

[Figure: scatter plot of the texts on Factor 1 (21.5 %) and Factor 2 (14.1 %), both axes from -0.3 to 0.3. Each text is plotted as H (hand) or T (keyboard), together with the variable labels Nominaliseringer, Subkl3, Subkl5, RelCl, CausalAdvCl, KlAdvInnr9, CondAdvCl, KlNom0iForfelt, FrontalRel, infinitives, NOTModalPart and Doubling_SAA. Later builds show the H texts and the T texts in separate panels.]



Bård Uri Jensen
Hedmark University College (Hamar)

http://www.hihm.no

http://privat.hihm.no/[email protected]

http://privat.hihm.no/buj/solstrand2010.pdf