
Some statistical methods on syntactic variables in L1 writing

Report from an ongoing study

Bård Uri Jensen
PhD student
UiB / Hedmark University College (Hamar)
Solstrand 2010-03-26

Contents

• Introducing the project
• The ELEV corpus vs the ASK corpus
• Extracting data
• Analysing data

My doctoral project

• Research question
  – Do people tend to make different grammatical choices when they type on a keyboard rather than write by hand?
• Hypotheses
  – Higher production speed affects the choices in a "spontaneous" direction
  – Skilled writers may utilise the enhanced functionality and shift features in the opposite direction
  – Other psychological factors may affect the choices
    • motivational factors
    • social media norms

The ELEV corpus

• A "parallel" corpus of hand-written and keyboarded texts
  – Two texts by each pupil
• The ASK corpus system
• Manual syntactic segmentation
  – t-units
  – clauses
  – fragments
• No error tags

<t-unit>Alle mennesker er forskjellige,

</t-unit>

<t-unit>Kvinnfolk driver på data

</t-unit><t-unit>

og gutter leser bøker</t-unit>

<t-unit>Jeg liker å få på ski. Fordi det gir meg bedre kondisjon.

</t-unit>

<t-unit>All humans are different,

</t-unit>

<t-unit>Women use computers

</t-unit><t-unit>

and boys read books</t-unit>

<t-unit>I like cross-country skiing. Because it gives me better stamina.

</t-unit>

<t-unit type="imp">drikk deg full.

</t-unit>

<t-unit type="spm">Er dette en sunn utvikling?

</t-unit>

<t-unit type="imp">get (yourself) drunk.

</t-unit>

<t-unit type="spm">Is this a healthy development?

</t-unit>

<t-unit>Politiet vet <clause type="nominal">

det er folk under 18 <clause type="relativ">

som drikker der,</clause>

</clause> </t-unit>

<t-unit>The police know <clause type="nominal">

there are people under 18

<clause type="relativ">who drink

there,</clause>

</clause> </t-unit>

<frag>Men hva med andre bøker?

</frag>

<t-unit type="frag"> men veit da om flere jenter <clause type="relativ">

som ikke gjør det også!</clause>

</t-unit>

<frag>But what about other books?

</frag>

<t-unit type="frag"> but [I] know about several girls<clause type="relativ">

who don’t do it also!</clause>

</t-unit>

<t-unit type="spm">Er dette en <corr sic="sund">

sunn</corr> utvikling?

</t-unit>

<t-unit type="spm">Is this a <corr sic="helthy">

healthy</corr> development?

</t-unit>
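The annotation shown above can be read with any XML parser. The following is a minimal sketch in Python, assuming the mark-up is well-formed and wrapped in a hypothetical `<text>` root element (the corpus's real file layout is not shown on the slides). It counts t-units, fragments and clauses, and recovers the original spelling from `corr` elements.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample assembled from the slide examples. The <text> root element is an
# assumption, added only so that the snippet parses as one document.
SAMPLE = """<text>
<t-unit>Politiet vet <clause type="nominal">det er folk under 18
<clause type="relativ">som drikker der,</clause></clause></t-unit>
<t-unit type="spm">Er dette en <corr sic="sund">sunn</corr> utvikling?</t-unit>
<frag>Men hva med andre bøker?</frag>
</text>"""


def summarise(xml_text):
    """Count segmentation units and collect spelling corrections."""
    root = ET.fromstring(xml_text)
    t_units = list(root.iter("t-unit"))
    clauses = list(root.iter("clause"))
    return {
        "t_units": len(t_units),
        "fragments": len(list(root.iter("frag"))),
        "clauses": len(clauses),
        "clause_types": Counter(c.get("type") for c in clauses),
        # (original spelling, corrected spelling) pairs
        "corrections": [(c.get("sic"), c.text) for c in root.iter("corr")],
        "subclauses_per_t_unit": len(clauses) / len(t_units),
    }


summary = summarise(SAMPLE)
print(summary)
```

Because `iter()` walks the whole tree, nested clauses (the relative clause inside the nominal clause) are counted as well.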

Corpus searches

[features='.* subst .*'];

<t-unit>[]*</t-unit>;

<t-unit_type="imp">[]*</t-unit>;

<t-unit>[]{5,10}</t-unit>;

<t-unit>([lemma='\$.']*[!lemma='\$.']){5,10}[lemma='\$.']*</t-unit>;
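The last query above appears to select t-units that contain five to ten tokens other than punctuation, with punctuation tokens (lemma matching `\$.`) allowed anywhere in between. A rough Python equivalent is sketched below. It assumes that each t-unit is available as a list of (word, lemma) pairs and that punctuation lemmas consist of `$` plus one character; the function names and the sample lemmas are hypothetical.

```python
import re

# Same pattern as in the query: "$" followed by exactly one character.
PUNCT_LEMMA = re.compile(r"\$.")


def count_words(t_unit):
    """Number of tokens whose lemma is not a punctuation lemma."""
    return sum(1 for _word, lemma in t_unit if not PUNCT_LEMMA.fullmatch(lemma))


def select_by_length(t_units, low=5, high=10):
    """Keep t-units with between `low` and `high` non-punctuation tokens."""
    return [tu for tu in t_units if low <= count_words(tu) <= high]


# Hypothetical tagged t-units as (word form, lemma) pairs.
t_units = [
    [("Er", "være"), ("dette", "dette"), ("en", "en"), ("sunn", "sunn"),
     ("utvikling", "utvikling"), ("?", "$?")],
    [("drikk", "drikke"), ("deg", "deg"), ("full", "full"), (".", "$.")],
]

selected = select_by_length(t_units)
print(len(selected))
```

Only the first t-unit is kept: it has five words, while the second has three.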

Corpus searches : frontal subclauses

<t-unit> [features='.* konj .*']?(<clause_type="nominal"> | <clause_type="relativ"> | <clause_type="adverbial">) [];

Corpus searches : embedding

<t-unit>[!clause]+<clause>[]*</clause>[!clause]+</t-unit>;

<t-unit>[!clause]+<clause_type!="relativ">[]*</clause>[!clause]+</t-unit>;
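The two embedding queries above seem to look for t-units in which a clause has main-clause material on both sides, the second one excluding relative clauses. The sketch below applies the same idea to the XML annotation directly. It is an illustration under assumptions: the `<text>` wrapper, the sample sentences and the function name are hypothetical, and only clauses that are direct children of the t-unit are checked.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the first t-unit has a centre-embedded relative clause.
SAMPLE = """<text>
<t-unit>Jenter <clause type="relativ">som leser bøker</clause> liker data.</t-unit>
<t-unit>Politiet vet <clause type="nominal">det er folk under 18</clause></t-unit>
</text>"""


def has_embedded_clause(t_unit, exclude_type=None):
    """True if a top-level clause has t-unit text both before and after it."""
    seen_text = bool((t_unit.text or "").strip())
    for child in t_unit:
        tail = bool((child.tail or "").strip())
        wanted = exclude_type is None or child.get("type") != exclude_type
        if child.tag == "clause" and wanted and seen_text and tail:
            return True
        # Text after this child counts as preceding text for later siblings.
        seen_text = seen_text or tail
    return False


root = ET.fromstring(SAMPLE)
any_type = [has_embedded_clause(tu) for tu in root.iter("t-unit")]
non_relative = [has_embedded_clause(tu, exclude_type="relativ")
                for tu in root.iter("t-unit")]
print(any_type, non_relative)
```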

Corpus searches : lexical distribution

[lemma!='\$.'];

[features=".* verb .*"];
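The two queries above count words (all tokens except punctuation) and verbs, which together give a "verbs per word" measure. A minimal Python sketch of that ratio follows. It assumes tokens are available as dictionaries with `lemma` and `features` fields and that the feature string is padded with spaces, as the pattern `.* verb .*` suggests; the field names and sample tags are hypothetical.

```python
import re

PUNCT_LEMMA = re.compile(r"\$.")            # punctuation lemmas, as in the query
VERB_FEATURES = re.compile(r".* verb .*")   # same pattern as in the query


def verb_frequency(tokens):
    """Verbs per word, ignoring punctuation tokens."""
    words = [t for t in tokens if not PUNCT_LEMMA.fullmatch(t["lemma"])]
    verbs = [t for t in words if VERB_FEATURES.fullmatch(t["features"])]
    return len(verbs) / len(words)


# Hypothetical tagging of "Kvinnfolk driver på data."
tokens = [
    {"lemma": "kvinnfolk", "features": " subst appell "},
    {"lemma": "drive", "features": " verb pres "},
    {"lemma": "på", "features": " prep "},
    {"lemma": "data", "features": " subst appell "},
    {"lemma": "$.", "features": " clb "},
]

freq = verb_frequency(tokens)
print(freq)
```

Here one of the four words is a verb, so the ratio is 0.25; the full stop is left out of the denominator.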

Statistics : Three examples

• Some simple analyses
  – differences of mean
  – correlations

• Classification analysis

• Clustering

Mean & correlation

[Figure: Verb frequency by mode. Box plot of verbs per word for the two modes H and T (hand and keyboard); y-axis from about 0.18 to 0.26.]

[Figure: Verb frequency by mode. Density curves for hand and keyboard; N = 60, bandwidth = 0.006501.]

[Figure: Noun frequency by mode. Box plot of nouns per word for H and T; y-axis from about 0.16 to 0.30.]

[Figure: Noun frequency by mode. Density curves for hand and keyboard; N = 60, bandwidth = 0.01227.]

[Figure: Nouns by pupil. Noun frequency plotted against pupils (x-axis 200 to 320), with the two means marked.]

[Figure: Nouns sorted. Noun frequency per pupil (x-axis 0 to 60), hand and keyboard.]

[Figure: Noun freq by text length. Noun frequency against number of words (200 to 1400), hand and keyboard.]

[Figure: Nouns by text length. The same relation with axis ticks from 5.5 to 7.0 and from -1.8 to -1.2, which is consistent with log-transformed values; hand and keyboard.]

[Figure: Noun frequency by mode. Density curves for hand and keyboard, shown again; N = 60, bandwidth = 0.01227.]

Kolmogorov-Smirnov test: D = 0.283, p = 0.016
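The Kolmogorov-Smirnov statistic D reported above is the largest vertical distance between the empirical cumulative distribution functions of the two samples. A minimal Python sketch of that definition follows; the values are invented for illustration and are not the study's data.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: the largest gap between the two
    empirical cumulative distribution functions."""
    d = 0.0
    for x in list(sample_a) + list(sample_b):
        cdf_a = sum(1 for v in sample_a if v <= x) / len(sample_a)
        cdf_b = sum(1 for v in sample_b if v <= x) / len(sample_b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical noun frequencies per text, one list per writing mode.
hand = [0.20, 0.22, 0.24, 0.26, 0.28]
keyboard = [0.18, 0.19, 0.21, 0.23, 0.25]

d = ks_statistic(hand, keyboard)
print(round(d, 3))
```

The test is sensitive to any difference in distribution shape, not only to a shift in the mean, which is why it complements the t-tests.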

[Figure: Noun frequency by mode. Density curves for hand and keyboard with both means marked; N = 60, bandwidth = 0.01227.]

Welch two-sample t-test: p = 0.007
Student's paired t-test: p = 0.007
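The two tests above differ in what they compare: Welch's test treats the hand-written and keyboarded texts as independent groups, while the paired test works on each pupil's difference between the two modes. The sketch below computes both test statistics with the Python standard library. The data are invented, and the p-values are omitted because they would need a t-distribution function that the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def paired_t(a, b):
    """Paired t statistic: a one-sample t-test on the pairwise differences."""
    diffs = [x - y for x, y in zip(a, b)]
    t = mean(diffs) / sqrt(variance(diffs) / len(diffs))
    return t, len(diffs) - 1


# Hypothetical noun frequencies; position i in both lists is the same pupil.
hand = [0.20, 0.22, 0.24, 0.26, 0.28]
keyboard = [0.19, 0.20, 0.23, 0.24, 0.27]

t_welch, df_welch = welch_t(hand, keyboard)
t_paired, df_paired = paired_t(hand, keyboard)
print(round(t_welch, 2), round(df_welch, 1), round(t_paired, 2), df_paired)
```

In this invented example the paired statistic is much larger than the Welch statistic, because the variation between pupils cancels out when each pupil is compared with themselves.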

[Figure: Nouns sorted. Noun frequency per pupil (x-axis 0 to 60), hand and keyboard.]

Pearson's correlation: r = 0.107, p = 0.416
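Pearson's r can be computed directly from its definition, and its significance test uses the statistic t = r·sqrt((n − 2) / (1 − r²)) with n − 2 degrees of freedom. The following Python sketch shows both steps. The paired values are invented, and n = 60 for the reported r is an assumption taken from the "N = 60" annotation on the density plots.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def t_from_r(r, n):
    """Test statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


# Hypothetical noun frequencies; position i in both lists is the same pupil.
hand = [0.20, 0.22, 0.24, 0.26, 0.28]
keyboard = [0.19, 0.25, 0.21, 0.27, 0.23]

r = pearson_r(hand, keyboard)
t = t_from_r(0.107, 60)  # reported r, assumed n
print(round(r, 3), round(t, 2))
```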

Classification analysis

• Independent variables (parameters)
  – writing mode
    • hand ~ keyboard
  – writing skills
    • medium ~ high
  – gender
  – essay question
• Dependent variables
  – freq of attributive adjectives
  – subclause freq
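The slides do not name the algorithm behind the trees. Assuming an ordinary tree built by recursive partitioning on squared error, the first step is to find the single binary split of the texts that makes the dependent variable most homogeneous within each half. The Python sketch below shows that one step; the rows, variable names and values are hypothetical.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, predictors, response):
    """Find the binary split (predictor == level) with the lowest total SSE."""
    best = None
    for p in predictors:
        for level in sorted({r[p] for r in rows}):
            left = [r[response] for r in rows if r[p] == level]
            right = [r[response] for r in rows if r[p] != level]
            if not left or not right:
                continue  # a split must put texts on both sides
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, p, level, mean(left), mean(right))
    return best


# Hypothetical texts: subclauses per t-unit with four categorical predictors.
rows = [
    {"mode": "hand", "skills": "medium", "gender": "J", "text": "A1", "subcl": 0.8},
    {"mode": "keyboard", "skills": "high", "gender": "G", "text": "A1", "subcl": 0.9},
    {"mode": "hand", "skills": "high", "gender": "G", "text": "A2", "subcl": 1.2},
    {"mode": "keyboard", "skills": "medium", "gender": "J", "text": "A2", "subcl": 1.3},
]

best = best_split(rows, ["mode", "skills", "gender", "text"], "subcl")
print(best[1], best[2], round(best[3], 3), round(best[4], 3))
```

A full tree repeats the same search inside each half until a stopping rule is met; the leaf values are then the group means, as in the tree diagrams on the slides.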

Attributive adjectives per noun

[Figure: Tree diagram for attributive adjectives per noun (TRUE = left branch, marked YES). Split labels as extracted: Mode=H; gender=J (twice); Text=A1; skills=M. Leaf values: 0.1269 (n=30), 0.1429 (n=30), 0.1324 (n=14), 0.1547 (n=16), 0.149 (n=15), 0.1802 (n=15).]

[Figure: Attributive adjectives per noun. Box plot for H and T; y-axis from about 0.05 to 0.25.]

[Figure: Attributive adjectives per noun. Density curves for hand and keyboard with both means marked; N = 60, bandwidth = 0.01271.]

Subclauses per t-unit

[Figure: Tree diagram for subclauses per t-unit (TRUE = left branch, marked YES). Split labels as extracted: Text=A1; Mode=H; skills=M. Leaf values: 0.8547 (n=30), 1.135 (n=30), 1.072 (n=30), 1.217 (n=30).]

[Figure: Subclause freq by mode. Box plot for H and T; y-axis from 0.5 to 1.5.]

[Figure: Subclause freq by text & mode. Box plot for H.A1, T.A1, H.A2 and T.A2; y-axis from 0.5 to 1.5.]

[Figure: Subclause per t-unit. Density curves for T1 Hand, T1 Keyb, T2 Hand and T2 Keyb; N = 30, bandwidth = 0.112.]

Welch two-sample: p = 0.002

Cluster analysis

• About 50 dependent variables
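With about 50 variables per text, a common first step is to reduce them to a few factors and plot the texts on those. The slides do not say which method produced the factors, so the Python sketch below uses a principal component analysis on standardised variables as a stand-in, finding the first component by power iteration. The three-variable data set is invented.

```python
from math import sqrt


def standardise(rows):
    """Column-wise z-scores, so variables on different scales are comparable."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def covariance_matrix(rows):
    """Sample covariance matrix of the columns."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(k)] for i in range(k)]


def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of `cov` by power iteration."""
    k = len(cov)
    vec = [1.0] * k
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(k))
                     for i in range(k))
    return vec, eigenvalue


# Hypothetical texts (rows) by variables (columns): two related measures
# and one unrelated measure.
data = [
    [0.20, 0.8, 0.10],
    [0.22, 0.9, 0.14],
    [0.24, 1.1, 0.09],
    [0.26, 1.2, 0.13],
    [0.28, 1.3, 0.11],
]

cov = covariance_matrix(standardise(data))
vec, eigenvalue = first_component(cov)
share = eigenvalue / len(cov)  # proportion of total variance explained
print([round(v, 2) for v in vec], round(share, 2))
```

After standardising, the total variance equals the number of variables, so the eigenvalue divided by that number gives the kind of percentage shown on the factor axes.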

[Figure: The texts plotted on Factor 1 (21.5 %) against Factor 2 (14.1 %), both axes from -0.3 to 0.3. Each text is marked H or T; the plot is shown with both modes, then with H only, then with T only. Labelled variables: Nominaliseringer (nominalisations), Subkl3, Subkl5, RelCl, CausalAdvCl, KlAdvInnr9, CondAdvCl, KlNom0iForfelt, FrontalRel, infinitives, NOTModalPart, Doubling_SAA.]


Bård Uri Jensen
Hedmark University College (Hamar)
http://www.hihm.no
http://privat.hihm.no/buj
bard.jensen@hihm.no

http://privat.hihm.no/buj/solstrand2010.pdf
