Some statistical methods on syntactic variables in L1 writing
Report from an ongoing study
Bård Uri Jensen, PhD student
UiB / Hedmark University College (Hamar)
Solstrand, 2010-03-26
Contents
• Introducing the project
• The ELEV corpus vs the ASK corpus
• Extracting data
• Analysing data
My doctoral project
• Research question
– Do people tend to make different grammatical choices when they type on a keyboard rather than write by hand?
• Hypotheses
– Higher production speed affects the choices in a "spontaneous" direction
– Skilled writers may utilise the enhanced functionality and shift features in the opposite direction
– Other psychological factors may affect the choices
    • motivational factors
    • social media norms
The ELEV corpus
• A "parallel" corpus of hand-written and keyboarded texts
– Two texts by each pupil
• The ASK corpus system
• Manual syntactic segmentation
– t-units
– clauses
– fragments
• No error tags
<t-unit>Alle mennesker er forskjellige,
</t-unit>
<t-unit>Kvinnfolk driver på data
</t-unit><t-unit>
og gutter leser bøker</t-unit>
<t-unit>Jeg liker å få på ski. Fordi det gir meg bedre kondisjon.
</t-unit>
<t-unit>All humans are different,
</t-unit>
<t-unit>Women use computers
</t-unit><t-unit>
and boys read books</t-unit>
<t-unit>I like cross-country skiing. Because it gives me better stamina.
</t-unit>
<t-unit type="imp">drikk deg full.
</t-unit>
<t-unit type="spm">Er dette en sunn utvikling?
</t-unit>
<t-unit type="imp">get (yourself) drunk.
</t-unit>
<t-unit type="spm">Is this a healthy development?
</t-unit>
<t-unit>Politiet vet <clause type="nominal">
det er folk under 18 <clause type="relativ">
som drikker der,</clause>
</clause> </t-unit>
<t-unit>The police know <clause type="nominal">
there are people under 18
<clause type="relativ">who drink
there,</clause>
</clause> </t-unit>
<frag>Men hva med andre bøker?
</frag>
<t-unit type="frag"> men veit da om flere jenter <clause type="relativ">
som ikke gjør det også!</clause>
</t-unit>
<frag>But what about other books?
</frag>
<t-unit type="frag"> but [I] know about several girls <clause type="relativ">
who don’t do it also!</clause>
</t-unit>
<t-unit type="spm">Er dette en <corr sic="sund">
sunn</corr> utvikling?
</t-unit>
<t-unit type="spm">Is this a <corr sic="helthy">
healthy</corr> development?
</t-unit>
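The segmentation shown in these samples lends itself to simple counting. As a minimal sketch (not the ASK system's own tooling), the Python below parses a small sample written in the slides' markup and counts the subclauses inside each t-unit. The `<corpus>` wrapper element, the sample string, and the function names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the slides' markup; the <corpus> root is an
# assumption added so that the fragment parses as one XML document.
SAMPLE = """<corpus>
<t-unit>All humans are different,</t-unit>
<t-unit>The police know <clause type="nominal">there are people under 18
<clause type="relativ">who drink there,</clause></clause></t-unit>
<frag>But what about other books?</frag>
</corpus>"""


def subclauses_per_t_unit(xml_text):
    """Return one subclause count per <t-unit>, nested clauses included."""
    root = ET.fromstring(xml_text)
    return [sum(1 for _ in tu.iter("clause")) for tu in root.iter("t-unit")]


def mean_subclauses(xml_text):
    """Average number of subclauses per t-unit (0.0 if there are none)."""
    counts = subclauses_per_t_unit(xml_text)
    return sum(counts) / len(counts) if counts else 0.0
```

Fragments (`<frag>`) are left out of the denominator here; whether the study treats them that way is not stated on the slides.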
Corpus searches
[features='.* subst .*'];
<t-unit>[]*</t-unit>;
<t-unit_type="imp">[]*</t-unit>;
<t-unit>[]{5,10}</t-unit>;
<t-unit>([lemma='\$.']*[!lemma='\$.']){5,10}[lemma='\$.']*</t-unit>;
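The last query appears to select t-units of 5 to 10 tokens while skipping tokens whose lemma matches `\$.`. A rough Python equivalent is sketched below, assuming (this is not stated on the slide) that such lemmas mark punctuation and that each t-unit is available as a list of (word, lemma) pairs. The data layout and function names are hypothetical.

```python
import re

# Same pattern as in the query: a dollar sign followed by one character.
# Assumption: lemmas of this shape mark punctuation tokens.
PUNCT_LEMMA = re.compile(r"\$.")


def word_count(t_unit):
    """Count the tokens in a t-unit whose lemma is not punctuation."""
    return sum(1 for _, lemma in t_unit if not PUNCT_LEMMA.fullmatch(lemma))


def t_units_in_range(t_units, lo=5, hi=10):
    """Keep the t-units that contain between lo and hi words, inclusive."""
    return [tu for tu in t_units if lo <= word_count(tu) <= hi]
```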
Corpus searches: frontal subclauses
<t-unit> [features='.* konj .*']?(<clause_type="nominal"> | <clause_type="relativ"> | <clause_type="adverbial">) [];
Corpus searches: embedding
<t-unit>[!clause]+<clause>[]*</clause>[!clause]+</t-unit>;
<t-unit>[!clause]+<clause_type!="relativ">[]*</clause>[!clause]+</t-unit>;
Corpus searches: lexical distribution
[lemma!='\$.'];
[features=".* verb .*"];
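Taken together, these two queries give a denominator (all non-punctuation tokens) and a numerator (tokens tagged as verbs), from which a per-text verb frequency can be derived. The sketch below shows that arithmetic in Python. The token dictionaries and the space-padded feature strings are hypothetical stand-ins for the corpus annotation.

```python
import re

PUNCT_LEMMA = re.compile(r"\$.")        # pattern from [lemma!='\$.']
VERB_FEATURES = re.compile(r".* verb .*")  # pattern from [features=".* verb .*"]


def verb_frequency(tokens):
    """Verbs per word for one text; punctuation is not counted as a word."""
    words = [t for t in tokens if not PUNCT_LEMMA.fullmatch(t["lemma"])]
    verbs = [t for t in words if VERB_FEATURES.fullmatch(t["features"])]
    return len(verbs) / len(words) if words else 0.0
```

For example, a text of four words of which one is tagged as a verb yields a verb frequency of 0.25.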
Statistics: Three examples
• Some simple analyses
– differences of mean
– correlations
• Classification analysis
• Clustering
Mean & correlation
[Figure: plot "Verb frequency by mode"; x-axis: mode (H, T); y-axis: verbs per word, 0.18 to 0.26]
[Figure: density plot "Verb frequency by mode"; x-axis: 0.14 to 0.26; y-axis: density; N = 60, bandwidth = 0.006501; legend: Hand, Keyboard]
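The density plot reports a sample size and a bandwidth, which are the two quantities that define a kernel density estimate. As a minimal sketch, assuming a Gaussian kernel (the slides do not say which kernel was used), the estimate can be written in Python as follows. The function name is hypothetical.

```python
from math import exp, pi, sqrt


def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density of `sample` at a point x.

    Each observation contributes a Gaussian bump of width `bandwidth`;
    the bumps are averaged so that the estimate integrates to 1.
    """
    n = len(sample)
    norm = 1.0 / (n * bandwidth * sqrt(2.0 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)

    return density
```

A smaller bandwidth gives a more jagged curve and a larger one a smoother curve, which is worth keeping in mind when comparing the Hand and Keyboard curves by eye.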
[Figure: plot "Noun frequency by mode"; x-axis: mode (H, T); y-axis: nouns per word, 0.16 to 0.30]
[Figure: density plot "Noun frequency by mode"; x-axis: 0.15 to 0.30; y-axis: density; N = 60, bandwidth = 0.01227; legend: Hand, Keyboard]
[Figure: plot "Nouns by pupil"; x-axis: pupils, 200 to 320; y-axis: noun freq, 0.16 to 0.30; the two means are marked in the second build]
[Figure: plot "nouns sorted"; x-axis: pupils, 0 to 60; y-axis: nouns freq, 0.18 to 0.30; legend: Hand, Keyboard]
[Figure: plot "Noun freq by text length"; x-axis: no. of words, 200 to 1400; y-axis: noun freq, 0.16 to 0.30; the final build is titled "Nouns by length", with x-axis "Text length in words" and legend: Hand, Keyboard]
[Figure: plot "Nouns by text length"; x-axis: text length in words, 5.5 to 7.0; y-axis: noun freq, -1.8 to -1.2 (the tick values suggest log-transformed axes); legend: Hand, Keyboard]
[Figure: density plot "Noun frequency by mode"; N = 60, bandwidth = 0.01227; legend: Hand, Keyboard]
Kolmogorov-Smirnov test: D = 0.283, p = 0.016
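The D statistic of a two-sample Kolmogorov-Smirnov test is the largest vertical distance between the two empirical cumulative distribution functions. The Python sketch below computes only that statistic, not the p-value, and is meant as an illustration of the quantity rather than a reproduction of the analysis behind the slide. The function name is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Because D compares whole distributions, it reacts to differences in spread or shape as well as to differences in mean.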
[Figure: density plot "Noun frequency by mode" with the two means marked; N = 60, bandwidth = 0.01227; legend: Hand, Keyboard]
Welch two-sample t-test: p = 0.007
Student's paired t-test: p = 0.007
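The two tests named here differ in what they treat as the unit of comparison: Welch's test compares two independent groups with possibly unequal variances, while the paired test works on the per-pupil differences. The sketch below computes the test statistic and degrees of freedom for each in plain Python. It stops short of the p-value, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def paired_t(a, b):
    """Paired t statistic and degrees of freedom; a[i] and b[i] belong together."""
    diffs = [x - y for x, y in zip(a, b)]
    t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    return t, len(diffs) - 1
```

Since the corpus contains two texts by each pupil, the paired version matches the design more closely. On the slide both tests happen to report the same p-value.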
[Figure: plot "nouns sorted"; x-axis: pupils, 0 to 60; y-axis: nouns freq, 0.18 to 0.30; legend: Hand, Keyboard]
Pearson's correlation = 0.107, p = 0.416
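Pearson's correlation coefficient is the covariance of two variables divided by the product of their standard deviations. A minimal Python sketch follows. The slide does not state which two variables were correlated, so the paired lists in the usage note are purely hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

With, say, each pupil's hand-written value in `x` and the same pupil's keyboarded value in `y`, a coefficient near 0 would mean that a pupil's position in one mode says little about their position in the other.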
Classification analysis
• Independent variables (parameters)
– writing mode
    • hand ~ keyboard
– writing skills
    • medium ~ high
– gender
– essay question
• Dependent variable
– freq of attributive adjectives
– subclause freq
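A tree-based analysis of this kind repeatedly picks the independent variable whose split best separates the values of the dependent variable. The slides do not name the algorithm, so the sketch below assumes a CART-style criterion (minimising the summed squared error of the two branches) and shows a single split step in Python. The row layout, column names, and function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_binary_split(rows, predictors, target):
    """Return the (predictor, level) whose split minimises within-branch SSE.

    rows: list of dicts; predictors: categorical column names;
    target: numeric column name. Returns None if no split is possible.
    """
    best = None
    for p in predictors:
        for level in sorted({r[p] for r in rows}):
            left = [r[target] for r in rows if r[p] == level]
            right = [r[target] for r in rows if r[p] != level]
            if not left or not right:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, p, level)
    return None if best is None else (best[1], best[2])
```

Applying the same step again to each branch yields a tree in which the variable chosen at the root accounts for the largest share of the variation.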
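Attributive adjectives per noun
[Figure: tree diagram; TRUE = left branch. Root split: Mode=H. Second level: gender=J on both branches. Third level: Text=A1 and skills=M. Leaf values as extracted: 0.1269 (n=30), 0.1429 (n=30), 0.1324 (n=14), 0.1547 (n=16), 0.149 (n=15), 0.1802 (n=15)]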
[Figure: plot "Attributive adjectives per noun"; x-axis: mode (H, T); y-axis: 0.05 to 0.25]
[Figure: density plot "Attributive adjectives per noun" with the two means marked; x-axis: 0.05 to 0.25; N = 60, bandwidth = 0.01271; legend: hand, keyboard]
Subclauses per t-unit
[Figure: tree diagram; TRUE = left branch. Root split: Text=A1. Second level: Mode=H and skills=M. Leaf values as extracted: 0.8547 (n=30), 1.135 (n=30), 1.072 (n=30), 1.217 (n=30)]
[Figure: plot "Subclause freq by mode"; x-axis: mode (H, T); y-axis: 0.5 to 1.5]
[Figure: plot "Subclause freq by text & mode"; x-axis: H.A1, T.A1, H.A2, T.A2; y-axis: 0.5 to 1.5]
[Figure: density plot "Subclause per t-unit"; x-axis: 0.0 to 1.5; N = 30, bandwidth = 0.112; legend: T1 Hand, T1 Keyb, T2 Hand, T2 Keyb]
Welch two-sample t-test: p = 0.002
Cluster analysis
• About 50 dependent variables
[Figure: factor plot; x-axis: Factor 1 (21.5 %), -0.3 to 0.3; y-axis: Factor 2 (14.1 %), -0.3 to 0.3; each text plotted as H or T; labelled variables: Nominaliseringer, Subkl3, Subkl5, RelCl, CausalAdvCl, KlAdvInnr9, CondAdvCl, KlNom0iForfelt, FrontalRel, infinitives, NOTModalPart, Doubling_SAA. Shown in four builds: all texts (twice), H texts only, T texts only]
Bård Uri Jensen
Hedmark University College (Hamar)
http://www.hihm.no
http://privat.hihm.no/[email protected]
http://privat.hihm.no/buj/solstrand2010.pdf