Upload
phammien
View
212
Download
0
Embed Size (px)
Citation preview
Measuring Vocabulary Richness in
Teenage Learners of French
Brian Richards
David Malvern
The University of Reading, UK
Paper presented at the British Educational Research Association Conference, Cardiff University, September 7-10 2000
Address for Correspondence: Dr. B. J. Richards, The University of Reading, School of
Education, Bulmershe Court, Reading, RG6 1HY, UK (email [email protected]).
Abstract
Vocabulary richness, or lexical diversity, is an important measure of how language learners
deploy their active vocabulary. It is traditionally measured by the Type-Token Ratio (TTR),
the ratio of different words to total words used. Unfortunately, TTR is a function of sample
size: larger samples of word tokens will give a lower TTR. This flaw has distorted many
research findings. To overcome this, we have developed an innovative measure of vocabulary
richness, D, based on mathematically modelling how new words are introduced into larger
and larger language samples and have produced software (vocd) to calculate it.
We report the application of the new measure to the study of the GCSE oral
assessment of 34 British teenagers learning French as a foreign language. We investigate how
D relates to a variety of measures of foreign language proficiency and vocabulary richness,
including the scoring of “range of vocabulary” by 24 experienced teachers of French. We
also compare the teachers’ D with that of their students and look for evidence of teachers’
accommodation to students’ proficiency levels.
Results demonstrate firstly the validity of D as a measure of vocabulary richness and
the effectiveness of vocd as a tool to analyse language data in educational research. Secondly,
with regard to discourse accommodation by teachers, it was found that the two teachers did
not finely tune their vocabulary diversity to the proficiency of their pupils. Instead, they
roughly adjusted their language to the perceived ability of the group as a whole
2
Measuring Vocabulary Richness in Teenage
Learners of French1
1. Introduction
Measurements of vocabulary diversity are used in a wide range of educational and linguistic
research including child language development, language impairment, foreign and second
language learning, the development of literacy, authorship studies, forensic linguistics,
stylistics, studies of schizophrenia and many other areas. Many commonly used measures
have been based on the ratio of different words (Types) to the total number of words
(Tokens), known as the Type-Token Ratio (TTR). TTRs are output by a number of widely
available computer programs dedicated to linguistic analysis, such as Systematic Analysis of
Language Transcripts (SALT) (Miller and Chapman, 1993), the Oxford Concordance
Program (Hockey, 1988) and the Computerized Language Analysis (CLAN) programs
(MacWhinney, 1995; MacWhinney and Snow, 1990). Unfortunately such measures,
including mathematical transformations of the TTR such as Root TTR (Guiraud, 1960) and
Corrected TTR (Carroll, 1964), are functions of the number of tokens in the transcript or
language sample - samples containing larger numbers of tokens give lower values for TTR
and vice versa (for a demonstration see Richards, 1987; Richards and Malvern, 1997a;
Tweedie and Baayen, 1998). The reason for this is very simple – as longer and longer
samples of language are produced, more and more of the active vocabulary is likely to be
included and the available pool of new word types that can be introduced steadily diminishes.
It is obvious that once a sample is large enough to have included all of the subject’s active
vocabulary, any further sampling of tokens can only result in an hyperbolic decline in the
3
values for TTR. But, it is also the case that however small the sample is, as more and more
tokens are taken, the likelihood is that (because of repetition of previously included types) the
cumulative number of types will increase at a slower rate than the number of tokens and the
TTR values inevitably fall. This problem has frequently distorted research findings (Richards
and Malvern, 1997b), but unfortunately research is still being published where TTR has been
used without any attempt to control for variation in the size of language samples (see
Tweedie and Baayen, 1998, for examples). Previous attempts to overcome the problem, for
example by standardizing the number of tokens to be analyzed from each subject, have failed
to ensure that measures are comparable across researchers who use different baselines of
tokens, and inevitably waste data in reducing analyses to the size of the smallest sample.
To overcome these problems we developed a computer program called vocd as part
of the project “A new research tool: mathematical modelling in the measurement of vocabulary
diversity” 1. The approach is based on an analysis of the probability of new vocabulary being
introduced into longer and longer samples of speech or writing. This yields a mathematical
model of how TTR varies with token size. By comparing the mathematical model with
empirical data in a transcript, it provides a new measure of vocabulary diversity that we refer
to as D. The measure has three advantages:
it is not a function of the number of words in the sample;
it uses all the data available;
it is more informative because, as opposed to a single value of TTR, it represents how the
TTR varies over a range of token size for each speaker or writer (i.e. it is based on the
TTR versus token curve calculated from data for the transcript as a whole rather than a
particular TTR value on it).
4
D has been shown to be superior to previous measures in both avoiding the inherent flaw in
raw TTR with varying sample sizes and in discriminating across a wide range of language
learners and users (Malvern and Richards, 2000; Richards and Malvern, 1998).
1.1 Origin of the Measure, D
TTRs inevitably fall with increasing size of token sample and consequently any single value
of TTR lacks reliability, as it will depend on the length in words of the language sample used.
A graph of TTR against tokens (N) for a transcript will lie in a curve beginning at the point
(1,1) and falling with a negative gradient which becomes progressively less steep (see
Malvern and Richards, 1997). All language samples will follow this trend, but transcripts
from speakers or writers with high vocabulary diversity will produce curves that lie above
those with low diversity (see Fig. 1). That TTR falls in a predictable way as the token size
increases provides the basis for our approach to finding a valid and reliable measure. Various
analyses have been carried out to model the relationship between growth in types and tokens
and a useful summary of many of these can be found in Tweedie and Baayen (1998). In
particular they investigate various transformations of TTR, including those proposed or cited
by Guiraud (1954), Herdan (1960), Maas (1972), Dugast (1979), Tuldava (1978) and Brunet
(1978); and probabilistic measures based on an assumed frequency distribution of different
types, such as the methods of Yule (1944), Good (1953), Honoré (1979), and Sichel (1975).
The analysis of Tweedie and Baayen (1998) shows that all these measures have problems
with very large sample sizes. This is perhaps not surprising. As has been pointed out above,
once the active vocabulary is exhausted the total number of types remains constant, thereafter
the type-token characteristic curve falls hyperbolically and all these models no longer apply
exactly. Very large samples, and Tweedie and Baayen used samples of the order 26,500 to
120,000 tokens, will either reach this state or, for a large proportion of the sample,
5
approximate to it as the late introduction of the occasional rare type will make small
difference compared to the token count.
In many educational and developmental applications, however, much smaller
language samples are typically used and in some cases such as early language development,
language delay, early writing development or writing in a foreign language, often only very
small samples are available. The method developed for vocd, then, builds on previous
theoretical analyses, notably by Brainerd (1982) and in particular Sichel (1986), which model
the TTR versus token curve mathematically with reasonable accuracy for relatively small
language samples of up to a few thousand tokens. Unfortunately neither Brainerd’s nor
Sichel’s model is mathematically simple and both need two parameters to describe an
individual’s lexical diversity.
We developed and investigated various probabilistic models in order to arrive at a
model whose characteristics yield a valid measure of vocabulary diversity by containing only
one parameter which increases with increasing diversity, and falls into a range suitable for
discriminating among the range of transcripts found in various language studies. The model
chosen is derived from a simplification of Sichel’s (1986) type-token characteristic curve and
is in the form of the following equation, where N is the number of tokens and D a parameter:
This equation yields a family of curves (Fig. 1) all of the same general and appropriate shape,
with different values for the parameter D distinguishing different members of this family (see
also Malvern and Richards, 1997). In the model D itself is used directly as an index of lexical
diversity with predicted values of the order 101 to 102.
In order to calculate D from a transcript, the vocd program first plots the empirical
TTR versus tokens curve for the speaker. It derives each point on the curve from an average
6
of 100 trials on subsamples of words of the token size for that point. The subsamples are
made up of words randomly chosen (without replacement) from throughout the transcript.
The program then finds the best fit between the theoretical model and the empirical data by a
curve-fitting procedure which adjusts the value of the parameter (D) in the equation until a
match is obtained between the actual curve for the transcript and the closest member of the
family of curves represented by the mathematical model. This value of the parameter for best
fit is the index of lexical diversity. High values of D reflect a high level of lexical diversity
and lower diversity produces lower values of D. It should be emphasized that, being based on
a probabilistic model which predicts expected values, the calculating processes are stochastic,
and averaging is intrinsic at all stages.
1.2 Calculation of D
The program vocd is written in ‘C’ and exists in UNIX, PC and Macintosh versions. It
operates on ASCII files of transcripts set out and coded according to the Codes for the
Human Analysis of Transcripts (CHAT) system developed by Brian MacWhinney as part of
the Child Language Data Exchange System (CHILDES). The goal of the CHILDES project is
to facilitate the study of child language development through a common transcription format,
the sharing of transcript data and the provision of computerized tools for analysis
(MacWhinney, 1995; MacWhinney and Snow, 1990). The text-handling features of vocd
have been modelled on MacWhinney’s Computerized Language Analysis (CLAN) programs,
particularly a program called freq whose function is to make frequency counts of words or
codes. Freq and vocd are compatible with regard to the items they include in word counts by
default. As with freq, however, the use of various switches and exclude files, in combination
with the careful use of CHAT transcription conventions, enables linguistic items to be
selected or rejected in a way that corresponds with researchers’ theoretical perspective on
7
what should count as a word and what should count as a different word. A description of the
text-handling options is given in Section 1.3 below.
In calculating D, vocd uses random sampling (without replacement) of tokens in
plotting the curve of TTR against increasing token size for the transcript under investigation.
Random sampling has two advantages over sequential sampling. Firstly, it matches the
assumptions underlying the probabilistic model. Secondly, unlike sequential sampling, it
avoids the problem of the curve being distorted by the clustering of the same vocabulary
items at particular points in the transcript.
In practice each empirical point on the curve is calculated from averaging the TTRs of
100 trials on subsamples consisting of the number of tokens for that point, drawn at random
from throughout the transcripts. This default number was found by experimentation and
balanced the wish to have as many trials as possible with the desire for the program to run
reasonably quickly. The run time has not been reduced at the expense of reliability, however,
as it was found that taking 100 trials for each point on the curve produced consistency in the
values output for D without unacceptable delays.
Which part of the curve is used to calculate D is crucial. Firstly, the final value of N
(the number of tokens in a subsample) cannot be larger than the transcript itself and, in order
to have subsamples to average for the final point on the curve, needs to be smaller. Moreover,
although the transcripts and texts to be analyzed will vary hugely in total token count, the
range chosen for the users of vocd must be applicable to investigations such as child language
studies, where typically transcripts may contain only one hundred to a few hundred tokens in
total. Secondly, the equation is derived by approximation from Sichel’s (1986) model and the
approximations apply with greater accuracy at lower numbers of tokens (and note that
Tweedie and Baayen (1998) demonstrated that all the models they investigated, including
Sichel’s, fail for large token sizes). We conducted extensive trials calculating D over different
8
parts of the curve to find a portion for which the approximation held good and averaging
worked well. As a result of these trials the default is for the curve to be drawn and fitted for
N=35 to N=50 tokens in steps of 1 token. We should emphasize that, as noted above, each of
these points is calculated from averaging 100 subsamples, each drawn from the whole of the
transcript, so that although only a relatively small part of the curve is fitted, it uses all the
information available in the transcript. This also has the advantage of calculating D from a
standard part of the curve for all transcripts regardless of their total size, further providing for
reliable comparisons between subjects and between the work of different researchers.
The procedure then depends on finding the best fit between the empirical and
theoretically derived curves by the least square difference method. Extensive testing
confirmed that the best-fit procedure was valid and was reliably finding a unique minimum at
the least square difference.
Finally, as the points on the curve are averages of random samples, a slightly different
value of D is to be expected each time the program is run. Tests showed that with the chosen
defaults these differences are relatively small, but consistency was improved by vocd
calculating D three times by default and giving the average value as output (see Fig. 2).
As noted above, the software plots the TTR versus token curve from 35 tokens to 50
tokens and each point on the curve is produced by random sampling without replacement.
The software therefore requires a minimum of 50 tokens to operate. We should point out,
however, that the matter of minimum sample size needs further investigation: the fact that the
software will satisfactorily output a value of D from a sample as small as 50 tokens does not
guarantee that values obtained from such small samples will be reliable.
9
1.3 Text-handling Options
The software can be used to investigate lexical diversity in single or multiple files of speech
or writing provided that they are transcribed accurately in legal CHAT format. This can be
ascertained by running them through the CHILDES check program (MacWhinney, 1995)
which will also help to track down typographical and spelling errors, and by using freq (see
above) to create a complete word list that can be scanned for further errors.
Various switches and options can be entered into a UNIX-style command line in order
to select text for analysis and to determine what is counted as a word and a different word.
Examples of these functions are listed below.
Specification of speaker.
Morphemicization. By default, vocd treats inflected words (transcribed as “go”, “go-es”,
“go-ing”) as different word types. This switch enables them to be treated as tokens of a
single word type.
Inclusion or exclusion. Single words, a list of words in a separate ASCII file, or whole
utterances whose codings are selected by the user can be excluded or included in the
analysis. Exclude files contain a list of items to be omitted (for example, hesitation
markers, laughter, etc.). Include files consist of only those words one wishes to be
included. This would be useful, for example if a lexical diversity value for just closed-
class items were required.
Split-half reliability. This switch allows separate analysis of odd and even words in the
transcript. Results can then be fed into a split-half reliability analysis.
Include capitalized words only in the analysis.
Exclude retracing. This switch will remove self-repetitions and self-corrections.
Upper case/lower case. By default, vocd would treat “may” (modal verb) and “May” (the
month) as the same word type. This switch allows them to be counted as different types.
10
Non-completed words. Words frequently occur in a shortened form such as “till” for
“until” or “cos” for “because”. Sometimes a speaker will alternate between the full and
shortened form, which might distort analyses if vocd treated them as different word types.
There are three options for dealing with, for example, a word transcribed as “(be)cause”:
it can be processed as “because” (the default), as “(be)cause”, or as “cause”.
Text replacement. In the utterance “it goed [: went] in there” a child’s “goed” is followed
by the adult form in square brackets. By default preceding text is replaced by the text in
square brackets, so the form processed by would be “went”. This allows alternative
realizations of a word (e.g. yes, yeah, yep) to be treated as the same word type. This
facility can also be overridden so that “goed” would be returned rather than “went”.
1.4 Validation
The validity of D has been the subject of extensive investigation (Malvern and
Richards, 1997; Richards and Malvern, 1997a; Richards and Malvern, 1998; Malvern and
Richards, 2000) on samples of child language, children with Specific Language Impairment
(SLI), children’s narratives, adult learners of English as a second language, and academic
writing. In these validation trials, the empirical TTR versus token curves for a total of 128
transcripts from four corpora covering ages from 24 months to adult, three languages and a
variety of settings, all fitted the model. The model produced consistent values for D which,
unlike TTR and even Mean Segmental TTR (the average TTR for segments of a given token
size), correlated well with other well-validated measures of language (see Richards and
Malvern, 1997a: pp. 35-38 for an overview of Mean Segmental TTR). These five corpora
also provide useful indications of the scale for D.
We now turn our attention to the relationship between D and a number of measures of
foreign language proficiency in a sample of 34 GCSE candidates. We will also compare the
11
teachers’ D with that of the pupils and look for evidence of teachers’ accommodation to
students’ proficiency levels.
2. Method
We obtained the tape recordings of 34 state secondary school students taking their oral
examination in French for the General Certificate of Secondary Education, a national
examination taken by students in the United Kingdom at the age of 16. Oral interviews
(officially described as “free conversation”), which averaged just over five minutes in length,
had been conducted by two non-native, but experienced, teachers of French, each of whom
examined the students in his or her own French class. The interviews were transcribed in
CHAT format (MacWhinney 1995) by a native French speaker (Francine Chambers) who
was an experienced teacher of French to English-speaking children of this age, and were re-
transcribed by the first author of this article (Brian Richards). Discrepancies were resolved
with the assistance of a near-native speaker of French who was also an experienced teacher of
French as a foreign language in the UK (Mair Richards). Where discrepancies could not be
resolved, the utterance or word was coded as “unintelligible” and omitted from the analysis.
CHAT format enabled computer-assisted analyses to be carried out by the CLAN programs
of the CHILDES project (MacWhinney & Snow 1990)2.
2.1 The Students
The students had been learning French for five years, receiving four 35-minute lessons a
week. In theory, their ability should have been above average because they had all been
entered for the “higher level” examination, designed to extend more able pupils. In practice,
it became clear that some students had been inappropriately entered for the higher
examination, and proficiency varies in the sample from those who made very little
12
contribution to the conversation to one student who performed at a level comparable to
native-speakers. This student obtained extreme values on some measures and had to be
omitted from some of the analyses reported below. Twelve pupils were interviewed by
Teacher A and 22 by Teacher B.
2.2 Student Variables
Student variables are listed in Table 1. These are of three types. Firstly there are objective
measures taken from the transcripts with the assistance of the CLAN software (MacWhinney
1995). Of these, the vocabulary diversity measure, MSTTR-30, needs some explanation. The
mean segmental type-token ratio (MSTTR) has a long history as a measure of range of
vocabulary (see Johnson 1944) and, unlike many other measures of vocabulary diversity
(Malvern & Richards 1997), is not affected by variations in sample size (i.e. the number of
words produced by each speaker). In this case we divided each transcript into 30-word
segments of student speech and calculated the average number of different words contained
in each segment.
The second type of student measure comes from the final results of the GCSE
examination. The examining group converted the scores on each of the four skills to a mark
out of seven, and it is these that are reported here.
Thirdly, six further measures were obtained from the mean ratings of the tape
recordings by 24 experienced teachers of French. Range of vocabulary, fluency and
complexity of structure were rated on eight-point scales (0-7) and content, accuracy and
pronunciation on four-point scales (0-3). Full details of the scales and the procedures used
can be found in Richards & Chambers (1996).
Finally, using the vocd software, values of D were obtained for both the teachers and
the students. Seven students produced less than 50 word tokens in their oral examination and
13
for these no D values could be calculated. Sample sizes for the analyses are therefore either
27 or 26 depending on whether the extreme case was included. The extreme case was
included only when non-parametric statistics were used.
3. Results
3.1 The Students
It had been predicted that, as a valid measure of vocabulary diversity which is independent of
the quantity of language produced, D would correlate most strongly with the other vocabulary
measures except for TTR which is invalid when sizes of language sample vary. In Table 1 it
can be seen that this is indeed the case: the correlation with MSTTR-30 stands apart even
from other significant correlations as the most powerful in the set, the other significant
relationships being with number of different words, the oral score and fluency. Clearly then,
D is sensitive to general language proficiency but at a relatively low level. As expected, there
is no correlation with TTR.
A more surprising result is the lack of any correlation with teachers’ ratings of Range
of Vocabulary. This is, however, more understandable when the full matrix of inter-
correlations is inspected. Range of Vocabulary correlates very highly indeed (always over .9)
with the other scales used by the 24 teachers of French. The highest figure was with Content
at .996. Unlike with the more objective measures, therefore, it seems that the teachers were
unable to discriminate between vocabulary deployment and other areas of proficiency. The
rating of Range of Vocabulary is likely to be heavily contaminated by halo effects.
3.2 The Relationship between Teacher D and Student Measures
The aim of this analysis was to assess whether the teachers’ deployment of vocabulary was
finely tuned to the language proficiency of the students. A positive correlation between
14
teachers’ D and student variables would be indicative of conscious or sub-conscious
accommodation strategies.
A comparison between the average D for teachers and students is revealing. Even
with the extreme case excluded, and confining the analysis to the remaining 26 teacher and
student scores for whom D could be calculated, the mean for the students (55.1) is higher
than for the teachers (46.7). At the same time the range (29.6-77.8) and standard deviation
(13.8) for the students are also higher than for the teachers (29.9-63.9 and 7.5). A Pearson
correlation between the two sets of Ds is not significant ( r = .14; df = 24; ns). Correlations
computed for each teacher separately are also non-significant.
By contrast, teachers’ Ds do enter into significant, positive correlations with 12 out of
14 measures of the students language. These are shown in Table 2. At first sight it would
appear, therefore, that the teachers are using greater lexical diversity with pupils whose
language is more proficient, thus engaging in the kind of discourse accommodation, or even
fine-tuning which is considered by some (e.g. Ross & Berwick, 1992) to be essential to a
valid oral interview.
The above interpretation would only be correct, however, if each teacher could be
shown to be behaving in the way suggested by their pooled data. Separate correlations for the
Ds of Teacher A and Teacher B with the language measures of their pupils show that this is
far from being the case. These two sets of correlations are shown in Tables 3 and 4. Teacher
A shows no evidence of such discourse accommodation other than a positive correlation with
the length of student turn. The remaining correlations hover around zero or tend towards the
negative (GCSE points). Accommodation to students’ intelligibility appears to be the reverse
of what would be expected: the teacher produces greater vocabulary diversity in response to
more unintelligible words from the students. The correlations for Teacher B overwhelmingly
15
tend towards the negative, and are all statistically significant where the ratings of the 24
teacher of French are concerned.
How is it possible for there to be a positive correlation for pooled data and yet a
strong tendency towards a negative relationship when the analysis is performed for each
teacher separately? The answer appears to lie in the difference in ability between the two
classes. This can be demonstrated by concentrating on the strongest correlation in the pooled
data, that between Teacher D and students’ GCSE points. Figure 3 shows a scatterplot of this
correlation in which the two teachers are indicated separately. From this it can be seen that
there is a strong tendency for Teacher A and her pupils to score lower both on D and in the
GCSE French examination than is the case for Teacher B. In fact the mean number of GCSE
points for Teacher A’s pupils is 13.3 compared with 20.3 for Teacher B. This difference is
statistically significant (t = -7.31; df = 31; p = .000). Similarly, the average D for Teacher A
is 35.4 compared with 49.2 for teacher B. This difference is also significant (t = -6.57; df =
31; p = .000). This separation between the two groups can be seen even more starkly in
Figure 4 which shows the mean D plus and minus two standard deviations plotted against the
pupils’ rank order on GCSE points. There is little overlap between the groups either on their
GCSE results or the Ds of their teachers.
4. Conclusion
The findings reported above provide further evidence of the validity of mathematical
modelling the relationship between TTR and token size to assess vocabulary diversity. As
predicted, D correlated with another measure of vocabulary diversity, MSTTR-30 rather than
with measures of general language proficiency. As expected, there was no correlation
between D and TTR. One question raised is why D should be preferred to MSTTR as a
measure of lexical diversity. After all, MSTTR, because it averages TTRs from fixed-size
16
segments of the language sample, also overcomes the problem of variation in sample size.
However, there are several problems with MSTTR. The first is that different researchers use
different sized segments, meaning that their results cannot be compared. Secondly, splitting
the language sample into segments and calculating TTRs from those is less reliable than the
random sampling carried out by vocd. D is based on the lexical diversity of the whole sample
rather than isolated segments and takes account of the whole TTR versus Token curve rather
than a single point on it. MSTTR fails to capture the repetition of words between segments.
The second aim of this research was to investigate whether variation in teachers’
vocabulary diversity was itself a form of accommodation to the linguistic proficiency of their
students. At first sight this seemed to be the case—there was a significant correlation between
the Ds of the teachers and a wide range of language measures for the pupils. Closer
investigation, however, showed that this overall effect was not replicated for either teacher
when their data were analysed separately. It has been brought about by a significant
difference in the ability of each class which corresponded with a significant difference in the
average D for each teacher. What appears to be happening is that, while the language of each
teacher is not finely tuned to the ability of the pupils, they are roughly pitching their language
to their perception of the ability of the class as a whole. It will be recalled that there was far
more variation in the D values for the pupils than for the teachers. There is a tendency
therefore, at least in the context of conducting a public examination, for each teacher to
provide an approximately standard level of language across all pupils which is judged to be
appropriate for the group as a whole.
17
Notes
1. The vocd program was written by Gerard McKee of the Department of Computer
Science, The University of Reading. The research project “A new research tool:
mathematical modelling in the measurement of vocabulary diversity” was funded by
grants awarded to Brian Richards and David Malvern from the Research Endowment
Trust Fund of The University of Reading and by the Economic and Social Research
Council, UK (no. R000221995). We would also like to acknowledge the help and
advice of Raymond Knott who acted as consultant to the project.
2. The 34 transcripts are available from the CHILDES data base on the World-Wide
Web. This can be found at http://childes.psy.cmu.edu/database.html/ where
“bilingual corpora”, then “Reading” should be selected.
18
References
Brainerd, B. (1982). The Type-token Relation in the Works of S. Kierkegaard. In R. W. Bailey
(ed), Computing in the Humanities. North-Holland, Amsterdam, pp. 97-109.
Brunet, E. (1978). Vocabulaire de Jean Giraudoux: Structure et Evolution. Slatkine, Geneva.
Carroll, J. B. (1964). Language and Thought. Prentice-Hall, Englewood Cliffs, New Jersey.
Dugast, D. (1979). Vocabulaire et Stylistique. I Théâtre et Dialogue. Travaux de Linguistique
Quantitative. Slatkine-Champion, Geneva.
Good, I. J. (1953). The Population Frequencies of Species and the Estimation of Population
Parameters. Biometrika 40: 237-264.
Guiraud, H. (1954). Les Caractères Statistiques du Vocabulaire. Presses Universitaires de
France, Paris.
Guiraud, P. (1960). Problèmes et Méthodes de la Statistique Linguistique. D. Reidel,
Dordrecht.
Herdan, G. (1960). Quantitative Linguistics. Butterworth, London.
Hockey, S. (1988). Micro-OCP. Oxford University Press, Oxford.
Honoré (1979). Simple Measures of Richness of Vocabulary. Association for Literacy and
Linguistics Computing Bulletin 7(2): 172-179.
Johnson, W. (1944). Studies in language behavior: I. A program of research. Psychological
Monographs 56, 1-15.
Maas, H.-D. (1972). Zusammenhang zwischen Wortschatzumfang und Länge Eines Textes.
Zeitschrift für Literaturwissenschaft und Linguistik 8: 73-79.
MacWhinney, B. (1995). The CHILDES Project: Tools for Analyzing Talk (2nd ed). Erlbaum,
Hillsdale, New Jersey.
MacWhinney, B. and Snow, C. E. (1990). The Child Language Data Exchange System: an
Update. Journal of Child Language, 17: 457-72.
19
Malvern, D. D., and Richards, B. J. (1997). A new measure of lexical diversity. In A. Ryan and
A. Wray (eds), Evolving models of language. Multilingual Matters, Clevedon, pp. 58-
71.
Malvern, D. D., and Richards, B. J. (2000). Validation of a New Measure of Lexical
Diversity. In M. Beers, B. v. d. Bogaerde, G. Bol, J. de Jong and C. Rooijmans (eds),
From Sound to Sentence: Studies on First Language Acquisition. Centre for Language
and Cognition, Groningen, pp. 81-96
Miller, J. F. and Chapman, R. (1993). SALT: Systematic Analysis of Language Transcripts,
Version 3.0. University Park Press, Baltimore.
Richards, B. J. (1987). Type/Token Ratios: what do they Really Tell us? Journal of Child
Language 14: 201-9.
Richards, B. J. & Chambers, F. (1996). Reliability and validity in the GCSE oral
examination. Language Learning Journal 14, 28-34.
Richards, B. J., and Malvern, D. D. (1997a). Quantifying Lexical Diversity in the Study of
Language Development. The University of Reading New Bulmershe Papers, Reading.
Richards, B. J., and Malvern, D. D. (1997b). Type-Token and Type-Type Measures of
Vocabulary Diversity and Lexical Style: an Annotated Bibliography. Faculty of
Education and Community Studies, The University of Reading, Reading. (Also
available on the World Wide Web at: http://www.rdg.ac.uk/~ehsrichb/)
Richards, B. J., and Malvern, D. D, (1998). A New Research Tool: Mathematical Modelling in
the Measurement of Vocabulary Diversity (Award reference no. R000221995). Final
Report to the Economic and Social Research Council, Swindon, UK.
Ross, S. & Berwick, R. (1992). The discourse of accommodation in oral proficiency
interviews. Studies in Second Language Acquisition 14, 159-76.
20
Sichel, H.S. (1975). On a Distributive Law for Word Frequencies. Journal of the American
Statistical Association 70: 542-547.
Sichel, H. S. (1986). Word frequency distributions and type-token characteristics.
Mathematical Scientist, 11: 45-72.
Tuldava, J. (1978). Quantitative Relations between the Size of the Text and the Size of
Vocabulary. SMIL Quarterly, Journal of Linguistic Calculus 4: 28-35.
Tweedie F.J. and Baayen, R.H. (1998). How Variable may a Constant be? Measures of
Lexical Richness in Perspective. Computers and the Humanities 32: 323-352.
Yule, G.U. (1944). The Statistical study of Literary Vocabulary. Cambridge University Press,
Cambridge.
21
Figure 1: Family of curves showing increasing diversity with increasing values of D.
22
Figure 2: Flow chart for producing values of D
User selects options for text handling
vocd reads text and selects and excludes
parts of text as specified by the user
vocd outputs a sequential list of utterances
containing only the word tokens to be
entered into the analysis
vocd calculates average TTRs for 100
random samples of a given token size from
35-50 tokens, producing a sequence of
points on a TTR versus token curve
3x
vocd adjusts the parameter D until the best
fit between the TTR versus token curve
and the theoretical model is found
vocd outputs the value of D which provides
the best fit, and the residuals
The average of 3 Ds is output
23
24
25
0 10 20 300
20
40
60
80
D Teach BD Teach A
Figure 3: Scatterplot of D values for each teacher
GCSE Points
D Te
ache
rs
against pupils' GCSE points
26
0 10 20 30 400
20
40
60
80
Figure 4: D for each teacher against students in order
Students in order of GCSE points
D Te
ache
rs
of their GCSE points, showing mean D and the interval of plus and minus 2 standard deviations for each teacher
Teacher A
Teacher B
Table 1: Rank order correlations between student D and other student measures (N=27)
Student Measure r
Measures Taken from Transcripts
Number of Words .18
Number of Different Words .35 *
MSTTR-30 .59 **
Mean Length of Utterance (MLU words) .23
Utterances per Turn (MLT) .09
Percentage unintelligible words .02
Words per Minute .23
Type-Token Ratio .20
GCSE Examination Results
Score for Oral Examination (out of 7) .34 *
GCSE points .31
Mean Ratings from 24 Teachers of French
Range of Vocabulary .31
Fluency .33 *
Complexity of Structure .31
Content .30
Accuracy .31
Pronunciation .23
* p<.05 ** p<.01
27
Table 2: Pearson product moment correlations between Teacher D and measures of
students’ language (N=33)
Student Measure r
Measures Taken from Transcripts
Number of Words .41 **
MSTTR-30 .07
Mean Length of Utterance (MLU words) .43 **
Utterances per Turn (MLT) .35 *
Percentage unintelligible words -.02
Words per Minute .49 **
GCSE Examination Results
Score for Oral Examination (out of 7) .43 **
GCSE points .52 **
Mean Ratings from 24 Teachers of French
Range of Vocabulary .40 *
Fluency .40 *
Complexity of Structure .35 *
Content .44 **
Accuracy .42 **
Pronunciation .33 *
* p<.05 ** p<.01
28
Table 3: Pearson product moment correlations between the Ds for Teacher A and
measures of students’ language (N=12)
Student Measure r
Measures Taken from Transcripts
Number of Words -.03
MSTTR-30 .43
Mean Length of Utterance (MLU words) -.19
Utterances per Turn (MLT) .62 *
Percentage unintelligible words .50 *
Words per Minute .07
GCSE Examination Results
Score for Oral Examination (out of 7) -.22
GCSE points -.44
Mean Ratings from 24 Teachers of French
Range of Vocabulary -.01
Fluency -.03
Complexity of Structure -.05
Content .01
Accuracy -.03
Pronunciation -.08
* p<.05 ** p<.01
29
Table 4: Pearson product moment correlations between the Ds for Teacher B and
measures of students’ language (N=21)
Student Measure r
Measures Taken from Transcripts
Number of Words -.10
MSTTR-30 .14
Mean Length of Utterance (MLU words) -.13
Utterances per Turn (MLT) -.12
Percentage unintelligible words -.03
Words per Minute -.23
GCSE Examination Results
Score for Oral Examination (out of 7) -.27
GCSE points -.10
Mean Ratings from 24 Teachers of French
Range of Vocabulary -.39 *
Fluency -.44 *
Complexity of Structure -.41 *
Content -.38 *
Accuracy -.42 *
Pronunciation -.40 *
* p<.05 ** p<.01
30