VOCD: SOFTWARE FOR MEASURING VOCABULARY … · Web viewThe text-handling features of vocd have been modelled on MacWhinney’s Computerized Language Analysis (CLAN) programs, particularly

Measuring Vocabulary Richness in

Teenage Learners of French

Brian Richards

David Malvern

The University of Reading, UK

Paper presented at the British Educational Research Association Conference, Cardiff University, September 7-10 2000

Address for Correspondence: Dr. B. J. Richards, The University of Reading, School of

Education, Bulmershe Court, Reading, RG6 1HY, UK (email [email protected]).

Abstract

Vocabulary richness, or lexical diversity, is an important measure of how language learners

deploy their active vocabulary. It is traditionally measured by the Type-Token Ratio (TTR),

the ratio of different words to total words used. Unfortunately, TTR is a function of sample

size: larger samples of word tokens will give a lower TTR. This flaw has distorted many

research findings. To overcome this, we have developed an innovative measure of vocabulary

richness, D, based on mathematically modelling how new words are introduced into larger

and larger language samples and have produced software (vocd) to calculate it.

We report the application of the new measure to the study of the GCSE oral

assessment of 34 British teenagers learning French as a foreign language. We investigate how

D relates to a variety of measures of foreign language proficiency and vocabulary richness,

including the scoring of “range of vocabulary” by 24 experienced teachers of French. We

also compare the teachers’ D with that of their students and look for evidence of teachers’

accommodation to students’ proficiency levels.

Results demonstrate firstly the validity of D as a measure of vocabulary richness and

the effectiveness of vocd as a tool to analyse language data in educational research. Secondly,

with regard to discourse accommodation by teachers, it was found that the two teachers did

not finely tune their vocabulary diversity to the proficiency of their pupils. Instead, they

roughly adjusted their language to the perceived ability of the group as a whole

2

Measuring Vocabulary Richness in Teenage

Learners of French1

1. Introduction

Measurements of vocabulary diversity are used in a wide range of educational and linguistic

research including child language development, language impairment, foreign and second

language learning, the development of literacy, authorship studies, forensic linguistics,

stylistics, studies of schizophrenia and many other areas. Many commonly used measures

have been based on the ratio of different words (Types) to the total number of words

(Tokens), known as the Type-Token Ratio (TTR). TTRs are output by a number of widely

available computer programs dedicated to linguistic analysis, such as Systematic Analysis of

Language Transcripts (SALT) (Miller and Chapman, 1993), the Oxford Concordance

Program (Hockey, 1988) and the Computerized Language Analysis (CLAN) programs

(MacWhinney, 1995; MacWhinney and Snow, 1990). Unfortunately such measures,

including mathematical transformations of the TTR such as Root TTR (Guiraud, 1960) and

Corrected TTR (Carroll, 1964), are functions of the number of tokens in the transcript or

language sample - samples containing larger numbers of tokens give lower values for TTR

and vice versa (for a demonstration see Richards, 1987; Richards and Malvern, 1997a;

Tweedie and Baayen, 1998). The reason for this is very simple – as longer and longer

samples of language are produced, more and more of the active vocabulary is likely to be

included and the available pool of new word types that can be introduced steadily diminishes.

It is obvious that once a sample is large enough to have included all of the subject’s active

vocabulary, any further sampling of tokens can only result in an hyperbolic decline in the

3

values for TTR. But, it is also the case that however small the sample is, as more and more

tokens are taken, the likelihood is that (because of repetition of previously included types) the

cumulative number of types will increase at a slower rate than the number of tokens and the

TTR values inevitably fall. This problem has frequently distorted research findings (Richards

and Malvern, 1997b), but unfortunately research is still being published where TTR has been

used without any attempt to control for variation in the size of language samples (see

Tweedie and Baayen, 1998, for examples). Previous attempts to overcome the problem, for

example by standardizing the number of tokens to be analyzed from each subject, have failed

to ensure that measures are comparable across researchers who use different baselines of

tokens, and inevitably waste data in reducing analyses to the size of the smallest sample.

To overcome these problems we developed a computer program called vocd as part

of the project “A new research tool: mathematical modelling in the measurement of vocabulary

diversity” 1. The approach is based on an analysis of the probability of new vocabulary being

introduced into longer and longer samples of speech or writing. This yields a mathematical

model of how TTR varies with token size. By comparing the mathematical model with

empirical data in a transcript, it provides a new measure of vocabulary diversity that we refer

to as D. The measure has three advantages:

it is not a function of the number of words in the sample;

it uses all the data available;

it is more informative because, as opposed to a single value of TTR, it represents how the

TTR varies over a range of token size for each speaker or writer (i.e. it is based on the

TTR versus token curve calculated from data for the transcript as a whole rather than a

particular TTR value on it).

4

D has been shown to be superior to previous measures in both avoiding the inherent flaw in

raw TTR with varying sample sizes and in discriminating across a wide range of language

learners and users (Malvern and Richards, 2000; Richards and Malvern, 1998).

1.1 Origin of the Measure, D

TTRs inevitably fall with increasing size of token sample and consequently any single value

of TTR lacks reliability, as it will depend on the length in words of the language sample used.

A graph of TTR against tokens (N) for a transcript will lie in a curve beginning at the point

(1,1) and falling with a negative gradient which becomes progressively less steep (see

Malvern and Richards, 1997). All language samples will follow this trend, but transcripts

from speakers or writers with high vocabulary diversity will produce curves that lie above

those with low diversity (see Fig. 1). That TTR falls in a predictable way as the token size

increases provides the basis for our approach to finding a valid and reliable measure. Various

analyses have been carried out to model the relationship between growth in types and tokens

and a useful summary of many of these can be found in Tweedie and Baayen (1998). In

particular they investigate various transformations of TTR, including those proposed or cited

by Guiraud (1954), Herdan (1960), Maas (1972), Dugast (1979), Tuldava (1978) and Brunet

(1978); and probabilistic measures based on an assumed frequency distribution of different

types, such as the methods of Yule (1944), Good (1953), Honoré (1979), and Sichel (1975).

The analysis of Tweedie and Baayen (1998) shows that all these measures have problems

with very large sample sizes. This is perhaps not surprising. As has been pointed out above,

once the active vocabulary is exhausted the total number of types remains constant, thereafter

the type-token characteristic curve falls hyperbolically and all these models no longer apply

exactly. Very large samples, and Tweedie and Baayen used samples of the order 26,500 to

120,000 tokens, will either reach this state or, for a large proportion of the sample,

5

approximate to it as the late introduction of the occasional rare type will make small

difference compared to the token count.

In many educational and developmental applications, however, much smaller

language samples are typically used and in some cases such as early language development,

language delay, early writing development or writing in a foreign language, often only very

small samples are available. The method developed for vocd, then, builds on previous

theoretical analyses, notably by Brainerd (1982) and in particular Sichel (1986), which model

the TTR versus token curve mathematically with reasonable accuracy for relatively small

language samples of up to a few thousand tokens. Unfortunately neither Brainerd’s nor

Sichel’s model is mathematically simple and both need two parameters to describe an

individual’s lexical diversity.

We developed and investigated various probabilistic models in order to arrive at a

model whose characteristics yield a valid measure of vocabulary diversity by containing only

one parameter which increases with increasing diversity, and falls into a range suitable for

discriminating among the range of transcripts found in various language studies. The model

chosen is derived from a simplification of Sichel’s (1986) type-token characteristic curve and

is in the form of the following equation, where N is the number of tokens and D a parameter:

This equation yields a family of curves (Fig. 1) all of the same general and appropriate shape,

with different values for the parameter D distinguishing different members of this family (see

also Malvern and Richards, 1997). In the model D itself is used directly as an index of lexical

diversity with predicted values of the order 101 to 102.

In order to calculate D from a transcript, the vocd program first plots the empirical

TTR versus tokens curve for the speaker. It derives each point on the curve from an average

6

of 100 trials on subsamples of words of the token size for that point. The subsamples are

made up of words randomly chosen (without replacement) from throughout the transcript.

The program then finds the best fit between the theoretical model and the empirical data by a

curve-fitting procedure which adjusts the value of the parameter (D) in the equation until a

match is obtained between the actual curve for the transcript and the closest member of the

family of curves represented by the mathematical model. This value of the parameter for best

fit is the index of lexical diversity. High values of D reflect a high level of lexical diversity

and lower diversity produces lower values of D. It should be emphasized that, being based on

a probabilistic model which predicts expected values, the calculating processes are stochastic,

and averaging is intrinsic at all stages.

1.2 Calculation of D

The program vocd is written in ‘C’ and exists in UNIX, PC and Macintosh versions. It

operates on ASCII files of transcripts set out and coded according to the Codes for the

Human Analysis of Transcripts (CHAT) system developed by Brian MacWhinney as part of

the Child Language Data Exchange System (CHILDES). The goal of the CHILDES project is

to facilitate the study of child language development through a common transcription format,

the sharing of transcript data and the provision of computerized tools for analysis

(MacWhinney, 1995; MacWhinney and Snow, 1990). The text-handling features of vocd

have been modelled on MacWhinney’s Computerized Language Analysis (CLAN) programs,

particularly a program called freq whose function is to make frequency counts of words or

codes. Freq and vocd are compatible with regard to the items they include in word counts by

default. As with freq, however, the use of various switches and exclude files, in combination

with the careful use of CHAT transcription conventions, enables linguistic items to be

selected or rejected in a way that corresponds with researchers’ theoretical perspective on

7

what should count as a word and what should count as a different word. A description of the

text-handling options is given in Section 1.3 below.

In calculating D, vocd uses random sampling (without replacement) of tokens in

plotting the curve of TTR against increasing token size for the transcript under investigation.

Random sampling has two advantages over sequential sampling. Firstly, it matches the

assumptions underlying the probabilistic model. Secondly, unlike sequential sampling, it

avoids the problem of the curve being distorted by the clustering of the same vocabulary

items at particular points in the transcript.

In practice each empirical point on the curve is calculated from averaging the TTRs of

100 trials on subsamples consisting of the number of tokens for that point, drawn at random

from throughout the transcripts. This default number was found by experimentation and

balanced the wish to have as many trials as possible with the desire for the program to run

reasonably quickly. The run time has not been reduced at the expense of reliability, however,

as it was found that taking 100 trials for each point on the curve produced consistency in the

values output for D without unacceptable delays.

Which part of the curve is used to calculate D is crucial. Firstly, the final value of N

(the number of tokens in a subsample) cannot be larger than the transcript itself and, in order

to have subsamples to average for the final point on the curve, needs to be smaller. Moreover,

although the transcripts and texts to be analyzed will vary hugely in total token count, the

range chosen for the users of vocd must be applicable to investigations such as child language

studies, where typically transcripts may contain only one hundred to a few hundred tokens in

total. Secondly, the equation is derived by approximation from Sichel’s (1986) model and the

approximations apply with greater accuracy at lower numbers of tokens (and note that

Tweedie and Baayen (1998) demonstrated that all the models they investigated, including

Sichel’s, fail for large token sizes). We conducted extensive trials calculating D over different

8

parts of the curve to find a portion for which the approximation held good and averaging

worked well. As a result of these trials the default is for the curve to be drawn and fitted for

N=35 to N=50 tokens in steps of 1 token. We should emphasize that, as noted above, each of

these points is calculated from averaging 100 subsamples, each drawn from the whole of the

transcript, so that although only a relatively small part of the curve is fitted, it uses all the

information available in the transcript. This also has the advantage of calculating D from a

standard part of the curve for all transcripts regardless of their total size, further providing for

reliable comparisons between subjects and between the work of different researchers.

The procedure then depends on finding the best fit between the empirical and

theoretically derived curves by the least square difference method. Extensive testing

confirmed that the best-fit procedure was valid and was reliably finding a unique minimum at

the least square difference.

Finally, as the points on the curve are averages of random samples, a slightly different

value of D is to be expected each time the program is run. Tests showed that with the chosen

defaults these differences are relatively small, but consistency was improved by vocd

calculating D three times by default and giving the average value as output (see Fig. 2).

As noted above, the software plots the TTR versus token curve from 35 tokens to 50

tokens and each point on the curve is produced by random sampling without replacement.

The software therefore requires a minimum of 50 tokens to operate. We should point out,

however, that the matter of minimum sample size needs further investigation: the fact that the

software will satisfactorily output a value of D from a sample as small as 50 tokens does not

guarantee that values obtained from such small samples will be reliable.

9

1.3 Text-handling Options

The software can be used to investigate lexical diversity in single or multiple files of speech

or writing provided that they are transcribed accurately in legal CHAT format. This can be

ascertained by running them through the CHILDES check program (MacWhinney, 1995)

which will also help to track down typographical and spelling errors, and by using freq (see

above) to create a complete word list that can be scanned for further errors.

Various switches and options can be entered into a UNIX-style command line in order

to select text for analysis and to determine what is counted as a word and a different word.

Examples of these functions are listed below.

Specification of speaker.

Morphemicization. By default, vocd treats inflected words (transcribed as “go”, “go-es”,

“go-ing”) as different word types. This switch enables them to be treated as tokens of a

single word type.

Inclusion or exclusion. Single words, a list of words in a separate ASCII file, or whole

utterances whose codings are selected by the user can be excluded or included in the

analysis. Exclude files contain a list of items to be omitted (for example, hesitation

markers, laughter, etc.). Include files consist of only those words one wishes to be

included. This would be useful, for example if a lexical diversity value for just closed-

class items were required.

Split-half reliability. This switch allows separate analysis of odd and even words in the

transcript. Results can then be fed into a split-half reliability analysis.

Include capitalized words only in the analysis.

Exclude retracing. This switch will remove self-repetitions and self-corrections.

Upper case/lower case. By default, vocd would treat “may” (modal verb) and “May” (the

month) as the same word type. This switch allows them to be counted as different types.

10

Non-completed words. Words frequently occur in a shortened form such as “till” for

“until” or “cos” for “because”. Sometimes a speaker will alternate between the full and

shortened form, which might distort analyses if vocd treated them as different word types.

There are three options for dealing with, for example, a word transcribed as “(be)cause”:

it can be processed as “because” (the default), as “(be)cause”, or as “cause”.

Text replacement. In the utterance “it goed [: went] in there” a child’s “goed” is followed

by the adult form in square brackets. By default preceding text is replaced by the text in

square brackets, so the form processed by would be “went”. This allows alternative

realizations of a word (e.g. yes, yeah, yep) to be treated as the same word type. This

facility can also be overridden so that “goed” would be returned rather than “went”.

1.4 Validation

The validity of D has been the subject of extensive investigation (Malvern and

Richards, 1997; Richards and Malvern, 1997a; Richards and Malvern, 1998; Malvern and

Richards, 2000) on samples of child language, children with Specific Language Impairment

(SLI), children’s narratives, adult learners of English as a second language, and academic

writing. In these validation trials, the empirical TTR versus token curves for a total of 128

transcripts from four corpora covering ages from 24 months to adult, three languages and a

variety of settings, all fitted the model. The model produced consistent values for D which,

unlike TTR and even Mean Segmental TTR (the average TTR for segments of a given token

size), correlated well with other well-validated measures of language (see Richards and

Malvern, 1997a: pp. 35-38 for an overview of Mean Segmental TTR). These five corpora

also provide useful indications of the scale for D.

We now turn our attention to the relationship between D and a number of measures of

foreign language proficiency in a sample of 34 GCSE candidates. We will also compare the

11

teachers’ D with that of the pupils and look for evidence of teachers’ accommodation to

students’ proficiency levels.

2. Method

We obtained the tape recordings of 34 state secondary school students taking their oral

examination in French for the General Certificate of Secondary Education, a national

examination taken by students in the United Kingdom at the age of 16. Oral interviews

(officially described as “free conversation”), which averaged just over five minutes in length,

had been conducted by two non-native, but experienced, teachers of French, each of whom

examined the students in his or her own French class. The interviews were transcribed in

CHAT format (MacWhinney 1995) by a native French speaker (Francine Chambers) who

was an experienced teacher of French to English-speaking children of this age, and were re-

transcribed by the first author of this article (Brian Richards). Discrepancies were resolved

with the assistance of a near-native speaker of French who was also an experienced teacher of

French as a foreign language in the UK (Mair Richards). Where discrepancies could not be

resolved, the utterance or word was coded as “unintelligible” and omitted from the analysis.

CHAT format enabled computer-assisted analyses to be carried out by the CLAN programs

of the CHILDES project (MacWhinney & Snow 1990)2.

2.1 The Students

The students had been learning French for five years, receiving four 35-minute lessons a

week. In theory, their ability should have been above average because they had all been

entered for the “higher level” examination, designed to extend more able pupils. In practice,

it became clear that some students had been inappropriately entered for the higher

examination, and proficiency varies in the sample from those who made very little

12

contribution to the conversation to one student who performed at a level comparable to

native-speakers. This student obtained extreme values on some measures and had to be

omitted from some of the analyses reported below. Twelve pupils were interviewed by

Teacher A and 22 by Teacher B.

2.2 Student Variables

Student variables are listed in Table 1. These are of three types. Firstly there are objective

measures taken from the transcripts with the assistance of the CLAN software (MacWhinney

1995). Of these, the vocabulary diversity measure, MSTTR-30, needs some explanation. The

mean segmental type-token ratio (MSTTR) has a long history as a measure of range of

vocabulary (see Johnson 1944) and, unlike many other measures of vocabulary diversity

(Malvern & Richards 1997), is not affected by variations in sample size (i.e. the number of

words produced by each speaker). In this case we divided each transcript into 30-word

segments of student speech and calculated the average number of different words contained

in each segment.

The second type of student measure comes from the final results of the GCSE

examination. The examining group converted the scores on each of the four skills to a mark

out of seven, and it is these that are reported here.

Thirdly, six further measures were obtained from the mean ratings of the tape

recordings by 24 experienced teachers of French. Range of vocabulary, fluency and

complexity of structure were rated on eight-point scales (0-7) and content, accuracy and

pronunciation on four-point scales (0-3). Full details of the scales and the procedures used

can be found in Richards & Chambers (1996).

Finally, using the vocd software, values of D were obtained for both the teachers and

the students. Seven students produced less than 50 word tokens in their oral examination and

13

for these no D values could be calculated. Sample sizes for the analyses are therefore either

27 or 26 depending on whether the extreme case was included. The extreme case was

included only when non-parametric statistics were used.

3. Results

3.1 The Students

It had been predicted that, as a valid measure of vocabulary diversity which is independent of

the quantity of language produced, D would correlate most strongly with the other vocabulary

measures except for TTR which is invalid when sizes of language sample vary. In Table 1 it

can be seen that this is indeed the case: the correlation with MSTTR-30 stands apart even

from other significant correlations as the most powerful in the set, the other significant

relationships being with number of different words, the oral score and fluency. Clearly then,

D is sensitive to general language proficiency but at a relatively low level. As expected, there

is no correlation with TTR.

A more surprising result is the lack of any correlation with teachers’ ratings of Range

of Vocabulary. This is, however, more understandable when the full matrix of inter-

correlations is inspected. Range of Vocabulary correlates very highly indeed (always over .9)

with the other scales used by the 24 teachers of French. The highest figure was with Content

at .996. Unlike with the more objective measures, therefore, it seems that the teachers were

unable to discriminate between vocabulary deployment and other areas of proficiency. The

rating of Range of Vocabulary is likely to be heavily contaminated by halo effects.

3.2 The Relationship between Teacher D and Student Measures

The aim of this analysis was to assess whether the teachers’ deployment of vocabulary was

finely tuned to the language proficiency of the students. A positive correlation between

14

teachers’ D and student variables would be indicative of conscious or sub-conscious

accommodation strategies.

A comparison between the average D for teachers and students is revealing. Even

with the extreme case excluded, and confining the analysis to the remaining 26 teacher and

student scores for whom D could be calculated, the mean for the students (55.1) is higher

than for the teachers (46.7). At the same time the range (29.6-77.8) and standard deviation

(13.8) for the students are also higher than for the teachers (29.9-63.9 and 7.5). A Pearson

correlation between the two sets of Ds is not significant ( r = .14; df = 24; ns). Correlations

computed for each teacher separately are also non-significant.

By contrast, teachers’ Ds do enter into significant, positive correlations with 12 out of

14 measures of the students language. These are shown in Table 2. At first sight it would

appear, therefore, that the teachers are using greater lexical diversity with pupils whose

language is more proficient, thus engaging in the kind of discourse accommodation, or even

fine-tuning which is considered by some (e.g. Ross & Berwick, 1992) to be essential to a

valid oral interview.

The above interpretation would only be correct, however, if each teacher could be

shown to be behaving in the way suggested by their pooled data. Separate correlations for the

Ds of Teacher A and Teacher B with the language measures of their pupils show that this is

far from being the case. These two sets of correlations are shown in Tables 3 and 4. Teacher

A shows no evidence of such discourse accommodation other than a positive correlation with

the length of student turn. The remaining correlations hover around zero or tend towards the

negative (GCSE points). Accommodation to students’ intelligibility appears to be the reverse

of what would be expected: the teacher produces greater vocabulary diversity in response to

more unintelligible words from the students. The correlations for Teacher B overwhelmingly

15

tend towards the negative, and are all statistically significant where the ratings of the 24

teacher of French are concerned.

How is it possible for there to be a positive correlation for pooled data and yet a

strong tendency towards a negative relationship when the analysis is performed for each

teacher separately? The answer appears to lie in the difference in ability between the two

classes. This can be demonstrated by concentrating on the strongest correlation in the pooled

data, that between Teacher D and students’ GCSE points. Figure 3 shows a scatterplot of this

correlation in which the two teachers are indicated separately. From this it can be seen that

there is a strong tendency for Teacher A and her pupils to score lower both on D and in the

GCSE French examination than is the case for Teacher B. In fact the mean number of GCSE

points for Teacher A’s pupils is 13.3 compared with 20.3 for Teacher B. This difference is

statistically significant (t = -7.31; df = 31; p = .000). Similarly, the average D for Teacher A

is 35.4 compared with 49.2 for teacher B. This difference is also significant (t = -6.57; df =

31; p = .000). This separation between the two groups can be seen even more starkly in

Figure 4 which shows the mean D plus and minus two standard deviations plotted against the

pupils’ rank order on GCSE points. There is little overlap between the groups either on their

GCSE results or the Ds of their teachers.

4. Conclusion

The findings reported above provide further evidence of the validity of mathematical

modelling the relationship between TTR and token size to assess vocabulary diversity. As

predicted, D correlated with another measure of vocabulary diversity, MSTTR-30 rather than

with measures of general language proficiency. As expected, there was no correlation

between D and TTR. One question raised is why D should be preferred to MSTTR as a

measure of lexical diversity. After all, MSTTR, because it averages TTRs from fixed-size

16

segments of the language sample, also overcomes the problem of variation in sample size.

However, there are several problems with MSTTR. The first is that different researchers use

different sized segments, meaning that their results cannot be compared. Secondly, splitting

the language sample into segments and calculating TTRs from those is less reliable than the

random sampling carried out by vocd. D is based on the lexical diversity of the whole sample

rather than isolated segments and takes account of the whole TTR versus Token curve rather

than a single point on it. MSTTR fails to capture the repetition of words between segments.

The second aim of this research was to investigate whether variation in teachers’

vocabulary diversity was itself a form of accommodation to the linguistic proficiency of their

students. At first sight this seemed to be the case—there was a significant correlation between

the Ds of the teachers and a wide range of language measures for the pupils. Closer

investigation, however, showed that this overall effect was not replicated for either teacher

when their data were analysed separately. It has been brought about by a significant

difference in the ability of each class which corresponded with a significant difference in the

average D for each teacher. What appears to be happening is that, while the language of each

teacher is not finely tuned to the ability of the pupils, they are roughly pitching their language

to their perception of the ability of the class as a whole. It will be recalled that there was far

more variation in the D values for the pupils than for the teachers. There is a tendency

therefore, at least in the context of conducting a public examination, for each teacher to

provide an approximately standard level of language across all pupils which is judged to be

appropriate for the group as a whole.

17

Notes

1. The vocd program was written by Gerard McKee of the Department of Computer

Science, The University of Reading. The research project “A new research tool:

mathematical modelling in the measurement of vocabulary diversity” was funded by

grants awarded to Brian Richards and David Malvern from the Research Endowment

Trust Fund of The University of Reading and by the Economic and Social Research

Council, UK (no. R000221995). We would also like to acknowledge the help and

advice of Raymond Knott who acted as consultant to the project.

2. The 34 transcripts are available from the CHILDES data base on the World-Wide

Web. This can be found at http://childes.psy.cmu.edu/database.html/ where

“bilingual corpora”, then “Reading” should be selected.

18

References

Brainerd, B. (1982). The Type-token Relation in the Works of S. Kierkegaard. In R. W. Bailey

(ed), Computing in the Humanities. North-Holland, Amsterdam, pp. 97-109.

Brunet, E. (1978). Vocabulaire de Jean Giraudoux: Structure et Evolution. Slatkine, Geneva.

Carroll, J. B. (1964). Language and Thought. Prentice-Hall, Englewood Cliffs, New Jersey.

Dugast, D. (1979). Vocabulaire et Stylistique. I Théâtre et Dialogue. Travaux de Linguistique

Quantitative. Slatkine-Champion, Geneva.

Good, I. J. (1953). The Population Frequencies of Species and the Estimation of Population

Parameters. Biometrika 40: 237-264.

Guiraud, H. (1954). Les Caractères Statistiques du Vocabulaire. Presses Universitaires de

France, Paris.

Guiraud, P. (1960). Problèmes et Méthodes de la Statistique Linguistique. D. Reidel,

Dordrecht.

Herdan, G. (1960). Quantitative Linguistics. Butterworth, London.

Hockey, S. (1988). Micro-OCP. Oxford University Press, Oxford.

Honoré (1979). Simple Measures of Richness of Vocabulary. Association for Literacy and

Linguistics Computing Bulletin 7(2): 172-179.

Johnson, W. (1944). Studies in language behavior: I. A program of research. Psychological

Monographs 56, 1-15.

Maas, H.-D. (1972). Zusammenhang zwischen Wortschatzumfang und Länge Eines Textes.

Zeitschrift für Literaturwissenschaft und Linguistik 8: 73-79.

MacWhinney, B. (1995). The CHILDES Project: Tools for Analyzing Talk (2nd ed). Erlbaum,

Hillsdale, New Jersey.

MacWhinney, B. and Snow, C. E. (1990). The Child Language Data Exchange System: an

Update. Journal of Child Language, 17: 457-72.

19

Malvern, D. D., and Richards, B. J. (1997). A new measure of lexical diversity. In A. Ryan and

A. Wray (eds), Evolving models of language. Multilingual Matters, Clevedon, pp. 58-

71.

Malvern, D. D., and Richards, B. J. (2000). Validation of a New Measure of Lexical

Diversity. In M. Beers, B. v. d. Bogaerde, G. Bol, J. de Jong and C. Rooijmans (eds),

From Sound to Sentence: Studies on First Language Acquisition. Centre for Language

and Cognition, Groningen, pp. 81-96

Miller, J. F. and Chapman, R. (1993). SALT: Systematic Analysis of Language Transcripts,

Version 3.0. University Park Press, Baltimore.

Richards, B. J. (1987). Type/Token Ratios: what do they Really Tell us? Journal of Child

Language 14: 201-9.

Richards, B. J. & Chambers, F. (1996). Reliability and validity in the GCSE oral

examination. Language Learning Journal 14, 28-34.

Richards, B. J., and Malvern, D. D. (1997a). Quantifying Lexical Diversity in the Study of

Language Development. The University of Reading New Bulmershe Papers, Reading.

Richards, B. J., and Malvern, D. D. (1997b). Type-Token and Type-Type Measures of

Vocabulary Diversity and Lexical Style: an Annotated Bibliography. Faculty of

Education and Community Studies, The University of Reading, Reading. (Also

available on the World Wide Web at: http://www.rdg.ac.uk/~ehsrichb/)

Richards, B. J., and Malvern, D. D, (1998). A New Research Tool: Mathematical Modelling in

the Measurement of Vocabulary Diversity (Award reference no. R000221995). Final

Report to the Economic and Social Research Council, Swindon, UK.

Ross, S. & Berwick, R. (1992). The discourse of accommodation in oral proficiency

interviews. Studies in Second Language Acquisition 14, 159-76.

20

Sichel, H.S. (1975). On a Distributive Law for Word Frequencies. Journal of the American

Statistical Association 70: 542-547.

Sichel, H. S. (1986). Word frequency distributions and type-token characteristics.

Mathematical Scientist, 11: 45-72.

Tuldava, J. (1978). Quantitative Relations between the Size of the Text and the Size of

Vocabulary. SMIL Quarterly, Journal of Linguistic Calculus 4: 28-35.

Tweedie F.J. and Baayen, R.H. (1998). How Variable may a Constant be? Measures of

Lexical Richness in Perspective. Computers and the Humanities 32: 323-352.

Yule, G.U. (1944). The Statistical study of Literary Vocabulary. Cambridge University Press,

Cambridge.

21

Figure 1: Family of curves showing increasing diversity with increasing values of D.

22

Figure 2: Flow chart for producing values of D

User selects options for text handling

vocd reads text and selects and excludes

parts of text as specified by the user

vocd outputs a sequential list of utterances

containing only the word tokens to be

entered into the analysis

vocd calculates average TTRs for 100

random samples of a given token size from

35-50 tokens, producing a sequence of

points on a TTR versus token curve

3x

vocd adjusts the parameter D until the best

fit between the TTR versus token curve

and the theoretical model is found

vocd outputs the value of D which provides

the best fit, and the residuals

The average of 3 Ds is output

23

24

25

0 10 20 300

20

40

60

80

D Teach BD Teach A

Figure 3: Scatterplot of D values for each teacher

GCSE Points

D Te

ache

rs

against pupils' GCSE points

26

0 10 20 30 400

20

40

60

80

Figure 4: D for each teacher against students in order

Students in order of GCSE points

D Te

ache

rs

of their GCSE points, showing mean D and the interval of plus and minus 2 standard deviations for each teacher

Teacher A

Teacher B

Table 1: Rank order correlations between student D and other student measures (N=27)

Student Measure r

Measures Taken from Transcripts

Number of Words .18

Number of Different Words .35 *

MSTTR-30 .59 **

Mean Length of Utterance (MLU words) .23

Utterances per Turn (MLT) .09

Percentage unintelligible words .02

Words per Minute .23

Type-Token Ratio .20

GCSE Examination Results

Score for Oral Examination (out of 7) .34 *

GCSE points .31

Mean Ratings from 24 Teachers of French

Range of Vocabulary .31

Fluency .33 *

Complexity of Structure .31

Content .30

Accuracy .31

Pronunciation .23

* p<.05 ** p<.01

27

Table 2: Pearson product moment correlations between Teacher D and measures of

students’ language (N=33)

Student Measure r


Number of Words .41 **

MSTTR-30 .07

Mean Length of Utterance (MLU words) .43 **

Utterances per Turn (MLT) .35 *

Percentage unintelligible words -.02

Words per Minute .49 **


Score for Oral Examination (out of 7) .43 **

GCSE points .52 **


Range of Vocabulary .40 *

Fluency .40 *

Complexity of Structure .35 *

Content .44 **

Accuracy .42 **

Pronunciation .33 *

* p<.05 ** p<.01

28

Table 3: Pearson product moment correlations between the Ds for Teacher A and

measures of students’ language (N=12)

Student Measure r


Number of Words -.03

MSTTR-30 .43

Mean Length of Utterance (MLU words) -.19

Utterances per Turn (MLT) .62 *

Percentage unintelligible words .50 *

Words per Minute .07


Score for Oral Examination (out of 7) -.22

GCSE points -.44


Range of Vocabulary -.01

Fluency -.03

Complexity of Structure -.05

Content .01

Accuracy -.03

Pronunciation -.08

* p<.05 ** p<.01

29

Table 4: Pearson product moment correlations between the Ds for Teacher B and

measures of students’ language (N=21)

Student Measure r


Number of Words -.10

MSTTR-30 .14

Mean Length of Utterance (MLU words) -.13

Utterances per Turn (MLT) -.12

Percentage unintelligible words -.03

Words per Minute -.23


Score for Oral Examination (out of 7) -.27

GCSE points -.10


Range of Vocabulary -.39 *

Fluency -.44 *

Complexity of Structure -.41 *

Content -.38 *

Accuracy -.42 *

Pronunciation -.40 *

* p<.05 ** p<.01

30

Documents

VOCD: SOFTWARE FOR MEASURING VOCABULARY … · Web viewThe text-handling features of vocd have been modelled on MacWhinney’s Computerized Language Analysis (CLAN) programs, particularly