Inter-textual Vocabulary Growth Patterns for Marine Engineering English

FENG Zhi-wei
Institute of Applied Linguistics, The Ministry of Education, Beijing, China
Communication University of China
E-mail: [email protected]


Page 1: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Inter-textual Vocabulary Growth Patterns for Marine Engineering English

FENG Zhi-wei

Institute of Applied Linguistics, The Ministry of Education, Beijing, China

Communication University of China
E-mail: [email protected]

Page 2: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

ABSTRACT

This talk will explore two fundamental issues concerning the inter-textual vocabulary growth patterns for Marine Engineering English, based on a large-scale authentic corpus: vocabulary growth models and newly occurring vocabulary distributions in cumulative texts.

Four mathematical models (Brunet’s model, Guiraud’s model, Tuldava’s model, and Herdan’s model) are tested against the empirical growth curve for Marine Engineering English (MEE).

A new growth model is derived from the logarithmic function and the power law.

The theoretical mean vocabulary size and the 95% upper and lower bound values are calculated and plotted as functions of the sample size. The research is carried out on the basis of the DMMEE (Dalian Maritime University Marine Engineering English) corpus of DMU (Dalian Maritime University, China).

This research has applications in explicit EFL (English as a Foreign Language) teaching and learning. The new growth model can make reliable estimates not only of the vocabulary size and its 95% confidence intervals for a given textbook, but also of the volume of individual texts needed to produce a particular vocabulary size.

Page 3: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Keywords

Inter-textual Vocabulary Growth Model; Newly Occurring Vocabulary Distributions; Logarithmic Function; Power Law

Page 4: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

1. Introduction

Page 5: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Apart from differences in the lexicon, there is only one human language

Lexis is the core of description in Firthian linguistics. In 1957, Firth first put forward the theory of collocation, which separates the study of lexis from traditional grammar and semantics.

Neo-Firthians built on the lexis-oriented research approach and further developed Firth’s lexical theory. Halliday (1966: 148-162) presupposes that lexis operates at an independent linguistic level and does more than fill grammatical slots in a clause.

As Smith (1999: 50) once commented on the significance of vocabulary, “the lexicon is of central importance and is even described as being potentially the locus of all variation between languages, so that apart from differences in the lexicon, there is only one human language.”

Page 6: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Effective vocabulary acquisition is of particular importance

Effective vocabulary acquisition is of particular importance for EFL learners (Ellis 2004), who frequently acquire impoverished lexicons despite years of formal study.

The mushrooming number of experimental studies and pedagogical publications also demonstrates beyond all doubt that the field of vocabulary studies is now anything but a neglected area (Schmitt 2002).

Since lexical items should be acquired in certain quantities at certain rates, it is essential for both teachers and learners to be aware of the characteristics of vocabulary growth patterns, such as vocabulary growth models and word frequency distributions.

These characteristics are of practical significance not only for the resolution of a series of problems in vocabulary acquisition but also for the theoretical explanation of some important issues in quantitative lexical studies.

Page 7: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Qualitative and statistical approaches

At present, lexical study shows the characteristics of a multilayer development supported by computer technology and corpus evidence; the description of lexis is no longer confined to a single qualitative approach but is complemented by statistical approaches.

Page 8: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Meara’s lexical uptake model

Meara (1997) proposed a lexical uptake model (Equation 1.1) for incidental teaching and learning. A learner’s acquisition of vocabulary is expressed by a complex interaction between three variables: accumulated capacity of text input (N), number of new words provided in input texts (V(N)_i), and likelihood of the learner picking up any individual word from those texts (p).

$$V(N) = p \sum_{i=1}^{n} V(N)_i \qquad (1.1)$$

V(N): size of acquired vocabulary
V(N)_i: the new vocabulary that the ith input text provides
N: total number of input texts
p: an empirical parameter whose actual value is affected by the teacher’s and learner’s experience
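To make Equation 1.1 concrete, the following is a minimal sketch of the uptake calculation. The function name, the per-text counts, and the value of p are hypothetical illustrations, and p is assumed here to behave like a probability between 0 and 1.

```python
def acquired_vocabulary(new_words_per_text, p):
    """Equation 1.1: V(N) = p * (sum of V(N)_i over the input texts)."""
    # Assumption: p is treated as a probability; the slides only call it
    # an empirical parameter.
    if not 0.0 <= p <= 1.0:
        raise ValueError("p is assumed to lie in [0, 1]")
    return p * sum(new_words_per_text)


# Hypothetical input: new words contributed by each of four texts.
new_words = [120, 80, 60, 40]
print(acquired_vocabulary(new_words, p=0.25))  # 75.0
```

With these made-up figures the four texts supply 300 new words in total, of which a quarter are taken up.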

Page 9: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Meara’s lexical uptake model

Model 1.1 is in fact determined by two factors: the accumulated capacity of text input (N) and the number of new words provided in input texts (V(N)_i). These two factors constrain each other through some complex functional relationship and determine the vocabulary size V(N) of a learner.

In other words, as long as we manage to model the functional relationship between N and V(N) (called the vocabulary growth model in this paper), we can calculate the acquired vocabulary size and the vocabulary growth rate of a learner.

Page 10: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Why do we select DMMEE?

Some models describe the vocabulary growth pattern for general English with sufficient precision, but none of them has ever been tested against the vocabulary development of Marine Engineering English (MEE), a language for special purposes (LSP).

In this paper, four existing models (Brunet’s model, Guiraud’s model, Tuldava’s model, and Herdan’s model) will be tested against the empirical growth curve for MEE.

A new growth model will be proposed and evaluated on the basis of the Dalian Maritime University Marine Engineering English Corpus (DMMEE).

Page 11: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

V(N) and N

The vocabulary size V(N) in this paper is defined as the size of the set of all the lemmas in a given text, excluding Arabic numerals, non-word strings, and personal and place names, since including them would gravely distort the vocabulary growth curves. V(N) is similar to the number of types in language statistics.

By contrast, the size N of a text or sample refers to the total number of character clusters or single characters, including words, names, Arabic numerals, and non-word strings in the text, but excluding punctuation marks. N is similar to the number of tokens in language statistics.
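The two definitions can be sketched as a small counting routine. This is only an illustration under stated assumptions: lower-cased word forms stand in for lemmas (no real lemmatizer is used), names are excluded through a hypothetical `excluded_names` list, and the `WORD` pattern is one possible reading of "non-word string".

```python
import re
import string

# Assumption: a "word" is alphabetic, optionally with internal hyphens.
WORD = re.compile(r"^[A-Za-z]+(?:-[A-Za-z]+)*$")


def count_tokens_and_types(text, excluded_names=frozenset()):
    """Return (N, V(N)) for one text under the simplifications above."""
    # N: every character cluster, with surrounding punctuation stripped;
    # clusters that were punctuation only are dropped.
    clusters = [c.strip(string.punctuation) for c in text.split()]
    clusters = [c for c in clusters if c]
    # V(N): distinct word forms, excluding numerals, non-words, and names.
    vocabulary = {
        c.lower()
        for c in clusters
        if WORD.match(c) and c.lower() not in excluded_names
    }
    return len(clusters), len(vocabulary)


text = "The engine room in Dalian has 2 pumps; the pumps run."
print(count_tokens_and_types(text, excluded_names={"dalian"}))  # (11, 7)
```

The numeral and the place name count toward N but not toward V(N), which mirrors the asymmetry between the two definitions.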

Page 12: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

2. Existing Vocabulary Growth Models

Page 13: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Some early studies

In the early 1930s, G. K. Zipf (1935) set out the rank-frequency law, which holds that the product of the frequency of a word in a text and the rank of that word in a frequency list is approximately constant.

In the 1960s, Carroll (1967; 1969) stated for the first time that word frequency distributions are characterized by large numbers of extremely low-probability words.

Page 14: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Some early studies

This property sets quantitative lexical studies apart from most other areas in statistics and leads to particular statistical phenomena, such as vocabulary growth patterns that keep changing systematically as the sample size increases.

In the 1980s and 1990s, special statistical techniques were developed to describe lexical frequency distributions accurately, and a number of vocabulary growth models were proposed.

Page 15: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Four important models

Of particular interest among these vocabulary growth models are the following:

• Brunet’s model
• Guiraud’s model
• Tuldava’s model
• Herdan’s model

Page 16: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Brunet’s model (1978) 1/2

Brunet’s model (1978):

(2.1)

Hence,

(2.2)

Page 17: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Brunet’s model (1978) 2/2

W: Brunet’s constant, serving as the base of the logarithmic function.

Parameter W has no probabilistic interpretation; it varies with the size of the sample. Parameter α is usually fixed at 0.17, a heuristic value that has been found to produce a roughly constant relation between log_W V(N) and log_W N (Baayen 2001).
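For orientation, Brunet’s index is commonly written in the form below, which matches the description of W as the base of a logarithm and α as a fixed exponent. This is the form usually cited in the literature and is assumed here; it is not recovered from the slide’s own Expressions 2.1 and 2.2.

```latex
% Commonly cited form of Brunet's index (an assumption, not the slide's text):
W = N^{V(N)^{-\alpha}}
% Taking logarithms to base W gives 1 = V(N)^{-\alpha} \log_W N, hence
V(N) = \left( \log_W N \right)^{1/\alpha}
```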

Page 18: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Guiraud’s model 1/2

In Guiraud’s model (1990), the square root of the sample size replaces the sample size in what is known as the type-token ratio, V(N)/N, as follows:

$$R = \frac{V(N)}{\sqrt{N}}$$

Hence,

$$V(N) = R\sqrt{N} \qquad (2.2)$$

Page 19: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Guiraud’s model 2/2

R: Guiraud’s constant, serving as the coefficient of the formula.

The vocabulary size V(N) in Guiraud’s model reduces to a simple square-root function of the sample size N, but the parameter R changes systematically with the sample size.

Page 20: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Tuldava model 1/2

Starting from the hypothesis that the relation between V(N)/N and N is approximated by a power function, Tuldava (1990) constructed a vocabulary growth model (Expression 2.3) by applying logarithmization to both variables, V(N) and N.

(2.3)

Page 21: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Tuldava model 2/2

e: the natural base, e = 2.71828…

Parameters α and β have no probabilistic interpretation; they are coefficients of variety that are supposed to be correlated with the probabilistic process of choosing “new” (unused) and “old” (already used in the text) words at each stage of text processing.
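As a point of reference, Tuldava’s model is usually cited in the form below, which uses the natural base e and the two parameters α and β described above. Both lines are assumptions drawn from the general literature rather than from the slide; a and b are placeholder constants for the power-function hypothesis.

```latex
% Power-function hypothesis (a, b are placeholder constants, assumed notation):
\frac{V(N)}{N} = a N^{-b}
% Commonly cited form of Tuldava's growth model after logarithmization:
V(N) = N e^{-\alpha (\ln N)^{\beta}}
```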

Page 22: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

G. Herdan model 1/2

G. Herdan (1964) proposed a power model based on the observation that the growth curve of the vocabulary appears as roughly a straight line in the double logarithmic plane.

Page 23: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

G. Herdan model 2/2

Hence, Expression 2.4 is Herdan’s model:

(2.4)

α and β in Herdan’s model are empirical parameters; they have no sensible interpretation.
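A straight line in the double logarithmic plane suggests a simple way to estimate the two parameters. The sketch below assumes the usual power-law form V(N) = α·N^β, which is how Herdan’s model is generally stated and is not read off the slide, and fits it by ordinary least squares on log N and log V(N). The data points are synthetic.

```python
import math


def fit_herdan(sample_sizes, vocab_sizes):
    """Least-squares fit of log V = log(alpha) + beta * log N."""
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(v) for v in vocab_sizes]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                       # slope in the log-log plane
    alpha = math.exp(mean_y - beta * mean_x)  # intercept, back-transformed
    return alpha, beta


# Synthetic points generated from V(N) = 10 * N**0.5 (not corpus data).
sizes = [1_000, 10_000, 100_000]
vocab = [10 * n ** 0.5 for n in sizes]
alpha, beta = fit_herdan(sizes, vocab)
print(round(alpha, 3), round(beta, 3))
```

Because the synthetic points lie exactly on a power curve, the fit recovers the generating parameters up to floating-point error; real growth curves would deviate from the line.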

Page 24: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Short summary

The current vocabulary growth models are of particular importance for the theory of quantitative statistics, but they are seldom used to solve practical lexico-statistical problems, either because their complex expressions are computationally intractable or because the model parameters keep changing systematically with the sample size.

Moreover, none of the current growth models has been tested against the vocabulary development of MEE in the light of corpus evidence.

Page 25: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

3. DMMEE and Methodology

Page 26: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

DMMEE corpus

DMMEE was designed to represent contemporary MEE.

DMMEE was compiled with a little more than one million word tokens.

Altogether 959 text samples were collected in the corpus; the length of each text varies from 349 to 2070 word tokens.

Page 27: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Detailed information on the size of DMMEE

Page 28: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Summers (1991) approach

To achieve representativeness and balance in text sample selection, Summers (1991) outlined a number of possible approaches, including:

• an “elitist” approach based on literary or academic merit or “influentialness”;
• random selection;
• “currency”;
• subjective judgment of “typicalness”;
• availability of text in archives;
• demographic sampling of reading habits;
• empirical adjustment of text selection to meet linguistic specifications.

Page 29: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

959 texts in DMMEE

By using a combination of these approaches, 959 representative texts were selected from today’s most influential MEE materials that are produced or published in either Britain or the USA.

The publication dates range from 1987 to 2004. The titles of the publications are listed in Appendix A.

Page 30: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Genre categories in DMMEE

DMMEE is organized according to general genre categories such as papers and books.

The quantity of text in each category complies roughly with the proportion of each material genre displayed in Appendix A.

In DMMEE, about 85% of the text samples are taken from papers and books, 10% from lectures and discussions, and the remainder from such sources as brochures and manual instructions.

Page 31: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Overall structure and general genre categories for DMMEE

Page 32: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Subject fields in DMMEE

The text samples of DMMEE were selected from various constituent subject fields of MEE, including marine diesel engine, marine power plant, electrical installation, steering gear, marine refrigeration, gas exchange, lubricating system, propelling system, maintenance and repair, etc.

Within each general subject field, individual subjects can be identified, such as steam turbines and generation sets in the marine power plant.

Page 33: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Five issues of MEE vocabulary growth 1/2

With the aid of the statistical software SPSS and the data-management system FoxPro, we explore five issues of MEE vocabulary growth:

• To verify the descriptive power of current growth models by fitting them to the empirical growth curves.
• To construct a new growth model based on the analysis of current mathematical models.

Page 34: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Five issues of MEE vocabulary growth 2/2

• To compare the new growth model with other known models in terms of parameter estimation, goodness of fit of each model, and analysis of residuals.
• To calculate the theoretical mean vocabulary sizes and their 95% two-sided tolerance intervals using the new model.
• To plot the distributions of newly occurring vocabulary in cumulative texts.

Page 35: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Processing steps: First step

Specifically, the task of data processing was conducted via the following steps:

First, the text samples were randomly drawn from the published written texts of DMMEE and grouped into two sets, the Sample Set and the Test Set, each of which contains about 500,000 tokens. The Sample Set served as the database for examining the four mathematical models and constructing a new vocabulary growth model. The Test Set was used to verify the descriptive power of the new model against MEE.
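The random grouping into two sets can be sketched as follows. The slides state only that texts were drawn at random until each set held about 500,000 tokens, so the fill-one-set-then-the-other strategy, the function name, and the toy text lengths are all assumptions made for illustration.

```python
import random


def split_corpus(text_lengths, target_tokens, seed=0):
    """Randomly assign text indices to a Sample Set, then a Test Set,
    until each set holds at least target_tokens tokens."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    order = list(range(len(text_lengths)))
    rng.shuffle(order)
    sample_set, test_set = [], []
    sample_tokens = test_tokens = 0
    for idx in order:
        if sample_tokens < target_tokens:
            sample_set.append(idx)
            sample_tokens += text_lengths[idx]
        elif test_tokens < target_tokens:
            test_set.append(idx)
            test_tokens += text_lengths[idx]
    return sample_set, test_set


# Toy corpus: 40 hypothetical texts of 500 tokens each, 5,000 tokens per set.
sample_set, test_set = split_corpus([500] * 40, target_tokens=5_000)
print(len(sample_set), len(test_set))  # 10 10
```

Texts left over once both sets are full are simply unused, and no text can appear in both sets.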

Page 36: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

First Step 1/3

The size of each set is determined by two factors (Fan 2006):

• The number of samples: it should be sufficiently large, since insufficient samples cannot capture true inter-textual lexical characteristics.
• The cumulative volume of texts: the total cumulative volume of texts in the samples should be large enough to produce a vocabulary size comparable to that of native speakers.
  -- Nation (1990) and Nagy (1997) place the vocabulary size of educated native speakers at 20,000;
  -- Radford et al. (1999) give a higher figure, around 30,000.

Page 37: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

First Step 2/3

Generally speaking, an EFL learner’s vocabulary should not be lower than those estimates.

A test run revealed that cumulative texts of 500,000 word tokens can roughly cover the vocabulary size of the native speaker.

Page 38: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

First Step 3/3

Information on the two sets is given in the following table.

Page 39: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Second step

The second step was to explore the newly occurring vocabulary distributions.

The newly occurring vocabulary size V(N)_new was calculated and plotted as a function of the sample size N for both the Sample Set and the Test Set. Two FoxPro programs were designed, one to perform random selection of the text samples and one to tokenize and lemmatize each text sample. SPSS was used to plot the inter-textual vocabulary growth curves and the scattergrams of V(N)_new against N for the two sets separately.
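The quantities plotted in this step can be computed with a short loop over the cumulative texts. The sketch below is written in Python rather than FoxPro and assumes each text has already been tokenized and lemmatized into a list of strings; the function name and the toy texts are hypothetical.

```python
def growth_curve(texts):
    """For each cumulative text, return (N, V(N), V(N)_new)."""
    seen = set()      # lemmas observed in all texts so far
    n_tokens = 0      # cumulative sample size N
    rows = []
    for tokens in texts:
        n_tokens += len(tokens)
        new = set(tokens) - seen   # lemmas first occurring in this text
        seen |= new
        rows.append((n_tokens, len(seen), len(new)))
    return rows


# Toy input: three already-lemmatized texts.
texts = [
    ["pump", "valve", "pump"],
    ["valve", "engine"],
    ["engine", "pump", "rudder", "shaft"],
]
for row in growth_curve(texts):
    print(row)
# (3, 2, 2)
# (5, 3, 1)
# (9, 5, 2)
```

The second column traces the vocabulary growth curve, and the third gives the points of the V(N)_new scattergram, both measured text by text.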

Page 40: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

4. Vocabulary Growth Models for MEE

Page 41: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Vocabulary growth curves

To keep the lexical structure of each individual text intact, the vocabulary growth curves are measured in units of text, not necessarily at equally spaced intervals.

The following figure plots the dependence of the vocabulary size V(N) on the sample size N for the Sample Set and the Test Set of DMMEE.

Page 42: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Sample Set and Test Set

Figure: Vocabulary size V(N) as a function of sample size N for the cumulative texts in the Sample Set (left panel) and the Test Set (right panel) of DMMEE.

Page 43: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Curve description

The vocabulary growth curves for the Sample Set and the Test Set exhibit similar patterns.

Consider the left panel. V(N) is not a linear function of N. Initially, V(N) increases quickly, but the growth rate decreases as N increases. By the end of the Sample Set the curve has not flattened out to a horizontal line; that is, the vocabulary size keeps increasing even when the sample size has reached 500,000 word tokens.

Page 44: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

4.1 Test of the Existing Vocabulary Growth Models

Page 45: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

The following figures plot the empirical vocabulary growth curve (tiny circles) for the Sample Set, together with the corresponding expectations (solid line) from Brunet’s model, Guiraud’s model, Tuldava’s model and Herdan’s model.

Page 46: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Brunet’s model

Page 47: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Brunet’s model

Brunet’s model fits the empirical growth curve well. The value of R square (the coefficient of determination) reaches 99.876%. But the upper left panel shows that Brunet’s model underestimates the observed vocabulary sizes both at the beginning and towards the end of the growth curve.

The predicted values are presented in the third column of Appendix-B.

Page 48: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Guiraud’s model

Page 49: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Guiraud’s model

Guiraud’s model fits the empirical growth curve least well of the four models: its R square is 99.812%, the lowest value among them. The upper right panel shows that Guiraud’s model overestimates the observed vocabulary sizes from the beginning of the growth curve to nearly the middle. The predicted values are listed in the seventh column of Appendix-B.

Page 50: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Tuldava’s model

Page 51: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Tuldava’s model

Tuldava’s model fits the empirical growth curve closely, with the R square reaching 99.910%. But Tuldava’s model tends to underestimate the observed vocabulary sizes at the beginning of the curve. The expected values are presented in the fifth column of Appendix-B.

Page 52: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Herdan’s model

Page 53: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Herdan’s model

Though similar in expression to Guiraud’s model, Herdan’s model describes the empirical growth curve best of the four models. The R square reaches the highest value, 99.942%. But the lower right panel shows that Herdan’s model tends to overestimate the observed vocabulary sizes towards the end of the growth curve. The expected values are listed in the ninth column of Appendix-B.
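As an illustration of how such a fit and its R square can be obtained, the sketch below fits a power function V(N) = a·N^b, the usual statement of Herdan’s model, by ordinary least squares on the logarithms. Several points are assumptions: the study used SPSS and its exact fitting procedure is not shown on the slides, the helper names `fit_power_law` and `r_squared` are invented, and the data are synthetic rather than DMMEE values.

```python
import math


def fit_power_law(sizes, vocab):
    """Fit ln V = ln a + b ln N by least squares; return (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in vocab]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def r_squared(observed, predicted):
    """Coefficient of determination in the original (untransformed) scale."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Synthetic data generated from an exact power law (NOT corpus values).
sizes = [1_000, 5_000, 20_000, 100_000, 500_000]
vocab = [20.0 * n ** 0.6 for n in sizes]

a, b = fit_power_law(sizes, vocab)
predicted = [a * n ** b for n in sizes]
print(round(a, 3), round(b, 3), round(r_squared(vocab, predicted), 5))
```

On real corpus data the residuals would not vanish, and the sign of the residuals at each end of the curve is what reveals the over- or underestimation described on these slides.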

Page 54: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

4.2 A New Vocabulary Growth Model for MEE

Page 55: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

New model

Based on the lexico-statistical study of the Sample Set, a new mathematical model (Expression 4.1) is suggested to fit the vocabulary growth curve for MEE.

$V(N) = \alpha \, N^{\beta} \ln N$    (4.1)

The new model is constructed by multiplying the logarithmic function and the power law. Parameter α is the coefficient of the whole expression; parameter β is the exponent of the power function part of the model.

Parameters α and β do not have fixed values; they change slightly with the specific sample size so as to achieve a sufficient goodness of fit.

Page 56: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

The theoretical considerations for proposing the new model 1/3

Brunet’s model is in essence a complex logarithmic function, with log_W N being the independent variable and log_W V(N) the dependent variable.

It tends to underestimate the observed vocabulary sizes both at the beginning and towards the end of the empirical growth curve.

Page 57: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

The theoretical considerations for proposing the new model 2/3

Herdan’s model is a generalized power function.

It tends to overestimate the observed values towards the end of the vocabulary growth curve.

Page 58: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

The theoretical considerations for proposing the new model 3/3

The mathematical combination of the logarithmic function and the power law may therefore provide a good fit to the empirical growth curve for MEE.

The following figure plots the dependence of V(N) on N for the Sample Set and the expectations using the new growth model.

Page 59: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

New model

Figure: Vocabulary size V(N) as a function of the sample size N, measured in units of text. The tiny circles represent the empirical growth curve for the Sample Set, and the solid line the expected growth curve derived from the new mathematical model.

Page 60: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

New model

Page 61: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

New model

The new mathematical model provides a very good fit to the empirical growth curve.

The value of R square reaches 99.945%, higher than for any of the four existing vocabulary growth models, including Brunet’s model (99.876%) and Herdan’s model (99.942%).

The values of parameters α and β for the Sample Set are 3.529696 and 0.442790 respectively.
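A minimal sketch of evaluating the new model with these parameter values follows. The functional form α · ln(N) · N^β is an assumption read off the description "multiplying the logarithmic function and the power law"; the natural logarithm is assumed because it reproduces the textbook figures quoted in the pedagogical examples later in the talk. The function name `expected_vocabulary` is invented for the illustration.

```python
import math

ALPHA = 3.529696   # reported for the Sample Set
BETA = 0.442790    # reported for the Sample Set


def expected_vocabulary(n, alpha=ALPHA, beta=BETA):
    """Expected number of word types after n word tokens.

    ASSUMPTION: model form alpha * ln(n) * n**beta with natural log.
    """
    return alpha * math.log(n) * n ** beta


# Expected vocabulary at a few sample sizes, plus the gain contributed by
# the latest 100,000 tokens, which shrinks as the sample grows.
previous = 0.0
for n in (100_000, 200_000, 300_000, 400_000, 500_000):
    current = expected_vocabulary(n)
    print(n, round(current), round(current - previous))
    previous = current
```

The shrinking increments mirror the shape of the empirical curve: fast growth at first, a falling growth rate later, and no horizontal plateau within 500,000 tokens.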

Page 62: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

4.3 Goodness-of-Fit Test for the New Vocabulary Growth Model

Page 63: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Evaluation with test set

The Test Set of DMMEE is used to evaluate the descriptive power of the new vocabulary growth model. The following figure plots the empirical growth curve for the Test Set (tiny circles) and the corresponding expectations using the new mathematical model (solid line), with α = 3.529696 and β = 0.442790.

Page 64: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Test Set

Page 65: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Figure: Dependence of the vocabulary size V(N) on the sample size N, measured in units of text. The tiny circles represent the empirical growth curve for the Test Set, and the solid line the expected growth curve derived from the new mathematical model.

Page 66: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Good fit to growth curve

The Sample Set is similar in corpus size to the Test Set, and both represent contemporary MEE.

Thus, the values of parameters α and β derived from the Sample Set also apply to the Test Set. The figure above shows that the new model provides a good fit to the growth curve for the Test Set. The value of R square reaches 99.917%.

Page 67: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Random deviation ε

The new growth model is probabilistic in nature, so the appropriate generalization of the model assumes that the expected value of V(N) is a function of N, but that for a fixed sample size N_i the variable V(N)_i differs from its expected value by a random deviation ε (Expression 4.2).

$V(N)_i = \alpha \, N_i^{\beta} \ln N_i + \varepsilon$    (4.2)

Page 68: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Normally distributed random deviation ε

The quantity ε in Expression 4.2 is a random variable, assumed to be normally distributed with E(ε) = 0 and variance(ε) = s^2. The inclusion of ε allows the observed point (N_i, V(N)_i) to fall either above the theoretical model curve (when ε > 0) or below the curve (when ε < 0). For any fixed N_i on the growth curve, V(N)_i is the sum of a constant (the model value at N_i) and a normally distributed random deviation ε; so V(N)_i itself has a normal distribution, with the expected value E(V(N)_i) from fitting the new model being the theoretical mean. These properties are illustrated in the figure above.

Page 69: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Upper and lower bound vocabularies

The upper and lower bound vocabularies of a given point (N_i, V(N)_i) are calculated for a confidence level of 95%, using the tolerance interval formula as follows:

$V(N)_{max},\ V(N)_{min} = E(V(N)_i) \pm (\text{tolerance critical value}) \cdot s$

Page 70: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Ten randomly selected sets of samples 1/2

The standard deviation s determines the extent to which each normal curve of vocabulary spreads out about its mean value E(V(N)_i). When s is small, the observed point (N_i, V(N)_i) probably falls quite close to the theoretical growth curve, whereas observations may deviate considerably from their expected vocabulary sizes when s is large (Devore, 2000).

Page 71: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Ten randomly selected sets of samples 2/2

In order to calculate the standard deviation for each point (N, V(N)) in units of text, the text samples of DMMEE are randomly selected and rearranged into ten sets of samples, from which ten sets of vocabulary sizes and their corresponding sample sizes are obtained and displayed in Appendix-C.

According to Devore (2000), when n_sample = 10 (number of samples), the tolerance critical value is 3.379 for capturing at least 95% of the observed values in a normal population distribution.

Page 72: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Two-sided tolerance interval

SPSS calculates the two-sided tolerance interval for each point (N_i, V(N)_i) on the expected growth curve with a confidence level of 95%.

Table 4.1 displays part of the statistics, including the standard deviation s and the corresponding upper and lower bound vocabulary sizes.

Page 73: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

V(N)_max and V(N)_min

Table 4.1 lists the sample size N, theoretical vocabulary mean E(V(N)), standard deviation s, upper bound vocabulary size V(N)_max and lower bound vocabulary size V(N)_min for each point (N, V(N)) on the expected growth curve derived from the new model.

Page 74: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Table 4.1

Page 75: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Empirical growth curve 1/3

The following figure plots the empirical growth curve for the Test Set and the upper and lower bound values using the new mathematical model.

All the observed vocabulary sizes fall within the 95% two-sided tolerance bounds (dashed lines), with no exceptions.

Page 76: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Empirical growth curve 2/3

Page 77: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Empirical growth curve 3/3

Figure: V(N) as a function of N, measured in units of text. The tiny circles represent the empirical growth curve for the Test Set, the solid line the theoretical expectations using the new model, and the dashed lines the upper and lower bound curves for a confidence level of 95%.

Page 78: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

5. Newly Occurring Vocabulary Distributions in Cumulative Texts

Page 79: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Law of decreasing new vocabulary growth 1/3

Feng (1988, 1996) proposed a “law of decreasing new vocabulary growth” in his study of terminology.

He noted that as the number of terminology entries increases, the number of high frequency words grows correspondingly and the probability of new words occurring decreases. At this point, although the number of terminology entries keeps growing, the growth rate of the total number of new words slows down. The repeated occurrences of high frequency words indicate a tendency of decreasing new vocabulary growth.

Page 80: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Law of decreasing new vocabulary growth 2/3

The “law of decreasing new vocabulary growth” applies not only to terminology systems but also to the process of reading written texts.

When starting to read a text written in a less familiar language, one may encounter a large body of new words. As more text is read, the growth rate of new words gradually falls. If the reader can master the new words he comes across, reading will become much easier.

Page 81: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Law of decreasing new vocabulary growth 3/3

There is a functional relationship between the number of new words (W) and text capacity (T):

W = Φ(T)

As text capacity (T) increases, the growth rate of the number of new words (W) begins to fall and the function curve flattens, taking the form of a convex parabola in the rectangular coordinate system.

This function curve describes the process of vocabulary growth in the reading of written texts. It is a mathematical account of the law of vocabulary change in a reading process.

Page 82: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Corpus evidence

The corpus evidence supports Feng’s theory and suggests that as the sample size N increases, the new vocabulary V(N)_new that an additional input text produces decreases. At any given N_i, the new vocabulary V(N)_new of a sample with N word tokens can be estimated with Expression 5.1.

(5.1)

Page 83: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

V(N)_new as a function of the sample size N

The following figure plots the newly occurring vocabulary size V(N)_new as a function of the sample size N for the Sample Set and the Test Set.

Page 84: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Scatter-grams of V(N)_new against N

Figure: Scatter-grams of V(N)_new against N for the Sample Set (left panel) and the Test Set (right panel).

Page 85: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Decreasing process description 1/2

The left panel for the Sample Set and the right panel for the Test Set reveal surprisingly similar distributions of newly occurring vocabulary sizes.

The new vocabulary V(N)_new is a rapidly decreasing function of the sample size N, with a long tail of low values of V(N)_new.

Initially, V(N)_new drops sharply to a near-steady level, which then declines slowly and smoothly.

Then, as N increases, the empirical values of V(N)_new are widely dispersed about this stable level until the end of the scatter plot.

Page 86: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Decreasing process description 2/2

Consider the left panel. The first input text produces more than 500 new word types.

Then V(N)_new decreases sharply as N increases. When the sample size reaches 62,305 word tokens, the value of V(N)_new has fallen below 150. From N = 62,305 onwards to the end of the scatter plot, the newly occurring vocabulary size decreases very slowly and the observed values of V(N)_new fluctuate.

The last input text produces 6 new word types, which shows that new vocabulary still occurs even when the sample size has reached 500,000 word tokens.

Page 87: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

6. Examples of Pedagogical Applications

Page 88: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Application in EFL

This research is significant for explicit EFL teaching and learning.

With the help of the new growth model, teachers and students can make reliable estimates not only of the vocabulary size and its interval for a given textbook but also of the volume of text needed to produce a particular vocabulary size.

Page 89: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Example in MEE

For example, suppose that a textbook of MEE consists of 50 input texts, each between 500 and 2,000 word tokens in size, totalling 50,000 word tokens. The vocabulary size of the whole textbook is estimated by using the new growth model, with α = 3.5297 and β = 0.4428:

$E[V(N)] = 3.5297 \times \ln(50000) \times 50000^{0.4428} \approx 4599$

Page 90: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Estimating V(N)

The 95% tolerance interval is calculated for capturing all the possible vocabulary sizes, with critical value = 3.379 and s = 118.1479:

$E[V(N)] \pm 3.379 \times 118.1479$

Page 91: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

Estimating V(N)

Thus we are “95% certain” that the vocabulary size of the whole textbook lies between 4199 and 4998 word types.

If 5% of the total vocabulary input fails to be acquired, then the two-sided tolerance interval for the acquired vocabulary is

[4199 - 4199×5%, 4998 - 4998×5%],

that is, [3989, 4748].
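The textbook example can be retraced numerically. The sketch below assumes the model form α · ln(N) · N^β with natural logarithm (inferred from the description of the new model) and a symmetric interval of the form mean ± critical value × s; the function names are invented for the illustration.

```python
import math

ALPHA, BETA = 3.5297, 0.4428      # rounded parameters used on the slide
CRITICAL_VALUE = 3.379            # tolerance critical value for ten samples
S = 118.1479                      # standard deviation quoted for N = 50,000


def expected_vocabulary(n):
    """ASSUMPTION: model form alpha * ln(n) * n**beta with natural log."""
    return ALPHA * math.log(n) * n ** BETA


def tolerance_interval(mean, s, k=CRITICAL_VALUE):
    """Two-sided interval mean +/- k * s."""
    return mean - k * s, mean + k * s


mean = expected_vocabulary(50_000)
low, high = tolerance_interval(mean, S)

# If 5% of the vocabulary input is not acquired, scale both bounds by 95%.
acquired = (0.95 * low, 0.95 * high)

print(f"E[V(N)] = {mean:.1f}")
print(f"tolerance interval = [{low:.1f}, {high:.1f}]")
print(f"acquired vocabulary = [{acquired[0]:.1f}, {acquired[1]:.1f}]")
```

Under these assumptions the bounds come out close to the slide’s 4199 to 4998 and [3989, 4748]; small differences are down to rounding.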

Page 92: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

E[V(N)_new]

If a new text with 1000 tokens is added to the textbook, the newly occurring types that this input text produces are estimated according to Expression 5.1.

Page 93: Inter-textual Vocabulary Growth Patterns  for Marine Engineering English

V(N)_new

The 95% tolerance interval is calculated for the newly occurring vocabulary, with critical value = 3.379 and s = 10.1751.

Estimating vocabulary size

Another example is to estimate the sample size needed for a given vocabulary size.

The Engineering English Competence Examination has a fairly high requirement on vocabulary proficiency; the minimum is 4000 word types. To produce sufficient vocabulary, the MEE textbook has to reach a certain size.

Estimating E[N]

The expected size of the textbook is then calculated from the new growth model.
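The model has no simple closed-form inverse, so one way to recover N from a target vocabulary is numerical root finding. The sketch below uses bisection and assumes the model form V(N) = α · N^β · ln N with the α and β values quoted in the earlier example. The names `vocab_size` and `tokens_needed`, the search bounds and the tolerance are all illustrative choices; the slides do not state how the inversion was actually done.

```python
import math

def vocab_size(n_tokens, alpha=3.5297, beta=0.4428):
    """Assumed model: V(N) = alpha * N**beta * ln(N)."""
    return alpha * n_tokens ** beta * math.log(n_tokens)

def tokens_needed(target_types, lo=2.0, hi=1e7, tol=0.5):
    """Bisection search; relies on V(N) increasing between lo and hi."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if vocab_size(mid) < target_types:
            lo = mid   # too few types: the text must be longer
        else:
            hi = mid   # enough types: try a shorter text
    return (lo + hi) / 2

expected_tokens = tokens_needed(4000)
print(round(expected_tokens))  # roughly 38,500 tokens
```

The result lies close to the midpoint of the tolerance interval the talk reports for N.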

Estimating N

With critical value = 3.379 and s = 2231.342, the 95% tolerance interval is [31010, 46090].

Thus we are at least “95% confident” that the MEE textbook has to reach a size of between 31,010 and 46,090 tokens in order to produce 4,000 word types.
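The reported bounds are consistent with a centre of 38,550 tokens and a half-width equal to the critical value times s. In the sketch below, the centre is taken as the midpoint of the reported interval, which is an assumption, and `tolerance_interval` is an illustrative helper name.

```python
def tolerance_interval(centre, critical_value, s):
    """Two-sided interval: centre -/+ critical_value * s (assumed form)."""
    half_width = critical_value * s
    return centre - half_width, centre + half_width

# Assumption: 38,550 is the midpoint of the interval [31010, 46090]
lower, upper = tolerance_interval(38_550, 3.379, 2231.342)
print(round(lower), round(upper))  # 31010 46090
```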

7. Conclusion

Two fundamental issues

This talk explores two fundamental issues concerning the inter-textual vocabulary growth patterns for MEE.

They are vocabulary growth models and newly occurring vocabulary distributions in cumulative texts.

First issue

The first issue explores the MEE vocabulary growth. The growth curve of vocabulary V(N) is not a linear function of the sample size N. Initially, V(N) increases quickly, but the growth rate decreases as N increases.

Four mathematical models (Brunet’s model, Guiraud’s model, Tuldava’s model and Herdan’s model) are tested against the empirical growth curve for MEE.

A new growth model is constructed by multiplying the logarithmic function and the power law. Parameter α is the coefficient of the whole expression; parameter β is the exponent of the power-function part of the model.
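The slowing growth can be illustrated with the model itself. The sketch below assumes the form V(N) = α · N^β · ln N and reuses the α and β values from the worked textbook example. The derivative follows from the product rule; the function names and the chosen sample sizes are illustrative.

```python
import math

ALPHA, BETA = 3.5297, 0.4428  # values quoted in the worked example

def vocab_size(n_tokens):
    """Assumed model: V(N) = alpha * N**beta * ln(N)."""
    return ALPHA * n_tokens ** BETA * math.log(n_tokens)

def growth_rate(n_tokens):
    """dV/dN = alpha * N**(beta - 1) * (beta * ln(N) + 1), by the product rule."""
    return ALPHA * n_tokens ** (BETA - 1) * (BETA * math.log(n_tokens) + 1)

sample_sizes = (1_000, 10_000, 100_000, 500_000)
rates = [growth_rate(n) for n in sample_sizes]
for n, rate in zip(sample_sizes, rates):
    # New word types gained per additional token at each corpus size
    print(n, round(vocab_size(n)), round(rate, 4))
```

The printed rates shrink as N grows, which is the non-linear behaviour described above.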

First issue

The new model describes the empirical growth curve best of the five models for MEE, with the highest R square and the lowest mean square.

The upper and lower tolerance bounds are calculated to capture at least 95% of the possible vocabulary sizes in the normal population distribution.

The residuals from fitting the five growth models are compared and analyzed; the respective values fall within the interval of [–223, 397] for Brunet’s model, [–305, 299] for Guiraud’s model, [–209, 300] for Tuldava’s model, [–150, 220] for Herdan’s model, and [–160, 160] for the new model.

Second issue

The second issue explores the distributions of newly occurring vocabulary sizes in cumulative texts.

The new vocabulary V(N)new is a rapidly decreasing function of the sample size N with a long tail of low values of V(N)new.

Initially, V(N)new decreases sharply to a certain point. Then, as N increases, V(N)new decreases very slowly and its observed values are fairly widely dispersed.

Further studies

The present study has been exploratory in nature, and some difficult issues have not yet been tackled adequately.

For example, the new vocabulary growth model was constructed on the basis of studies of the Sample Set, which consists of 480 individual texts, totalling about 500,000 word tokens. The model has been verified to provide a good fit to the empirical vocabulary sizes for sample sizes of not more than 500,000 word tokens. However, whether the new growth model can be extrapolated beyond 500,000 tokens needs further consideration and verification.

References

Baayen, R. H. (2001). Word Frequency Distributions. Dordrecht: Kluwer Academic Publishers. ISBN 0-7923-7027-1.

Brunet, E. (1978). Le Vocabulaire de Jean Giraudoux. TLQ, Volume 1, Slatkine, Geneve.

Carroll, J. B. (1967). Computational Analysis of Present-Day American English. Providence: Brown University Press.

Carroll, J. B. (1969). A Rationale for an Asymptotic Lognormal Form of Word Frequency Distributions. Princeton: Research Bulletin, Educational Testing Service.

Devore, J. (2000). Probability and Statistics. Pacific Grove: Brooks/Cole.

Ellis, Rod. (2004). Second Language Acquisition (5th ed.). Shanghai: Foreign Language Teaching Press.

References

Fan, Fengxiang. (2006). A Corpus-Based Empirical Study on Inter-textual Vocabulary Growth. Journal of Quantitative Linguistics, Volume 13, Number 1: 111-127. DOI: 10.1080/09296170500500603.

Feng, Zhiwei. (1988). FEL Formula -- Economical Law in the Formation of Terms. Social Sciences in China, No 4. Beijing.

Feng, Zhiwei. (1996). Introduction of Modern Terminology, The Language Publishing House, Beijing.

Guiraud, H. (1990). Les Caracteres Statistiques du Vocabulaire. Paris: Presses Universitaires de France.

Herdan, G. (1964). Quantitative Linguistics. Butterworths, London.

Kennedy, G. (2000). An Introduction to Corpus Linguistics. Beijing: Foreign Language Teaching and Research Press.

Schmitt, N. & Meara, P. (1997). Researching vocabulary through a word knowledge framework: Word associations and verbal suffixes. Studies in Second Language Acquisition, 19, 17-36.

References

Schmitt, N. & McCarthy, M. (2002). Vocabulary: Description, Acquisition and Pedagogy. Shanghai: Shanghai Foreign Language Education Press.

Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Smith, Neil. (1999). Chomsky: Ideas and Ideals. Cambridge: Cambridge University Press.

Summers, D. (1991). Longman/Lancaster English Language Corpus: Criteria and Design. Harlow: Longman.

Tuldava, J. (1996). The Frequency Spectrum of Text and Vocabulary. Journal of Quantitative Linguistics, 3, 38-50.

Zipf, G. K. (1935). The Psycho-Biology of Language. Boston: Houghton Mifflin.

Zipf, G. K. (1949). An Introduction to Human Ecology. New York.

http://www.ccpo.odu.edu/SEES/ozone/class/Chap_9/9_4.htm
