10-1.1.1.25.6859

7/31/2019 10-1.1.1.25.6859

1/7

Statistical Consistency of Keywords Dictionary Parameters

Grigorij Martynenko

Department of Computational Linguistics, St. Petersburg State University

Universitetskaya nab. 11, St. Petersburg 199034, [email protected]

Abstract

The construction of an optimal keywords dictionary is one of the most important tasks in developing thesaurus-based Information Retrieval Systems and some other NLP applications. The problem is: which parameters of keywords dictionaries may be rated as statistically consistent (that is, not dependent on sample size)? Analysis of recent scientific works and the results of our own investigations allowed us to determine a rather complete list of parameters that may be used for the description of texts and language resources. Each of these parameters was subjected to a consistency test. The methodology of the consistency test was elaborated using the method of least squares with a number of principal modifications. Our main results are the following: 1) Theoretically, all analyzed parameters have either upper or lower limits. That means that in principle they are statistically consistent. However, for most parameters actual consistency is achieved only at very large sample sizes. 2) The most consistent parameters turned out to be the order coefficient, the logarithmic concentration coefficient, entropy, the rank golden section, and the rank mean. Their rapid convergence to the limit values makes it possible to perform classification procedures effectively on data of arbitrary size. 3) The proposed approximation model makes it possible to forecast the values of all parameters for any sample size.

1. Introduction. The problem of Statistical Consistency

One of the most important tasks in developing thesaurus-based Information Retrieval Systems, as well as a variety of other NLP applications, is the construction of an optimal keywords dictionary. Usually such a dictionary is built by means of free document indexing. As a result of this approach, the system-level statistical parameters of a keywords dictionary (e.g., its size, rank mean frequency, entropy, etc.) are functions of the size of the corpus that was subjected to the indexing procedure. Here the problem of the statistical consistency of parameters arises: which statistical parameters may be rated as consistent in this case? In order to solve it we have to answer two essential questions: 1) which system parameters of the keywords dictionary have a theoretical (or empirical) limit value, and which do not? 2) if this limit really exists, how fast does each of these parameters converge to it? In fact, for the construction and evaluation of keywords dictionaries obtained from text data of arbitrary size, only those statistical parameters that converge rapidly to their limit values may properly be used.

2. Methods of data systematization in keywords frequency dictionaries

A keywords dictionary is usually a lexicographic compilation, each entry of which contains the name of a lexical unit and the accompanying statistical data of different kinds (e.g., the absolute frequency of the lexeme concerned, its frequency rank, the quantity of lexical units with the identical frequency, etc.). By analyzing the information accumulated in keywords dictionaries, it is possible to build statistical distributions, whose concrete types are determined by which particular information functions as the dependent and the independent parameter. The main distributions are the following: polynomial, rank, and spectral. In the polynomial distribution the varying name of the lexical unit plays the role of the independent parameter, whereas its frequency acts as the dependent parameter. In the rank distribution the independent parameter is the rank of the lexical unit, and the dependent one is its frequency. In the spectral distribution the frequency of the lexical unit serves as the independent parameter, and the quantity of lexical units with the identical frequency functions as the dependent one. In the last two distributions the name parameter simply disappears.
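The three distributions described above can be derived from the same list of keyword occurrences. The following is a minimal sketch, assuming the input is a plain list of tokens; the function names are hypothetical and are not taken from the paper.

```python
from collections import Counter


def polynomial_distribution(tokens):
    """Map each lexical unit (name) to its absolute frequency."""
    return dict(Counter(tokens))


def rank_distribution(tokens):
    """Return (rank, frequency) pairs, rank 1 being the most frequent unit.

    The names of the lexical units are dropped, as in the text.
    """
    frequencies = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(frequencies, start=1))


def spectral_distribution(tokens):
    """Return (frequency, number of units with that frequency) pairs."""
    spectrum = Counter(Counter(tokens).values())
    return sorted(spectrum.items())


tokens = "ship net ship fleet net ship trawl".split()
print(polynomial_distribution(tokens))
print(rank_distribution(tokens))
print(spectral_distribution(tokens))
```

In this toy example the rank and spectral distributions carry no unit names at all, which illustrates why the name parameter disappears in them.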


The detailed list of data systematization types that are used (or may be used) in lexicostatistical practice is presented in Table 1. The table reveals that when cumulative distributions are converted into rank ones, the values of the random variable and the statistical weights trade places: the variants of the random variable become the values of the dependent variable (i.e. the function), whereas the statistical weights become the values of the independent variable (i.e. the argument). In addition, the following fact deserves attention: when data are converted into rank distributions, the accumulated quantities of units turn into a sequence of natural numbers, i.e. into a rank sequence or rank scale.

Systematization type | Values of random variable | Statistical weights
---------------------|---------------------------|--------------------
Polynomial distribution | name of lexical unit | frequency of lexical unit (F)
Spectral distribution | frequency of lexical unit (F) | quantity of lexical units with the corresponding frequency
Spectral-cumulative distribution | frequency of lexical unit (F) | quantity of lexical units whose frequency does not exceed F
Spectral-decumulative¹ distribution | frequency of lexical unit (F) | quantity of lexical units whose frequency exceeds F
Increasing rank distribution | quantity of lexical units whose frequency does not exceed F (rank of the given lexical unit) | frequency of lexical unit (F)
Decreasing rank distribution | quantity of lexical units whose frequency exceeds F (rank of the given lexical unit) | frequency of lexical unit (F)
Cumulative-increasing rank distribution | rank of the given lexical unit | accumulated number of lexical unit occurrences
Cumulative-decreasing rank distribution | rank of the given lexical unit | accumulated number of lexical unit occurrences

Table 1. Systematization types of lexicostatistical data.

Drawing a parallel with mechanics, the rank distribution may be interpreted as a probabilistic mass that is equal to unity and is distributed along the rank axis so that the corresponding weights (probabilities p_i) are concentrated at particular points (coordinates r_i). To continue this mechanical analogy, the mathematical expectation (or its empirical analogue, the arithmetic mean), calculated with rank values as the random variable and frequencies (or their analogues) as the corresponding statistical weights, may be regarded as the center of gravity of the rank distribution.
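The center-of-gravity reading of the rank mean can be made concrete with a short calculation. This is a minimal sketch, assuming a decreasing rank distribution (rank 1 is the most frequent unit) and relative frequencies used as the weights p_i; the function name `rank_mean` is hypothetical.

```python
def rank_mean(frequencies):
    """Weighted mean of ranks, with relative frequencies as weights.

    Ranks are assigned in decreasing order of frequency, so the result
    is the center of gravity of the unit mass along the rank axis.
    """
    ordered = sorted(frequencies, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("at least one non-zero frequency is required")
    return sum(rank * f for rank, f in enumerate(ordered, start=1)) / total


# Frequencies 3, 2, 1, 1 occupy ranks 1..4:
# (1*3 + 2*2 + 3*1 + 4*1) / 7 = 2.0
print(rank_mean([3, 2, 1, 1]))
```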

It is important to note that the connections shown in Table 1 between the different possible ways of presenting the same lexicostatistical information are usually not taken into account by researchers (not only linguists, but also biologists, sociologists, psychologists, economists, scientometrists, and others). In mathematical statistics, by contrast, similar conversions are clearly interpreted (Evans, Hastings & Peacock, 1993).

3. Types of scales and corresponding statistic parameters

Mathematical statistics distinguishes different scales (quantitative, ordinal, nominal) and has developed a number of data processing methods, each applicable only to the appropriate scale. The most advanced system of techniques has been elaborated for quantitative parameters. Based to a considerable extent on the theory of moments, it includes a developed system of means and variances, characteristics of distribution shape, etc., and also makes effective use of a system of order statistics (mode, median, quartiles, etc.).

¹ The term "decumulative distribution" is used in econometrics in the study of population income (Lange, 1964). The values of the random variable reflect the income level, whereas the number of persons whose income exceeds the given level serves as the statistical weight.

In the previous section it was shown that the central distributions used in processing frequency dictionaries are the rank, spectral, and polynomial ones. Though rank and spectral distributions have the outward appearance of quantitative scales, they are characterized by an extremely large variance of parameters on both the rank and the frequency scale. This fact leads some researchers to doubt that the theory of moments can be applied here (because the moments tend to infinity), and therefore to suggest instead other characteristics that do not depend on the sample size (Khajtun, 1983; Shrejder & Sharov, 1982). As for the polynomial distribution, the theory of moments cannot be applied to it even in principle, since its variation is qualitative in nature.

Analysis of recent scientific works and the results of our own investigations allowed us to determine a rather complete list of parameters that may be used for the description of keywords dictionaries (and of the lexicostatistical structure of text as a whole). Table 2 presents the list of these parameters, subdivided into three groups according to the type of scale.

Nominal scale:
    Mode (Mo)
    Dictionary size (N)
    Maximal frequency (f_max)
    Entropy (E)
    Maximal entropy (E_max)
    Order coefficient (E/E_max)

Quantitative (frequency) scale:
    Mean frequency (f_ave)
    Geometric mean frequency (f_g ave)
    Frequency variance coefficient (V_f)
    Frequency median (Me_f)
    Golden section (G_f)
    Diversity coefficient² (S)

Ordinal (rank) scale:
    Rank mean (r_ave) (differentiation coefficient)
    Rank variance coefficient (V_r)
    Rank median (Me) (equilibrium measure)
    Rank golden section (G_r)
    Rank mean deviation (d_r)
    Variation coefficient on d_r (V_r)
    Coefficient of concentration (r_ave/N)
    Logarithmic concentration coefficient (k = log r_ave / log N)
    Zipf's indicator (the exponent of the rank r in the formula of Zipf's law, in which the frequency f(r) equals a constant c divided by r raised to this exponent)

Table 2. Statistical parameters that may be used for the description of the lexicostatistical structure of text.

Among the parameters listed in Table 2, the following are used most often: dictionary size (N), maximal frequency (f_max), mean frequency (f_ave), entropy (E), and Zipf's indicator. Our list is noticeably wider and includes a large number of rank parameters, which have been intensively explored in recent years by the author of this paper (Martynenko, 1988, 1989) and his students (Grebennikov, 1998). Moreover, for some of these parameters a theoretical argument for their statistical consistency has been given (Martynenko & Fomin, 1989).
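Several of the listed parameters can be computed directly from the list of unit frequencies. The sketch below is illustrative only: it assumes Shannon entropy in bits, takes the maximal entropy to be log2 N (the entropy of N equiprobable units), and uses ranks in decreasing order of frequency. None of these conventions is stated in the paper, and the function name is hypothetical.

```python
import math


def dictionary_parameters(frequencies):
    """Compute a few parameters of a frequency dictionary."""
    ordered = sorted(frequencies, reverse=True)
    n = len(ordered)                      # dictionary size N
    if n < 2:
        raise ValueError("need at least two lexical units")
    total = sum(ordered)                  # sample size in word usages
    probs = [f / total for f in ordered]

    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(n)            # assumption: E_max = log2 N
    rank_mean = sum(r * p for r, p in enumerate(probs, start=1))

    return {
        "dictionary_size": n,
        "max_frequency": ordered[0],
        "mean_frequency": total / n,
        "diversity": sum(1 for f in ordered if f == 1),  # units with frequency 1
        "entropy": entropy,
        "order_coefficient": entropy / max_entropy,      # E / E_max
        "rank_mean": rank_mean,
        "concentration": rank_mean / n,                  # r_ave / N
        "log_concentration": math.log(rank_mean) / math.log(n),
    }


params = dictionary_parameters([3, 2, 1, 1])
for name, value in params.items():
    print(name, value)
```

The logarithmic concentration coefficient is a ratio of logarithms, so its value does not depend on the base chosen.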

4. Test for consistency of statistical parameters

The methodology of the consistency test was elaborated using the method of least squares with a number of principal modifications, made necessary by the complicated character of the dependence of the parameters on the sample size (Martynenko, 1988). In our hypothesis test the null hypothesis was stated as "the parameter converges to its limit value" (the alternative hypothesis being "the parameter increases or decreases without limit"). All parameters listed in Table 2 were subjected to this consistency test with the use of approximating models.

² The diversity coefficient (S) is the number of words whose absolute frequency equals 1.


The approximation was carried out on the following material: 1) the keywords distribution in the Information Retrieval System "Ships of the Fishing Fleet"; 2) the distribution of lexical units in the frequency dictionary on radio electronics (Alekseev, 1965); and 3) the distribution of lexical units in the frequency dictionary of the English language (Kučera & Francis, 1967).

A broad spectrum of theoretical functions was used to approximate the empirical distributions. For example, for increasing dependencies the following functions of unlimited growth were used:

$y = a x^{b}$ (power function)

$y = a e^{cx}$ (exponential function)

$y = a (\log x)^{b}$ (logarithmic function)

Taking the logarithm of these functions easily converts them into linear dependencies, which allows the method of least squares to be used for approximation in its canonical form.
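As an illustration of this log-linearization, the power function y = a x^b becomes ln y = ln a + b ln x, a straight line in the variables (ln x, ln y). The following is a minimal sketch of such a fit with ordinary least squares; the function names are hypothetical and the data are synthetic.

```python
import math


def fit_line(us, vs):
    """Ordinary least squares for v = intercept + slope * u."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    slope = s_uv / s_uu
    return mean_v - slope * mean_u, slope


def fit_power(xs, ys):
    """Fit y = a * x**b through the linear form ln y = ln a + b ln x."""
    ln_a, b = fit_line([math.log(x) for x in xs],
                       [math.log(y) for y in ys])
    return math.exp(ln_a), b


# Synthetic data generated from y = 3 * x**0.5 (not from the paper).
xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 0.5 for x in xs]
a, b = fit_power(xs, ys)
print(a, b)
```

Note that least squares on the logarithms minimizes relative rather than absolute deviations, which is one reason a separate goodness-of-fit check on the original scale is useful.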

The list of functions of asymptotic growth is more extensive. Here we mention just some of them:

$y = k - a/x^{b}$ (power function)

$y = k - k/e^{cx}$ (exponential function)

$y = k - k e^{-q x^{b}}$ (combination of power and exponential functions, the Weibull function)

$y = k/e^{(c/x)^{b}}$ (one more combination of power and exponential functions)

$y = k/(1 + a/x^{b})$ (fractional power function)

$y = k/(1 + q/e^{cx})$ (fractional exponential function)

$y = k/(1 + a/(\ln x)^{b})$ (fractional logarithmic function)

By taking logarithms once or repeatedly, the enumerated (and some other) functions may be transformed into linear ones. The linear variants of some of the above-mentioned functions are as follows:

$\ln\frac{k}{k-y} = \ln\frac{k}{a} + b\ln x$ (power function)

$\ln\ln\frac{k}{k-y} = \ln q + b\ln x$ (Weibull function)

$\ln\left(\frac{k}{y} - 1\right) = \ln q - cx$ (fractional exponential function)

It is known that setting up the system of normal equations for a linear dependency is a routine task, and solving it for the unknown parameters presents no particular difficulty. However, in our case one of the parameters (the asymptote k) enters the dependent variable, which makes it impossible to use the method of least squares in its pure form. We propose the following solution to this problem. First, we assigned to the asymptote a series of concrete values with a fixed step, beginning from the maximum parameter value obtained in the empirical series. Then, at each step of the processing cycle, for the fixed asymptote value we calculated the values of the other parameters by the method of least squares, and at once tested, by means of the χ² criterion at the 0.05 significance level, the hypothesis that the differences between the empirical and theoretical distributions are not statistically significant. If a function disagreed with the empirical data at every step, we abandoned it and tested another function, trying to determine the range of its values that meets the requirements of the χ² criterion.
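The scanning procedure can be sketched for the Weibull function y = k - k e^{-q x^b}, whose linear form for a fixed k is ln ln(k/(k-y)) = ln q + b ln x. The code below is a simplified illustration rather than the author's implementation. It assumes a Pearson-type χ² statistic, starts the scan at a caller-supplied value strictly above the largest observation (at the maximum itself the logarithm is undefined), and simply keeps the asymptote with the smallest statistic. All function names are hypothetical.

```python
import math


def fit_line(us, vs):
    """Ordinary least squares for v = intercept + slope * u."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    slope = s_uv / s_uu
    return mean_v - slope * mean_u, slope


def weibull(x, k, q, b):
    """Asymptotic growth curve y = k - k * exp(-q * x**b)."""
    return k - k * math.exp(-q * x ** b)


def fit_weibull_for_k(xs, ys, k):
    """Least-squares estimates of q and b for a fixed asymptote k."""
    us = [math.log(x) for x in xs]
    vs = [math.log(math.log(k / (k - y))) for y in ys]
    ln_q, b = fit_line(us, vs)
    return math.exp(ln_q), b


def chi_square(observed, expected):
    """Pearson statistic: sum of (O - E)**2 / E (an assumed form)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def scan_asymptote(xs, ys, k_start, step, n_steps):
    """Try k = k_start, k_start + step, ... and keep the best fit."""
    if k_start <= max(ys):
        raise ValueError("k_start must exceed the largest observed value")
    best = None
    for i in range(n_steps):
        k = k_start + i * step
        q, b = fit_weibull_for_k(xs, ys, k)
        stat = chi_square(ys, [weibull(x, k, q, b) for x in xs])
        if best is None or stat < best[0]:
            best = (stat, k, q, b)
    return best  # (chi-square statistic, k, q, b)


# Synthetic, noise-free data generated from a known curve.
xs = [1, 2, 3, 4, 5, 5.64]
ys = [weibull(x, 1120, 0.78, 0.75) for x in xs]
stat, k, q, b = scan_asymptote(xs, ys, k_start=1060, step=10, n_steps=20)
print(k, q, b)
```

In practice the returned statistic would be compared with the χ² critical value at the 0.05 level for the chosen number of degrees of freedom; how the degrees of freedom were counted is not stated in the text, so that step is left to the caller.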

All parameters listed in Table 2 were subjected to this processing. It turned out that in the overwhelming majority of cases the Weibull function showed the best agreement with the empirical data, though the degree of agreement differs for different parameters. The parameters that are most consistent (for the Weibull function) are the following (in order of decreasing consistency):


1. Order coefficient (E/E_max)

2. Logarithmic concentration coefficient (k = log r_ave / log N)

3. Entropy (E)

4. Rank golden section (G_r)

5. Rank mean (r_ave)

For these parameters the Weibull function showed the best agreement with the empirical distribution.

The other parameters agree less well with the empirical data. It is noteworthy that rank indices dominate among the most consistent parameters, together with parameters connected in some way with entropy. Taken together, these parameters characterize the investigated text population from the point of view of 1) the redistribution of functional activity among the constituent elements of the population, 2) the concentration and dispersion of this activity, and 3) the share of this activity held by each element and by groups of elements. Thus these statistical characteristics are in fact system-forming ones.

Table 3 demonstrates the approximation potential of the Weibull function and its forecasting abilities using two parameters as an example: dictionary size and rank mean. The Weibull approximations for the demonstration material are the following:

Sample size (L), thousands of word usages | N, empirical | N, theoretical | r_ave, empirical | r_ave, theoretical
------------------------------------------|--------------|----------------|------------------|-------------------
Keywords dictionary of the Information Retrieval System "Ships of the Fishing Fleet"
1 | 610 | 606.6 | 106.1 | 104.9
2 | 819 | 818.3 | 147.2 | 142.6
3 | 932 | 930.8 | 161.0 | 167.8
4 | 1001 | 996.7 | 172.5 | 170.5
5 | 1034 | 1037.5 | 174.1 | 175.8
5.64 | 1049 | 1055.5 | 178.8 | 177.9
10 (forecast) | | 1106 | | 182.4
Frequency dictionary on radio electronics
50 | 5399 | 5421 | 621 | 626
100 | 7853 | 7827 | 751 | 732
150 | 9361 | 9419 | 754 | 763
200 | 10582 | 10565 | 772 | 777
500 (forecast) | | 13672 | | 780
1000 (forecast) | | 14880 | | 784
Frequency dictionary of the English language by Kučera & Francis
10.0 | 3009 | 3133 | 546 | 543
101.0 | 13706 | 13711 | 1256 | 1317
253.5 | 23655 | 23878 | 1937 | 1814
1014.2 | 50406 | 50492 | 2730 | 2773
10000.0 (forecast) | | 104922 | | 4371
100000.0 (forecast) | | 112500 | | 4887

Table 3. Dependence of keywords dictionary parameters (dictionary size N and rank mean r_ave) on sample size L.

The theoretical formulas for the dependence of dictionary size N and rank mean r_ave on sample size L turned out to be as follows:


1. Keywords dictionary of the Information Retrieval System "Ships of the Fishing Fleet"

$N = 1120 - 1120\,e^{-0.78 L^{0.75}}$

$r_{ave} = 183 - 183\,e^{-0.85 L^{0.83}}$    (1)

2. Frequency dictionary on radio electronics (Alekseev, 1965)

$N = 15250 - 15250\,e^{-0.027 L^{0.713}}$

$r_{ave} = 780 - 780\,e^{-0.0766 L^{0.780}}$

3. Frequency dictionary of the English language (Kučera & Francis, 1967)

$N = 112500 - 112500\,e^{-0.00618 L^{0.660}}$

$r_{ave} = 4950 - 4950\,e^{-0.0424 L^{0.433}}$
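The fitted formulas can be evaluated directly to reproduce theoretical and forecast values. This is a minimal sketch, assuming L is measured in thousands of word usages as in Table 3; the function and constant names are hypothetical, and the coefficients are the rounded values printed above.

```python
import math


def weibull_growth(L, k, q, b):
    """Evaluate y = k - k * exp(-q * L**b) for sample size L."""
    return k - k * math.exp(-q * L ** b)


# (k, q, b) triples copied from the printed formulas.
SHIPS_N = (1120, 0.78, 0.75)
SHIPS_RANK_MEAN = (183, 0.85, 0.83)
ENGLISH_N = (112500, 0.00618, 0.660)

# Forecasts for sample sizes beyond the observed range.
print(round(weibull_growth(10, *SHIPS_N)))             # dictionary size at L = 10
print(round(weibull_growth(10, *SHIPS_RANK_MEAN), 1))  # rank mean at L = 10
print(round(weibull_growth(10000, *ENGLISH_N)))        # dictionary size at L = 10000
```

With these coefficients the three values come out close to the forecast entries 1106, 182.4, and 104922 of Table 3.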

Table 3 demonstrates that the rank means of keywords dictionaries stabilize at a comparatively small sample size. The best way to verify this is to calculate the sample size that corresponds to a given value of the rank mean. Let us consider the example in which the rank mean amounts to 99% of the theoretical (maximum possible) value. Substituting the corresponding data into formula (1), we obtain L = 10730 keyword usages, which corresponds to approximately 2000 documents.

The rank means of standard frequency dictionaries created for specialized scientific and technical text corpora also stabilize rather quickly (taking into account, of course, their greater lexical diversity as compared with that of ordinary keywords dictionaries). For the frequency dictionary on radio electronics by Alekseev, the 99% level of the rank mean is reached when the sample size reaches 191000 word usages. This is even slightly less than the standard level of 200000 word usages adopted by the All-Russian Research Group "Speech Statistics" (Alekseev, 1975).

As for the dictionary size, all dictionaries reach its 99% level considerably more slowly. For the keywords dictionary "Ships of the Fishing Fleet" this level amounts to 11000 word usages, for the frequency dictionary on radio electronics to 38000000 word usages, and for the frequency dictionary of the English language to 22485000 word usages.
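Sample sizes of this kind follow from solving the Weibull formula for L. Setting y equal to a given share of the asymptote k gives e^{-q L^b} = 1 - share, so L = (-ln(1 - share)/q)^{1/b}, and k itself cancels out. The sketch below assumes L in thousands of word usages and uses the rounded coefficients printed in the formulas, so it need not reproduce every figure quoted in the text; the function name is hypothetical.

```python
import math


def sample_size_for_level(q, b, level=0.99):
    """Return L at which y = k - k*exp(-q * L**b) reaches level * k."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    return (-math.log(1.0 - level) / q) ** (1.0 / b)


# Rank mean of the radio electronics dictionary: q = 0.0766, b = 0.780
radio_rank_mean = sample_size_for_level(0.0766, 0.780)
# Dictionary size of the English frequency dictionary: q = 0.00618, b = 0.660
english_size = sample_size_for_level(0.00618, 0.660)

print(radio_rank_mean, english_size)  # both in thousands of word usages
```

For these two cases the sketch gives roughly 191 thousand and roughly 22.5 million word usages respectively, in line with the levels stated for them.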

Conclusion

Our main results are the following:

1) Theoretically, all keywords dictionary parameters have either upper or lower limits. That means that in principle they are statistically consistent. However, for most parameters actual consistency is achieved only at very large sample sizes, which are hardly attainable in ordinary NLP tasks.

2) The most consistent parameters turned out to be (in decreasing order): the order coefficient (E/E_max), the logarithmic concentration coefficient (k = log r_ave / log N), entropy (E), the rank golden section (G_r), and the rank mean (r_ave) (differentiation coefficient). These parameters, along with some others, provide a tool for describing the system characteristics of any thesaurus or keywords dictionary. Moreover, their rapid convergence to the limit values makes it possible to perform classification procedures effectively on data of arbitrary size.

3) The proposed approximation model makes it possible to forecast the values of all keywords dictionary parameters for any sample size.

4) The higher the thematic and functional specialization of lexical units, the higher the consistency of their parameters. Thus, the statistical consistency of a keywords dictionary is higher than that of a specialized frequency dictionary (in our example, the radio electronics one). At the same time, the parameter consistency of both these dictionaries is higher than that of dictionaries describing a language as a whole (e.g., the frequency dictionary of the English language).

References

Alekseev, P.M. (1965). Frequency dictionary of English sublanguage of electronics. Abstract of Ph.D. thesis. Leningrad.

Alekseev, P.M. (1975). Statistic Lexicography (Typology, compiling and applications of frequency dictionaries). Leningrad: Edition of Leningrad Pedagogical Institute.

Grebennikov, A.O. (1998). To the problem of statistics consistency of the fiction frequency dictionary. Structural and Applied linguistics. Issue 5, St. Petersburg: Edition of St. Petersburg State University. 110–112.

Evans, M., Hastings, N., & Peacock, B. (1993). Statistical Distributions. New York: Wiley.

Khajtun, S.D. (1983). Scientific measurement. Present Conditions and Perspectives. Moscow: Nauka.

Kučera, H., & Francis, W.N. (1967). Computational analysis of present-day American English. Providence.

Lange, O. (1964). Introduction into econometrics. Moscow: Progress.

Martynenko, G.Y. & Fomin, S.V. (1989). Rank moments. Nauchno-technicheskaya informacija. Series 2. N 8. 9–14.

Martynenko, G.Y. (1988). Osnovy stilemetrii (Fundamentals of Stylometrics). Leningrad: Edition of Leningrad State University.

Martynenko, G.Y. (1989). About statistic characteristics of rank distributions. Quantitative linguistics and automatic text analysis. Tartu. 50–68.

Shrejder, Y.A. & Sharov, A.A. (1982). Systems and Models. Moscow: Radio i sviaz.
