7/31/2019 10-1.1.1.25.6859
Statistical Consistency of Keywords Dictionary Parameters
Grigorij Martynenko
Department of Computational Linguistics, St. Petersburg State University, Universitetskaya nab. 11
St. Petersburg 199034, [email protected]
Abstract
The construction of an optimal keywords dictionary is one of the most important tasks in developing
thesaurus-based Information Retrieval Systems and some other NLP applications. The problem is which
parameters of keywords dictionaries may be rated as statistically consistent, that is, not dependent on
sample size. Analysis of recent scientific works and the results of our own investigations allowed us to
determine a rather complete list of parameters that may be used for the description of texts and language
resources. Each of these parameters was subjected to a consistency test. The methodology of the test was
elaborated using the method of least squares with a number of principal modifications. Our main results are
the following: 1) Theoretically, all analyzed parameters have either upper or lower limits, which means that
in principle they are statistically consistent. However, for most parameters actual consistency is achieved
only at very large sample sizes. 2) The most consistent parameters turned out to be the order coefficient,
the logarithmic concentration coefficient, entropy, the rank golden section, and the rank mean. Their rapid
convergence to the limit values makes it possible to perform classification procedures effectively on data
of arbitrary size. 3) The proposed approximation model makes it possible to forecast the values of all
parameters for any sample size.
1. Introduction. The problem of Statistical Consistency
One of the most important tasks in developing thesaurus-based Information Retrieval Systems, as
well as a variety of other NLP applications, is the construction of an optimal keywords dictionary.
Usually such a dictionary is built by means of free document indexing. As a result of this
approach, the systemic statistical parameters of the keywords dictionary (e.g., its size, rank mean
frequency, entropy, etc.) are functions of the size of the corpus that was subjected to the
indexing procedure. This raises the problem of the statistical consistency of parameters: which
statistical parameters may be rated as consistent in this case? To solve it we have to answer two
essential questions: 1) which system parameters of the keywords dictionary have a theoretical (or
empirical) limiting value, and which do not? 2) if this limit really exists, what speed of
convergence to the limit value characterizes each of these parameters? In fact, for the construction
and evaluation of keywords dictionaries obtained from text data of arbitrary size, only those
statistical parameters that converge rapidly to their limit values may properly be used.
2. Methods of data systematization in keywords frequency dictionaries
A keywords dictionary is usually a lexicographic compilation, each entry of which contains
the name of a lexical unit and accompanying statistical data of various kinds (e.g., the absolute
frequency of the lexeme concerned, its frequency rank, the quantity of lexical units with the identical
frequency, etc.). By analyzing the information accumulated in keywords dictionaries, it is possible to
build statistical distributions whose concrete types are determined by which particular information
functions as the dependent and the independent parameter. The main distributions are the
polynomial, rank, and spectral ones. In the polynomial distribution the varying name of the lexical
unit plays the role of the independent parameter, whereas its frequency acts as the dependent
parameter. In the rank distribution the independent parameter is the rank of the lexical unit, and the
dependent one is its frequency. In the spectral distribution the frequency of the lexical unit serves as
the independent parameter, and the quantity of lexical units with the identical frequency functions as
the dependent one. In the last two distributions the parameter "name" simply disappears.
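As an illustration of the three systematizations, the sketch below derives a rank and a spectral distribution from a toy frequency dictionary. The token list and the function names are invented for the example and do not come from the paper's data; ranks are assumed to run from the most frequent unit downwards.

```python
from collections import Counter

def rank_distribution(freqs):
    """Frequencies in decreasing order; the unit at index i has rank i + 1."""
    return sorted(freqs.values(), reverse=True)

def spectral_distribution(freqs):
    """Map each frequency F to the quantity of lexical units that have it."""
    return dict(sorted(Counter(freqs.values()).items()))

# Toy corpus (hypothetical). The Counter itself is the polynomial
# distribution: name of lexical unit -> frequency.
tokens = "ship fleet ship net trawler ship fleet net hull".split()
freqs = Counter(tokens)

ranks = rank_distribution(freqs)         # names are no longer present
spectrum = spectral_distribution(freqs)  # names are no longer present
print(ranks)
print(spectrum)
```

Both derived structures drop the names of the lexical units, which is the point made in the last sentence of the paragraph above.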
The detailed list of data systematization types that are used (or may be used) in lexicostatistical
practice is presented in Table 1. The table reveals that when cumulative distributions are converted
into rank ones, the values of the random variable and the statistical weights trade places: the variants
of the random variable become the values of the dependent variable (i.e., the function), whereas the
statistical weights become the values of the independent variable (i.e., the argument). In addition,
the following fact attracts attention: when data are converted into rank distributions, the accumulated
quantities of units turn into a sequence of natural numbers, i.e., into a rank sequence or rank scale.
Systematization type | Values of random variable | Statistical weights
---------------------|---------------------------|--------------------
Polynomial distribution | name of lexical unit | frequency of lexical unit (F)
Spectral distribution | frequency of lexical unit (F) | quantity of lexical units with the corresponding frequency
Spectral-cumulative distribution | frequency of lexical unit (F) | quantity of lexical units whose frequency does not exceed F
Spectral-decumulative¹ distribution | frequency of lexical unit (F) | quantity of lexical units whose frequency exceeds F
Increasing rank distribution | quantity of lexical units whose frequency does not exceed F (rank of the concrete lexical unit) | frequency of lexical unit (F)
Decreasing rank distribution | quantity of lexical units whose frequency exceeds F (rank of the concrete lexical unit) | frequency of lexical unit (F)
Cumulative-increasing rank distribution | rank of the concrete lexical unit | accumulated number of lexical unit occurrences
Cumulative-decreasing rank distribution | rank of the concrete lexical unit | accumulated number of lexical unit occurrences

Table 1. Systematization types of lexicostatistical data.
Drawing a parallel with mechanics, a rank distribution may be interpreted as a probabilistic
mass that is equal to unity and distributed along the rank axis so that the corresponding weights
(probabilities $p_i$) are concentrated at the points with coordinates $r_i$. To continue this mechanical
analogy, the mathematical expectation (or its empirical analogue, the arithmetic mean), calculated
with rank values as the random variable and frequencies (or their analogues) as the
corresponding statistical weights, may be regarded as the center of gravity of the rank distribution.
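The center-of-gravity reading can be sketched in a few lines. The function name and the sample frequencies are illustrative assumptions, and the decreasing rank distribution (rank 1 for the most frequent unit) is assumed.

```python
def rank_mean(frequencies):
    """Center of gravity of a decreasing rank distribution.

    Ranks 1, 2, 3, ... are the coordinates; relative frequencies are the weights.
    """
    ranked = sorted(frequencies, reverse=True)
    total = sum(ranked)
    return sum(rank * f for rank, f in enumerate(ranked, start=1)) / total

# Hypothetical frequencies of five lexical units.
print(rank_mean([3, 2, 2, 1, 1]))  # (1*3 + 2*2 + 3*2 + 4*1 + 5*1) / 9
```

For equal frequencies the result falls in the middle of the rank axis; the more the mass is concentrated on the first ranks, the closer the rank mean moves to 1.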
It is important to note that these connections between the different possible ways of presenting the
same lexicostatistical information, shown in Table 1, are usually not taken into account
by researchers (not only linguists, but also biologists, sociologists, psychologists, economists,
scientometrists, and others). In mathematical statistics, by contrast, such conversions are
clearly interpreted (Evans, Hastings & Peacock, 1993).
3. Types of scales and corresponding statistical parameters
Mathematical statistics distinguishes different scales (quantitative, ordinal, nominal) and has
developed a number of data processing methods, each applicable only to the appropriate
scale. The most developed system of techniques has been elaborated for quantitative parameters.
Based to a considerable extent on the theory of moments, it includes an advanced system of mean
and variance values, characteristics of distribution shape, etc., and also makes effective use of a
system of ordinal statistics (mode, median, quartile, etc.).

¹ The term "decumulative distribution" is used in econometrics in the study of population income (Lange,
1964). The values of the random variable reflect the income, whereas the number of persons whose
income exceeds the given one serves as the statistical weight.
The previous section showed that the central distributions used in processing frequency
dictionaries are the rank, spectral, and polynomial ones. Though rank and spectral distributions have
the outward appearance of quantitative scales, they are characterized by an extremely large variance of
parameters on both the rank and the frequency scale. This fact leads some researchers to doubt that
the theory of moments can be applied here (because the moments tend to infinity) and therefore to
suggest other characteristics that do not depend on the sample size (Khajtun, 1983; Shrejder
& Sharov, 1982). As for the polynomial distribution, the theory of moments cannot be applied to it
even theoretically, as its variation is qualitative in nature.
Analysis of recent scientific works and the results of our own investigations allowed us to
determine a rather complete list of parameters that may be used for the description of keywords
dictionaries (and of the lexicostatistical structure of texts as a whole). Table 2 presents the list of
these parameters, subdivided into three groups according to the type of scale.
Nominal scale:
  Mode (Mo)
  Dictionary size (N)
  Maximal frequency ($f_{max}$)
  Entropy (E)
  Maximal entropy ($E_{max}$)
  Order coefficient ($E/E_{max}$)

Quantitative (frequency) scale:
  Mean frequency ($f_{ave}$)
  Geometric mean frequency ($f_{g\,ave}$)
  Frequency variance coefficient ($V_f$)
  Frequency median ($Me_f$)
  Golden section ($G_f$)
  Diversity coefficient² (S)

Ordinal (rank) scale:
  Rank mean ($r_{ave}$) (differentiation coefficient)
  Rank variance coefficient ($V_r$)
  Rank median (Me) (equilibrium measure)
  Rank golden section ($G_r$)
  Rank mean deviation ($d_r$)
  Variation coefficient on $d_r$ ($V_r$)
  Coefficient of concentration ($r_{ave}/N$)
  Logarithmic concentration coefficient ($k = \log r_{ave} / \log N$)
  Zipf's indicator ($\gamma$) (the exponent in the formula of Zipf's rule, $f(r) = c/r^{\gamma}$)

Table 2. Statistical parameters that may be used for the description of the lexicostatistical structure of texts.
Among the parameters listed in Table 2, the most commonly used are dictionary size (N),
maximal frequency ($f_{max}$), mean frequency ($f_{ave}$), entropy (E), and Zipf's indicator. Our
register is noticeably wider and includes a large number of rank parameters, which have been
intensively explored in recent years by the author of this paper (Martynenko, 1988, 1989) and his
disciples (Grebennikov, 1998). Moreover, for some of these parameters a theoretical argument for their
statistical consistency has been given (Martynenko & Fomin, 1989).
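A few of the listed parameters can be computed directly from a list of frequencies, as in the following sketch. The paper does not spell out the formulas, so these are assumptions: Shannon entropy in bits, maximal entropy taken as the logarithm of the dictionary size, and ranks counted from the most frequent unit. All function names are hypothetical.

```python
import math

def entropy(freqs):
    """Shannon entropy E in bits (assumed definition)."""
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total) for f in freqs if f > 0)

def order_coefficient(freqs):
    """E / E_max, with E_max = log2(N) assumed; needs at least two units."""
    n = sum(1 for f in freqs if f > 0)
    return entropy(freqs) / math.log2(n)

def rank_mean(freqs):
    """Mean rank weighted by frequency, rank 1 being the most frequent unit."""
    ranked = sorted(freqs, reverse=True)
    return sum(r * f for r, f in enumerate(ranked, start=1)) / sum(ranked)

def log_concentration(freqs):
    """k = log(r_ave) / log(N), as written in Table 2."""
    return math.log(rank_mean(freqs)) / math.log(len(freqs))

def diversity(freqs):
    """S: number of lexical units whose absolute frequency equals 1."""
    return sum(1 for f in freqs if f == 1)

sample = [3, 2, 2, 1, 1]  # hypothetical frequencies
print(entropy(sample), order_coefficient(sample))
print(log_concentration(sample), diversity(sample))
```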
4. Test for consistency of statistical parameters
The methodology of the consistency test was elaborated using the method of least squares with a
number of principal modifications, necessitated by the complicated character of the parameters'
dependence on the sample size (Martynenko, 1988). In our hypothesis test the null hypothesis was
stated as "a parameter converges to its limit value" (the alternative hypothesis being "a parameter
increases or decreases without limit"). All parameters listed in Table 2 were subjected to this
consistency test with the use of approximating models.
² The diversity coefficient (S) is the number of words whose absolute frequency equals 1.
The approximation was carried out on the following material: 1) the keywords distribution
in the Information Retrieval System "Ships of the Fishing Fleet"; 2) the distribution of lexical units
in the frequency dictionary on radio electronics (Alekseev, 1965); and 3) the distribution of lexical
units in the frequency dictionary of the English language (Kučera & Francis, 1967).
For the approximation of the empirical distributions a broad spectrum of theoretical functions was
used. For example, for increasing dependencies the following functions of unlimited growth were
used:
$y = a x^{b}$ (power function)
$y = a e^{cx}$ (exponential function)
$y = a (\log x)^{b}$ (logarithmic function)
Taking the logarithm of these functions easily converts them into linear dependencies, thus
allowing the method of least squares to be used for the approximation in its canonical form.
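For the power function, for instance, the linear form is ln y = ln a + b ln x, so an ordinary straight-line fit in the logarithmic coordinates recovers a and b. The sketch below shows this on synthetic points; the helper names and the data are assumptions made for the illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_power(xs, ys):
    """Fit y = a * x**b via the linear form ln y = ln a + b ln x."""
    ln_a, b = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
    return math.exp(ln_a), b

# Synthetic points lying exactly on y = 3 * x**0.5.
xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 0.5 for x in xs]
a, b = fit_power(xs, ys)
print(a, b)
```

Note that least squares in logarithmic coordinates minimizes relative rather than absolute deviations, which is a property of the linearization itself.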
The list of functions of asymptotic growth is more extensive. Here we mention just some of them:
$y = k - a/x^{b}$ (power function)
$y = k - k/e^{cx}$ (exponential function)
$y = k - k e^{-q x^{b}}$ (combination of power and exponential functions: the Weibull function)
$y = k/e^{(c/x)^{b}}$ (one more combination of power and exponential functions)
$y = k/(1 + a/x^{b})$ (fractional/power function)
$y = k/(1 + q/e^{cx})$ (fractional/exponential function)
$y = k/(1 + a/(\ln x)^{b})$ (fractional/logarithmic function)
The enumerated (and some other) functions may be transformed into linear ones by taking the
logarithm once or repeatedly. Thus, the linear variants of some of the above-mentioned functions are
as follows:
$\ln\frac{k}{k-y} = \ln\frac{k}{a} + b \ln x$ for the power function
$\ln\ln\frac{k}{k-y} = \ln q + b \ln x$ for the Weibull function
$\ln\left(\frac{k}{y} - 1\right) = \ln q - cx$ for the fractional/exponential function
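As a worked check of the Weibull case, the double logarithm can be derived step by step from the definition of the function given above, using only the symbols k, q, b, x, and y of the text and assuming 0 < y < k, x > 0, and q > 0:

```latex
\begin{align*}
y &= k - k e^{-q x^{b}} \\
\frac{k - y}{k} &= e^{-q x^{b}} \\
\ln\frac{k}{k - y} &= q x^{b} \\
\ln\ln\frac{k}{k - y} &= \ln q + b \ln x
\end{align*}
```

With the asymptote k held fixed, the points with coordinates ln x and ln ln(k/(k-y)) therefore lie on a straight line with slope b and intercept ln q.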
It is known that solving the system of normal equations for a linear dependency is a routine
task, and finding the unknown parameters presents no particular difficulty.
In our case, however, one of the parameters (the asymptote k) enters into the dependent variable,
which makes it impossible to use the method of least squares in its pure form. We propose the
following solution to this problem. First, we assigned to the asymptote a series of concrete values
with a fixed step, beginning from the maximum parameter value observed in the empirical series.
Then, at each step of the processing cycle, we calculated the values of the other parameters for the
fixed asymptote value by the method of least squares, and immediately tested, with the $\chi^2$
criterion at the 0.05 significance level, the hypothesis that the differences between the empirical and
theoretical distributions are not statistically significant. If the function disagreed with the empirical
data at every step, we abandoned it and tested another function, trying to determine the range of its
values that meets the requirements of the $\chi^2$ criterion.
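The stepping procedure can be sketched for the Weibull function as follows. This is a simplified illustration, not the author's program: instead of testing each step against the critical value at the 0.05 level, it keeps the asymptote with the smallest Pearson statistic, and the data are synthetic points generated from a known curve. All names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def weibull(x, k, q, b):
    """y = k - k * exp(-q * x**b)."""
    return k - k * math.exp(-q * x ** b)

def fit_weibull_grid(xs, ys, k_values):
    """Try each candidate asymptote k; fit q and b on the linearized form.

    Requires x > 0 and y > 0. Returns (chi2, k, q, b) for the candidate with
    the smallest Pearson statistic (a simplification of the 0.05-level test).
    """
    best = None
    for k in k_values:
        if k <= max(ys):
            continue  # ln(k / (k - y)) is undefined unless k exceeds every y
        u = [math.log(x) for x in xs]
        v = [math.log(math.log(k / (k - y))) for y in ys]
        ln_q, b = fit_line(u, v)
        q = math.exp(ln_q)
        expected = [weibull(x, k, q, b) for x in xs]
        chi2 = sum((y - e) ** 2 / e for y, e in zip(ys, expected))
        if best is None or chi2 < best[0]:
            best = (chi2, k, q, b)
    return best

# Synthetic series generated from k = 1120, q = 0.78, b = 0.75.
xs = [1, 2, 3, 4, 5, 6]
ys = [weibull(x, 1120, 0.78, 0.75) for x in xs]
chi2, k, q, b = fit_weibull_grid(xs, ys, range(1070, 1201, 10))
print(k, q, b)
```

On real data the Pearson statistic of each step would be compared with the tabulated critical value for the appropriate number of degrees of freedom, as the text describes.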
All parameters listed in Table 2 were subjected to this processing. It turned out that in the
overwhelming majority of cases the Weibull function showed the best agreement with the empirical
data, though the degree of agreement differs for different parameters. The parameters best fitted by
the Weibull function are the following (in order of decreasing consistency):
1. Order coefficient ($E/E_{max}$)
2. Logarithmic concentration coefficient ($k = \log r_{ave} / \log N$)
3. Entropy (E)
4. Rank golden section ($G_r$)
5. Rank mean ($r_{ave}$)
For these parameters the Weibull function showed the best agreement with the empirical distribution.
The other parameters agree less well with the empirical data. It is noteworthy that rank indices, as
well as parameters connected in some way with entropy, dominate among the most consistent
parameters. Taken together, these parameters characterize the investigated text population from
the point of view of 1) the redistribution of functional activity among the constituent elements of
the population, 2) the concentration and dispersion of this activity, and 3) the share held by each
element and by groups of elements. Thus, these statistical characteristics are in fact system-forming
ones.
Table 3 demonstrates the approximating potential of the Weibull function and its forecasting abilities,
using two parameters as examples: dictionary size and rank mean. The Weibull approximations for
the demonstration material are the following:
Sample size L (thousands of word usages) | N, empirical | N, theoretical | $r_{ave}$, empirical | $r_{ave}$, theoretical
-----------------------------------------|--------------|----------------|----------------------|-----------------------
Keywords dictionary of the Information Retrieval System "Ships of the Fishing Fleet"
1 | 610 | 606.6 | 106.1 | 104.9
2 | 819 | 818.3 | 147.2 | 142.6
3 | 932 | 930.8 | 161.0 | 167.8
4 | 1001 | 996.7 | 172.5 | 170.5
5 | 1034 | 1037.5 | 174.1 | 175.8
5.64 | 1049 | 1055.5 | 178.8 | 177.9
10 (forecast) | | 1106 | | 182.4
Frequency dictionary on radio electronics
50 | 5399 | 5421 | 621 | 626
100 | 7853 | 7827 | 751 | 732
150 | 9361 | 9419 | 754 | 763
200 | 10582 | 10565 | 772 | 777
500 (forecast) | | 13672 | | 780
1000 (forecast) | | 14880 | | 784
Frequency dictionary of the English language by Kučera & Francis
10.0 | 3009 | 3133 | 546 | 543
101.0 | 13706 | 13711 | 1256 | 1317
253.5 | 23655 | 23878 | 1937 | 1814
1014.2 | 50406 | 50492 | 2730 | 2773
10000.0 (forecast) | | 104922 | | 4371
100000.0 (forecast) | | 112500 | | 4887

Table 3. Dependencies of keywords dictionary parameters (dictionary size N and rank mean $r_{ave}$) on
sample size L.
The theoretical formulas for the dependencies of dictionary size N and rank mean $r_{ave}$ on sample
size L turned out to be as follows:
1. Keywords dictionary of the Information Retrieval System "Ships of the Fishing Fleet"
$N = 1120 - 1120\,e^{-0.78 L^{0.75}}$
$r_{ave} = 183 - 183\,e^{-0.85 L^{0.83}}$
(1)
2. Frequency dictionary on radio electronics (Alekseev, 1965)
$N = 15250 - 15250\,e^{-0.027 L^{0.713}}$
$r_{ave} = 780 - 780\,e^{-0.0766 L^{0.780}}$
3. Frequency dictionary of the English language (Kučera & Francis, 1967)
$N = 112500 - 112500\,e^{-0.00618 L^{0.660}}$
$r_{ave} = 4950 - 4950\,e^{-0.0424 L^{0.433}}$
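A short sketch of how such formulas yield forecasts: the function below evaluates the Weibull curve for a given sample size L, taken in thousands of word usages as in Table 3. The coefficients are those printed above; the function name and the dictionary layout are assumptions of the example.

```python
import math

def weibull_growth(L, k, q, b):
    """Value of k - k * exp(-q * L**b) for sample size L (thousands of usages)."""
    return k - k * math.exp(-q * L ** b)

# Coefficients (k, q, b) as printed in the formulas above.
COEFFS = {
    ("ships", "N"): (1120, 0.78, 0.75),
    ("ships", "r_ave"): (183, 0.85, 0.83),
    ("radio", "N"): (15250, 0.027, 0.713),
}

# Forecast rows of Table 3 list 1106, 182.4, and 13672 for these three cases.
print(weibull_growth(10, *COEFFS[("ships", "N")]))
print(weibull_growth(10, *COEFFS[("ships", "r_ave")]))
print(weibull_growth(500, *COEFFS[("radio", "N")]))
```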
Table 3 demonstrates that the rank means of keywords dictionaries stabilize at a comparatively
small sample size. The best way to make sure of this is to calculate the sample size that corresponds
to a given value of the rank mean. Let us consider the example where the rank mean amounts to 99% of
the theoretical one (which is the maximum possible). Substituting the corresponding data into
formula (1), we obtain L = 10730 keyword usages, which approximately equals 2000 documents.
The rank means of standard frequency dictionaries created for specialized scientific and
technical text corpora also stabilize rather fast (taking into account, of course, their greater lexical
diversity as compared with that of ordinary keywords dictionaries). For the frequency dictionary on
radio electronics by Alekseev, the 99% level of the rank mean is reached when the sample size reaches
191000 word usages. This is even slightly less than the standard level of 200000 word usages adopted
by the All-Russian Research Group "Speech Statistics" (Alekseev, 1975).
As for the dictionary size, all dictionaries reach its 99% level considerably more slowly. Thus, for
the keywords dictionary "Ships of the Fishing Fleet" it amounts to 11000 word usages, for the
frequency dictionary on radio electronics to 38000000 word usages, and for the frequency dictionary
of the English language to 22485000 word usages.
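The sample size at which a Weibull curve reaches a given share p of its asymptote follows from inverting the function: L = (-ln(1 - p) / q)^(1/b), with the asymptote k cancelling out. The sketch below applies this to two of the cases discussed above, with L in thousands of word usages; the function names are hypothetical and the coefficients are the rounded ones printed in the formulas.

```python
import math

def weibull_growth(L, k, q, b):
    """Value of k - k * exp(-q * L**b)."""
    return k - k * math.exp(-q * L ** b)

def sample_size_for_level(p, q, b):
    """Sample size L at which the curve reaches the share p of its asymptote."""
    return (-math.log(1.0 - p) / q) ** (1.0 / b)

# Rank mean of the radio electronics dictionary: close to the 191 thousand
# word usages quoted in the text.
radio_rank_mean = sample_size_for_level(0.99, q=0.0766, b=0.780)

# Dictionary size of the English frequency dictionary: close to the
# 22485 thousand word usages quoted in the text.
english_size = sample_size_for_level(0.99, q=0.00618, b=0.660)

print(radio_rank_mean, english_size)
```

Because the printed coefficients are rounded, values recomputed this way can differ somewhat from the figures quoted in the text.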
Conclusion
Our main results are the following:
1) Theoretically, all keywords dictionary parameters have either upper or lower limits, which means
that in principle they are statistically consistent. However, for most parameters actual consistency is
achieved only at very large sample sizes, which are hardly attainable in ordinary NLP tasks.
2) The most consistent parameters turned out to be (in decreasing order): order coefficient
($E/E_{max}$), logarithmic concentration coefficient ($k = \log r_{ave} / \log N$), entropy (E), rank
golden section ($G_r$), and rank mean ($r_{ave}$) (differentiation coefficient). These parameters, along
with some others, provide a tool for describing the system characteristics of any thesaurus or keywords
dictionary. Moreover, their rapid convergence to the limit values makes it possible to perform
classification procedures effectively on data of arbitrary size.
3) The proposed approximation model makes it possible to forecast the values of all keywords
dictionary parameters for any sample size.
4) The higher the thematic and functional specialization of lexical units, the higher the consistency
of their parameters. Thus, the statistical consistency of a keywords dictionary is higher than that of a
specialized frequency dictionary (in our example, the radio electronics one). At the same time, the
parameter consistency of both these dictionaries is higher than that of dictionaries describing the
language as a whole (e.g., the frequency dictionary of the English language).
References
Alekseev, P.M. (1965). Frequency dictionary of English sublanguage of electronics. Abstract of Ph.D. thesis. Leningrad.
Alekseev, P.M. (1975). Statistic Lexicography (Typology, compiling and applications of frequency dictionaries). Leningrad: Edition of Leningrad Pedagogical Institute.
Evans, M., Hastings, N., & Peacock, B. (1993). Statistical Distributions. New York: Wiley.
Grebennikov, A.O. (1998). To the problem of statistics consistency of the fiction frequency dictionary. Structural and Applied linguistics. Issue 5. St. Petersburg: Edition of St. Petersburg State University. 110-112.
Khajtun, S.D. (1983). Scientific measurement. Present Conditions and Perspectives. Moscow: Nauka.
Kučera, H. & Francis, W.N. (1967). Computational analysis of present-day American English. Providence.
Lange, O. (1964). Introduction into econometrics. Moscow: Progres.
Martynenko, G.Y. & Fomin, S.V. (1989). Rank moments. Nauchno-technicheskaya informacija. Series 2. N 8. 9-14.
Martynenko, G.Y. (1988). Osnovy stilemetrii (Fundamentals of Stylometrics). Leningrad: Edition of Leningrad State University.
Martynenko, G.Y. (1989). About statistic characteristics of rank distributions. Quantitative linguistics and automatic text analysis. Tartu. 50-68.
Shrejder, Y.A. & Sharov, A.A. (1982). Systems and Models. Moscow: Radio i sviaz.