10-1.1.1.25.6859


  • 7/31/2019 10-1.1.1.25.6859


    Statistical Consistency of Keywords Dictionary Parameters

    Grigorij Martynenko

    Department of Computational Linguistics, St. Petersburg State University

    Universitetskaya nab. 11

    St. Petersburg 199034, [email protected]

    Abstract

    The construction of an optimal keywords dictionary is one of the most important tasks in developing thesaurus-based Information Retrieval Systems and some other NLP applications. The problem is which parameters of keywords dictionaries may be rated as statistically consistent, that is, not dependent on sample size. Analysis of recent scientific works and the results of our own investigations allowed us to determine a rather complete list of parameters which may be used for the description of texts and language resources. Each of these parameters was subjected to a consistency test. The methodology of the test was elaborated using the method of least squares with a number of principal modifications. Our main results are the following: 1) Theoretically, all analyzed parameters have either upper or lower limits, which means that in principle they are statistically consistent. However, for most parameters actual consistency is achieved only at very large sample sizes. 2) The most consistent parameters turned out to be the order coefficient, the logarithmic concentration coefficient, entropy, the rank golden section, and the rank mean. Their rapid convergence to the limit values makes it possible to perform classification procedures effectively on data of arbitrary size. 3) The proposed approximation model makes it possible to forecast the values of all parameters for any sample size.

    1. Introduction. The problem of Statistical Consistency

    One of the most important tasks in developing thesaurus-based Information Retrieval Systems, as well as a variety of other NLP applications, is the construction of an optimal keywords dictionary. Usually such a dictionary is built by means of free document indexing. As a result of this approach, the system-statistical parameters of the keywords dictionary (e.g., its size, rank mean frequency, entropy, etc.) are functions of the size of the corpus that was subjected to the indexing procedure. Here the problem of the statistical consistency of parameters arises: which statistical parameters may be rated as consistent in this case? In order to solve it we have to answer two essential questions: 1) which system parameters of the keywords dictionary have a theoretical (or empirical) upper limit value, and which do not? 2) if this limit really exists, how fast does each of these parameters converge to it? In fact, only those statistical parameters that converge rapidly to their limit values can properly be used for the construction and evaluation of keywords dictionaries obtained from text data of arbitrary size.

    2. Methods of data systematization in keywords frequency dictionaries

    A keywords dictionary is usually a lexicographic compilation, each entry of which contains the name of a lexical unit and the accompanying statistical data of different kinds (e.g., the absolute frequency of the lexeme concerned, its frequency rank, the quantity of lexical units with the identical frequency, etc.). By analyzing the information accumulated in keywords dictionaries, it is possible to build statistical distributions whose concrete types are determined by which information functions as the dependent and which as the independent parameter. The main distributions are the following: polynomial, rank, and spectral. In the polynomial distribution the varying name of the lexical unit plays the role of the independent parameter, whereas its frequency acts as the dependent parameter. In the rank distribution the independent parameter is the rank of the lexical unit, while the dependent one is its frequency. In the spectral distribution the frequency of the lexical unit serves as the independent parameter, and the quantity of lexical units with the identical frequency functions as the dependent one. In the last two distributions the parameter "name" simply disappears.
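
    The three distributions can be derived from one list of keyword occurrences. The following is a minimal sketch, not the author's procedure: the function name build_distributions and the sample tokens are assumptions made for illustration only.

```python
from collections import Counter


def build_distributions(tokens):
    """Return the polynomial, rank and spectral distributions of a token list."""
    # Polynomial distribution: name of lexical unit -> frequency.
    polynomial = Counter(tokens)
    # Decreasing rank distribution: rank -> frequency (rank 1 is the most
    # frequent unit); the names of the units are no longer present.
    freqs_desc = sorted(polynomial.values(), reverse=True)
    rank = {r: f for r, f in enumerate(freqs_desc, start=1)}
    # Spectral distribution: frequency -> quantity of units with that frequency.
    spectral = dict(Counter(polynomial.values()))
    return dict(polynomial), rank, spectral


# Hypothetical sample of keyword occurrences.
tokens = "net ship net trawl ship net hull".split()
polynomial, rank, spectral = build_distributions(tokens)
print(polynomial)
print(rank)
print(spectral)
```

    Only the polynomial distribution keeps the unit names; the rank and spectral distributions are built from the frequency values alone.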


    The detailed list of data systematization types which are used (or may be used) in lexicostatistical practice is presented in Table 1. The table reveals that when cumulative distributions are converted into rank ones, the values of the random variable and the statistical weights trade places: the variants of the random variable become the values of the dependent variable (i.e. the function), whereas the statistical weights become the values of the independent variable (i.e. the argument). In addition, the following fact attracts attention: when data are converted into rank distributions, the accumulated quantities of units turn into the sequence of natural numbers, i.e. into a rank sequence or rank scale.

    Systematization type | Statistical data: values of random variable | Statistical data: statistical weights
    Polynomial distribution | name of lexical unit | frequency of lexical unit (F)
    Spectral distribution | frequency of lexical unit (F) | quantity of lexical units with the corresponding frequency
    Spectral-cumulative distribution | frequency of lexical unit (F) | quantity of lexical units whose frequency does not exceed F
    Spectral-decumulative¹ distribution | frequency of lexical unit (F) | quantity of lexical units whose frequency exceeds F
    Increasing rank distribution | quantity of lexical units whose frequency does not exceed F (rank of the given lexical unit) | frequency of lexical unit (F)
    Decreasing rank distribution | quantity of lexical units whose frequency exceeds F (rank of the given lexical unit) | frequency of lexical unit (F)
    Cumulative-increasing rank distribution | rank of the given lexical unit | accumulated number of lexical unit occurrences
    Cumulative-decreasing rank distribution | rank of the given lexical unit | accumulated number of lexical unit occurrences

    Table 1. Systematization types of lexicostatistical data.

    Drawing a parallel with mechanics, a rank distribution may be interpreted as a probabilistic mass which is equal to unity and is distributed along the rank axis so that the corresponding weights (probabilities p_i) are concentrated at the points with coordinates r_i. Continuing this mechanical analogy, the mathematical expectation (or its empirical analogue, the arithmetic mean), calculated with rank values as the random variable and frequencies (or their analogues) as the corresponding statistical weights, may be regarded as the center of gravity of the rank distribution.
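
    The center-of-gravity reading can be written out explicitly. The notation below is an assumption introduced for illustration: f_i is the frequency of the unit with rank r_i, N is the dictionary size, and \bar{r} is the rank mean.

```latex
\[
  p_i = \frac{f_i}{\sum_{j=1}^{N} f_j}, \qquad
  \sum_{i=1}^{N} p_i = 1, \qquad
  \bar{r} = \sum_{i=1}^{N} r_i \, p_i .
\]
```

    For example, the frequencies 3, 2, 1, 1 at ranks 1 to 4 give \bar{r} = (1·3 + 2·2 + 3·1 + 4·1) / 7 = 2.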

    It is important to note that these connections between the different possible ways of presenting the same lexicostatistical information, shown in Table 1, are usually not taken into account by researchers (not only linguists, but also biologists, sociologists, psychologists, economists, scientometrists, and others). In mathematical statistics, by contrast, similar conversions are clearly interpreted (Evans, Hastings & Peacock, 1993).

    3. Types of scales and corresponding statistic parameters

    Mathematical statistics distinguishes different scales (quantitative, ordinal, nominal) and has developed a number of data processing methods, each applicable only to the appropriate scale. The most developed system of techniques has been elaborated for quantitative parameters. Based to a considerable extent on the theory of moments, it comprises an advanced system of mean and variance values, characteristics of distribution shape, etc., and also makes effective use of a system of ordinal statistics (mode, median, quartile, etc.).

    ¹ The term "decumulative distribution" is used in econometrics in the study of population income (Lange, 1964). The values of the random variable reflect the income measurement, whereas the number of persons whose income exceeds the given one becomes the statistical weight.

    In the previous section it was shown that the central distributions used in processing frequency dictionaries are the rank, spectral, and polynomial ones. Though rank and spectral distributions have the outward appearance of quantitative scales, they are characterized by an extremely great variance of parameters on both the rank and the frequency scale. This fact leads some researchers to doubt that the theory of moments can be applied here (because the moments tend to infinity), and therefore to suggest other characteristics that do not depend on the sample size (Khajtun, 1983; Shrejder & Sharov, 1982). As for the polynomial distribution, the theory of moments cannot be applied to it even in theory, since its variation is qualitative in nature.

    Analysis of recent scientific works and the results of our own investigations allowed us to determine a rather complete list of parameters which may be used for the description of keywords dictionaries (and of the lexicostatistical structure of texts as a whole). Table 2 presents the list of these parameters, subdivided into three groups according to the type of scale.

    Nominal scale:
        Mode (Mo)
        Dictionary size (N)
        Maximal frequency (f_max)
        Entropy (E)
        Maximal entropy (E_max)
        Order coefficient (E/E_max)

    Quantitative (frequency) scale:
        Mean frequency (f_ave)
        Geometric mean frequency (f_g ave)
        Frequency variance coefficient (V_f)
        Frequency median (Me_f)
        Golden section (G_f)
        Diversity coefficient² (S)

    Ordinal (rank) scale:
        Rank mean (r_ave) (differentiation coefficient)
        Rank variance coefficient (V_r)
        Rank median (Me) (equilibrium measure)
        Rank golden section (G_r)
        Rank mean deviation (d_r)
        Variation coefficient on d_r (V_r)
        Coefficient of concentration (r_ave/N)
        Logarithmic concentration coefficient (k = log r_ave / log N)
        Zipf's indicator (γ) (the exponent in the formula of Zipf's rule, $f(r) = c / r^{\gamma}$)

    Table 2. Statistical parameters which may be used for the description of the lexicostatistical structure of a text.

    Among the parameters listed in Table 2, the following are the most commonly used: dictionary size (N), maximal frequency (f_max), mean frequency (f_ave), entropy (E), and Zipf's indicator. Our register is noticeably wider and includes a large number of rank parameters, which have been intensively explored in recent years by the author of this paper (Martynenko, 1988, 1989) and his students (Grebennikov, 1998). Moreover, for some of these parameters a theoretical justification of their statistical consistency has been given (Martynenko & Fomin, 1989).
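
    Several of the parameters in Table 2 can be computed directly from a list of frequencies. The sketch below is illustrative only and rests on assumptions that the text does not state: entropy is taken as Shannon entropy with natural logarithms, E_max as log N, and the rank mean as the frequency-weighted mean of ranks in the decreasing rank distribution. The function name dictionary_parameters is hypothetical.

```python
import math


def dictionary_parameters(frequencies):
    """Compute a few Table 2 parameters from absolute unit frequencies."""
    freqs = sorted(frequencies, reverse=True)  # decreasing rank distribution
    n = len(freqs)                             # dictionary size N
    if n < 2:
        raise ValueError("at least two lexical units are required")
    total = sum(freqs)                         # sample size in word usages
    probs = [f / total for f in freqs]
    # Assumption: Shannon entropy with natural logarithms.
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(n)
    # Rank mean: ranks weighted by relative frequencies.
    rank_mean = sum(r * p for r, p in enumerate(probs, start=1))
    return {
        "N": n,
        "f_max": freqs[0],
        "f_ave": total / n,
        "E": entropy,
        "E_max": max_entropy,
        "order": entropy / max_entropy,          # order coefficient E/E_max
        "r_ave": rank_mean,
        "concentration": rank_mean / n,          # r_ave / N
        "k": math.log(rank_mean) / math.log(n),  # log r_ave / log N
        "S": sum(1 for f in freqs if f == 1),    # units with frequency 1
    }


params = dictionary_parameters([3, 2, 1, 1])
for name, value in params.items():
    print(name, value)
```

    The ratio k = log r_ave / log N does not depend on the base of the logarithm, so that choice only affects E and E_max.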

    4. Test for consistency of statistical parameters

    The methodology of the consistency test was elaborated using the method of least squares with a number of principal modifications, caused by the complicated character of the dependence of the parameters on the sample size (Martynenko, 1988). In our hypothesis test the null hypothesis was stated as "a parameter converges to its limit value" (the alternative hypothesis being "a parameter increases or decreases without limit"). All parameters listed in Table 2 were subjected to this consistency test with the use of approximating models.

    ² The diversity coefficient (S) is the number of words whose absolute frequency equals 1.


    The approximation was carried out on the following material: 1) the keywords distribution in the Information Retrieval System "Ships of the Fishing Fleet"; 2) the distribution of lexical units in the frequency dictionary on radio electronics (Alekseev, 1965); and 3) the distribution of lexical units in the frequency dictionary of the English language (Kučera & Francis, 1967).

    A broad spectrum of theoretical functions was used for the approximation of the empirical distributions. For example, for increasing dependencies the following functions of unlimited growth were used:

    $y = a x^{b}$            power function

    $y = a e^{cx}$           exponential function

    $y = a (\log x)^{b}$     logarithmic function

    Taking the logarithm of these functions easily converts them into linear dependencies, which allows the method of least squares to be used for the approximation in its canonical form.
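
    As an illustration of this linearization, the power function $y = a x^{b}$ becomes $\ln y = \ln a + b \ln x$, a straight line in the variables $\ln x$ and $\ln y$. The sketch below fits it with ordinary least squares using only the standard library; the function names and the synthetic data are assumptions made for the example.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_power(xs, ys):
    """Fit y = a * x**b through the linear form ln y = ln a + b ln x."""
    b, ln_a = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
    return math.exp(ln_a), b


# Synthetic, noise-free data generated from y = 3 * x**0.5.
xs = [1, 2, 4, 8]
ys = [3 * x ** 0.5 for x in xs]
a, b = fit_power(xs, ys)
print(a, b)
```

    Note that least squares on the logarithms minimizes relative rather than absolute deviations, which is a property of this linearized form.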

    The list of functions of asymptotic growth is more extensive. Here we mention just some of them:

    $y = k - a / x^{b}$                  power function

    $y = k - k / e^{cx}$                 exponential function

    $y = k - k e^{-q x^{b}}$             combination of power and exponential functions (Weibull function)

    $y = k / e^{(c/x)^{b}}$              one more combination of power and exponential functions

    $y = k / (1 + a / x^{b})$            fractional/power function

    $y = k / (1 + q / e^{cx})$           fractional/exponential function

    $y = k / (1 + a / (\ln x)^{b})$      fractional/logarithmic function

    The enumerated (and some other) functions can be transformed into linear ones by taking their logarithm once or repeatedly. For example, the linear variants of some of the above-mentioned functions are as follows:

    $\ln\frac{k}{k-y} = \ln\frac{k}{a} + b \ln x$        for the power function

    $\ln\ln\frac{k}{k-y} = \ln q + b \ln x$              for the Weibull function

    $\ln\left(\frac{k}{y} - 1\right) = \ln q - c x$      for the fractional/exponential function

    It is well known that solving the system of normal equations for a linear dependency is a routine task, and finding the unknown parameters presents no particular difficulty. In our case, however, one of the parameters (the asymptote k) enters the dependent variable, which makes it impossible to use the method of least squares in its pure form. We propose the following solution to this problem. First, we assigned to the asymptote a series of concrete values with a fixed step, beginning from the maximum parameter value obtained in the empirical series. Then, at each step of the processing cycle, we calculated the values of the other parameters for the fixed asymptote value by the method of least squares, and immediately tested, with the χ² criterion at the 0.05 significance level, the hypothesis that the differences between the empirical and theoretical distributions are not statistically significant. If a function disagreed with the empirical data at every step, we abandoned it and tested another function, trying to determine the range of its values that meets the requirements of the χ² criterion.
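
    The scan over asymptote values can be sketched as follows for the Weibull function. This is a simplified illustration and not the author's implementation: it uses the linear form $\ln\ln\frac{k}{k-y} = \ln q + b \ln x$, computes a Pearson-type χ² statistic for each candidate asymptote, and returns the candidate with the smallest statistic. The comparison with the tabulated critical value at the 0.05 level is left to the caller, the grid starts one step above the empirical maximum so that the logarithms are defined, and all function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weibull(x, k, q, b):
    """Weibull growth curve y = k - k * exp(-q * x**b)."""
    return k - k * math.exp(-q * x ** b)


def asymptote_grid(ys, step, count):
    """Candidate asymptotes max(ys) + step, max(ys) + 2 * step, ..."""
    top = max(ys)
    return [top + step * i for i in range(1, count + 1)]


def fit_weibull(xs, ys, k_values):
    """For each fixed k fit q and b by least squares; keep the best k.

    Requires x > 0 and 0 < y < k for every point and every candidate k.
    Returns (chi2, k, q, b) with the smallest chi-square statistic.
    """
    log_x = [math.log(x) for x in xs]
    results = []
    for k in k_values:
        v = [math.log(math.log(k / (k - y))) for y in ys]
        b, ln_q = fit_line(log_x, v)
        q = math.exp(ln_q)
        fitted = [weibull(x, k, q, b) for x in xs]
        chi2 = sum((y - t) ** 2 / t for y, t in zip(ys, fitted))
        results.append((chi2, k, q, b))
    return min(results)


# Synthetic, noise-free data generated with k = 1000, q = 0.5, b = 0.8.
xs = [1, 2, 3, 4, 5, 6]
ys = [weibull(x, 1000.0, 0.5, 0.8) for x in xs]
chi2, k, q, b = fit_weibull(xs, ys, [900.0, 950.0, 1000.0, 1050.0, 1100.0])
print(k, round(q, 3), round(b, 3))
```

    In practice the candidate list would come from asymptote_grid(ys, step, count), and a function would be rejected when no candidate asymptote gives a statistic below the critical value.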

    All parameters listed in Table 2 were subjected to this processing. It turned out that in the overwhelming majority of cases the Weibull function showed the best agreement with the empirical data, though the degree of agreement differs for different parameters. The parameters that are most consistent (for the Weibull function) are the following, in order of decreasing consistency:


    1. Order coefficient (E/E_max)

    2. Logarithmic concentration coefficient (k = log r_ave / log N)

    3. Entropy (E)

    4. Rank golden section (G_r)

    5. Rank mean (r_ave)

    For these parameters the Weibull function showed the best agreement with the empirical distribution.

    The other parameters agree less well with the empirical data. It is noteworthy that rank indices dominate among the most consistent parameters, together with parameters connected in some way with entropy. Taken together, these parameters characterize the investigated text population from the point of view of 1) the redistribution of functional activity among the constituent elements of the population, 2) the concentration and dispersion of this activity, and 3) the share held by each element and by groups of elements. Thus these statistical characteristics are in fact system-forming ones.

    Table 3 demonstrates the approximation potential of the Weibull function and its forecasting abilities on the example of two parameters, dictionary size and rank mean. The results of the Weibull approximation for the demonstration material are the following:

    Sample size L (thousands of word usages) | N, empirical | N, theoretical | r_ave, empirical | r_ave, theoretical
    Keywords dictionary of the Information Retrieval System "Ships of the Fishing Fleet"
    1 | 610 | 606.6 | 106.1 | 104.9
    2 | 819 | 818.3 | 147.2 | 142.6
    3 | 932 | 930.8 | 161.0 | 167.8
    4 | 1001 | 996.7 | 172.5 | 170.5
    5 | 1034 | 1037.5 | 174.1 | 175.8
    5.64 | 1049 | 1055.5 | 178.8 | 177.9
    10 (forecast) | | 1106 | | 182.4
    Frequency dictionary on radio electronics
    50 | 5399 | 5421 | 621 | 626
    100 | 7853 | 7827 | 751 | 732
    150 | 9361 | 9419 | 754 | 763
    200 | 10582 | 10565 | 772 | 777
    500 (forecast) | | 13672 | | 780
    1000 (forecast) | | 14880 | | 784
    Frequency dictionary of the English language by Kučera & Francis
    10.0 | 3009 | 3133 | 546 | 543
    101.0 | 13706 | 13711 | 1256 | 1317
    253.5 | 23655 | 23878 | 1937 | 1814
    1014.2 | 50406 | 50492 | 2730 | 2773
    10000.0 (forecast) | | 104922 | | 4371
    100000.0 (forecast) | | 112500 | | 4887

    Table 3. Dependencies of keywords dictionary parameters (dictionary size N and rank mean r_ave) on sample size L.

    The theoretical formulas for the dependencies of dictionary size N and rank mean r_ave on sample size L turned out to be as follows:


    1. Keywords dictionary of the Information Retrieval System "Ships of the Fishing Fleet"

    $N = 1120 - 1120\, e^{-0.78 L^{0.75}}$

    $r_{ave} = 183 - 183\, e^{-0.85 L^{0.83}}$        (1)

    2. Frequency dictionary on radio electronics (Alekseev, 1965)

    $N = 15250 - 15250\, e^{-0.027 L^{0.713}}$

    $r_{ave} = 780 - 780\, e^{-0.0766 L^{0.780}}$

    3. Frequency dictionary of the English language (Kučera & Francis, 1967)

    $N = 112500 - 112500\, e^{-0.00618 L^{0.660}}$

    $r_{ave} = 4950 - 4950\, e^{-0.0424 L^{0.433}}$
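
    The forecast values in Table 3 can be reproduced by evaluating these formulas directly. The short sketch below does this for the "Ships of the Fishing Fleet" dictionary at L = 10 (thousands of word usages); the function name weibull is an assumption made for the example, while the coefficients are those given above.

```python
import math


def weibull(sample_size, k, q, b):
    """Evaluate y = k - k * exp(-q * L**b) for sample size L (in thousands)."""
    return k - k * math.exp(-q * sample_size ** b)


# Coefficients of the "Ships of the Fishing Fleet" formulas.
forecast_size = weibull(10, k=1120, q=0.78, b=0.75)
forecast_rank_mean = weibull(10, k=183, q=0.85, b=0.83)
print(round(forecast_size), round(forecast_rank_mean, 1))
```

    The two results can be compared with the forecast row for L = 10 in Table 3.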

    Table 3 demonstrates that the rank means of keywords dictionaries stabilize at comparatively small sample sizes. The best way to verify this is to calculate the sample size which corresponds to a given value of the rank mean. Let us consider the example where the rank mean amounts to 99% of the theoretical one (which is the maximum possible). Substituting the corresponding data into formula (1), we obtain L = 10730 keyword usages, which corresponds to approximately 2000 documents.

    The rank means of standard frequency dictionaries created for specialized scientific and technical text corpora also stabilize rather quickly (taking into account, of course, their larger lexical diversity as compared with that of ordinary keywords dictionaries). For the frequency dictionary on radio electronics by Alekseev, the 99% level of the rank mean is reached when the sample size reaches 191000 word usages. This is even slightly less than the standard level of 200000 word usages adopted by the All-Russian Research Group "Speech Statistics" (Alekseev, 1975).

    As for the dictionary size, all dictionaries reach its 99% level considerably more slowly. For the keywords dictionary "Ships of the Fishing Fleet" this level amounts to 11000 word usages, for the frequency dictionary on radio electronics to 38000000 word usages, and for the frequency dictionary of the English language to 22485000 word usages.
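
    The sample size at which a Weibull curve reaches a given share of its asymptote follows by inverting $y = k - k e^{-q L^{b}}$: setting $y = s k$ gives $L = (-\ln(1 - s) / q)^{1/b}$, independent of k. The sketch below applies this to two of the fitted formulas; the function name is hypothetical and L is in thousands of word usages, as in Table 3.

```python
import math


def sample_size_for_share(share, q, b):
    """Solve k - k * exp(-q * L**b) = share * k for L (the asymptote k cancels)."""
    if not 0.0 < share < 1.0:
        raise ValueError("share must lie strictly between 0 and 1")
    return (-math.log(1.0 - share) / q) ** (1.0 / b)


# Rank mean of the radio electronics dictionary (q = 0.0766, b = 0.780).
radio_rank_mean = sample_size_for_share(0.99, q=0.0766, b=0.780)
# Dictionary size of the English frequency dictionary (q = 0.00618, b = 0.660).
english_size = sample_size_for_share(0.99, q=0.00618, b=0.660)
print(round(radio_rank_mean), round(english_size))
```

    For these two cases the computed values agree with the figures of 191000 and 22485000 word usages quoted in the text.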

    Conclusion

    Our main results are the following:

    1) Theoretically, all keywords dictionary parameters have either upper or lower limits, which means that in principle they are statistically consistent. However, for most parameters actual consistency is achieved only at very large sample sizes, which are hardly attainable in ordinary NLP tasks.

    2) The most consistent parameters turned out to be (in decreasing order): the order coefficient (E/E_max), the logarithmic concentration coefficient (k = log r_ave / log N), entropy (E), the rank golden section (G_r), and the rank mean (r_ave) (differentiation coefficient). Together with some other parameters, they form a tool for describing the system characteristics of any thesaurus or keywords dictionary. Moreover, their rapid convergence to the limit values makes it possible to perform classification procedures effectively on data of arbitrary size.

    3) The proposed approximation model makes it possible to forecast the values of all keywords dictionary parameters for any sample size.

    4) The higher the thematic and functional specialization of the lexical units, the higher the consistency of their parameters. Thus, the statistical consistency of a keywords dictionary is higher than that of a specialized frequency dictionary (in our example, the radio electronics one). At the same time, the parameter consistency of both these dictionaries is higher than that of dictionaries describing a language as a whole (e.g., the frequency dictionary of the English language).

    References

    Alekseev, P.M. (1965). Frequency dictionary of English sublanguage of electronics. Abstract of Ph.D. thesis. Leningrad.

    Alekseev, P.M. (1975). Statistic Lexicography (Typology, compiling and applications of frequency dictionaries). Leningrad: Edition of Leningrad Pedagogical Institute.

    Evans, M., Hastings, N., & Peacock, B. (1993). Statistical Distributions. New York: Wiley.

    Grebennikov, A.O. (1998). To the problem of statistics consistency of the fiction frequency dictionary. Structural and Applied Linguistics. Issue 5. St. Petersburg: Edition of St. Petersburg State University. 110-112.

    Khajtun, S.D. (1983). Scientific measurement. Present Conditions and Perspectives. Moscow: Nauka.

    Kučera, H. & Francis, W.N. (1967). Computational analysis of present-day American English. Providence.

    Lange, O. (1964). Introduction into econometrics. Moscow: Progress.

    Martynenko, G.Y. & Fomin, S.V. (1989). Rank moments. Nauchno-technicheskaya informacija. Series 2. N 8. 9-14.

    Martynenko, G.Y. (1988). Osnovy stilemetrii (Fundamentals of Stylometrics). Leningrad: Edition of Leningrad State University.

    Martynenko, G.Y. (1989). About statistic characteristics of rank distributions. Quantitative linguistics and automatic text analysis. Tartu. 50-68.

    Shrejder, Y.A. & Sharov, A.A. (1982). Systems and Models. Moscow: Radio i sviaz.