Upload
l-egghe
View
216
Download
3
Embed Size (px)
Citation preview
Sampling and Concentration Values ofIncomplete Bibliographies
L. Egghe*LUC, Universitaire Campus, B-3590 Diepenbeek, Belgiumand UIA, Universiteitsplein 1, B-2610 Wilrijk, Belgium. E-mail: [email protected]
This article studies concentration aspects of bibliogra-phies. More, in particular, we study the impact of incom-pleteness of such a bibliography on its concentrationvalues (i.e., its degree of inequality of production of itssources). Incompleteness is modeled by sampling in thecomplete bibliography. The model is general enough tocomprise truncation of a bibliography as well as a sys-tematic sample on sources or items. In all cases weprove that the sampled bibliography (or incomplete one)has a higher concentration value than the complete one.These models, hence, shed some light on the measure-ment of production inequality in incomplete bibliogra-phies.
Introduction
Bibliographies [or its generalization: Informetric Produc-tion Processes (IPPs)] are formally described, for example,in Egghe (1989, 1990). A bibliography consists of a set ofsources (e.g., journals), a set of items (e.g., articles), and afunction that points out which items belong to which source(i.e., pointing out, for each article, in which journal it ispublished). Other interpretations of IPPs (even beyond theinformation sciences, e.g., in econometrics, biometrics, etc.)exist (see, e.g., Egghe and Rousseau, 1990), but we will notuse them here. We will henceforth use the terminology“bibliography.”
Typical for all bibliographies is the large inequality thatexists between the production of the sources. Intuitivelyspeaking, few sources have many items and many sourceshave few items. One talks in this connection also about“elitism” or “elitarism” of bibliographies—also found, forexample, in econometric models relating to richness andpoverty. In informetrics, the formal way to express inequal-ity is given by the classical informetric laws such as theones of Lotka, Zipf, Mandelbrot, Bradford, Leimkuhler, andso on (see, e.g., Egghe and Rousseau, 1990). The measure-
ment of inequality can be performed using concentrationtheory (see Egghe and Rousseau, 1990; Rousseau, 1992),but its invention goes back to the beginning of the 20thcentury (see, e.g., Gini, 1909; Lorenz, 1905). In its mostelementary way it goes as follows. We start from a vector
X � �x1, . . . , xN�
where, for each i � 1, . . ..,N, xi � 0 denotes the number ofitems in (more generally the production of) the ith source inthe bibliography. This sequence is ordered in a monotoneway—here we will order it decreasingly. We transform Xinto its corresponding vector of relative values:
AX � �a1, . . . , aN�
where
ai �xi
�j�1
N
xj
(1)
for every i � 1, . . . , N. The Lorenz curve LX of X is thenformed by linearly interconnecting the points in the unitsquare
�0, 0�, � 1
N, a1� , � 2
N, a1 � a2� ,
. . . , � i
N, �
j�1
i
aj� , . . . , �1, 1�
Note that indeed
�j�1
N
aj � 1.
* Permanent address
© 2002 Wiley Periodicals, Inc.Published online 10 January 2002 ● DOI: 10.1002/asi.10033
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 53(4):271–281, 2002
Because X is decreasing, the Lorenz curve LX is concavelyincreasing between (0,0) and (1,1). Let
Y � �y1, . . . , yN�
be a second vector as above. We say that Y represents amore concentrated situation than X if
LY � LX (2)
and where the higher concentration is strict if LY � LX atleast in some points of the graphs.
One can then proceed with this classical concentrationtheory by defining good concentration measures M as func-tions on such vectors X, Y such that LY � LX implies M(Y)� M(X). Classical examples of good concentration mea-sures are: V (the coefficient of variation), G (the Gini index),P (Pratt”s measure), Th (Theil’s measure) (see Egghe &Rousseau, 1990), but we will not be using these measureshere. It is well known that comparing the Lorenz curves isthe perfect solution to comparing inequality.
The above method applies to any bibliography: here in X� (x1, . . . , xN), each xi represents the number of articles inthe ith journal in the bibliography [where journals are or-dered in decreasing order of the number of articles theyhave (on a certain subject)]. In practice, however, we arefaced with the problem of incomplete bibliographies: wenever know the complete bibliography, for example, due tothe imperfectness of information retrieval machines. Thisincompleteness can also be interpreted in the followingnatural case: suppose we want to keep track of a bibliogra-phy in time. Then the cumulative set of journals and articlesup to, say 1999, is an incomplete version (or extract orsample) of the one up to, say 2000. Interpreted in this way,incomplete bibliographies have their application in thestudy of (even complete) bibliographies in function of time.Hence, the term “incomplete” does not only have to beinterpreted in its “smallest” interpretation, i.e., the one inwhich we only retrieve a part of what we want: the completebibliography.
An incomplete version of a bibliography can be inter-preted in several ways. In any way we will consider theincomplete version of a bibliography as a bibliography thatis obtained as a sample in the original one. Of course, thereare many ways to execute a sample. Because of the dualcharacter of bibliographies, at least two “major” types ofsamples are possible: a sample in the items and a sample inthe sources. We refer to Rousseau (1993) for a first attemptto study the effect of sampling on the concentration prop-erties of a bibliography, based on elaborated (theoretical)examples. In addition to these two types we can—in case weend up with a source with zero items (in case of an itemssample) or with a nonpicked source (in case of a sourcesample): (1) allow this source but with zero items (xi � 0),or (2) delete this source, i.e., considering as nonexistent.
These two approaches are considerably different. With-out going into the different sampling techniques discussedin the sequel, let us illustrate this by a simple example.Suppose X � (5,4,3,2,1) is our “complete” situaton; hence,a bibliography where we have one source with 5,4,3,2,1items, respectively. Deleting the items in the sources withone and two items leaves us two possibilities, as describedabove.
(1) The sampled bibliography is represented by the vector,denoted by s(X): s(X) � (5,4,3,0,0);
(2) The sampled bibliography is represented by the vector,denoted by �(X): �(X) � (5,4,3,).
These two cases are, conceptually, very different. De-noted in a general way, the difference between
k
�x1, . . . , xN�k, 0, . . . , 0�
and
�x1, . . . , xN�k�
(x1, . . . , xN�k � 0) can be the difference between (econo-metric interpretation).
(1) a group of N persons where the first N � k ones have agood salary, while the last k persons do not earn anymoney; and
(2) a group of N � k persons, all having a good salary.
In an informetric interpretation, we have the above dif-ference:
(1) we have a group of N researchers (in a scientific do-main) in which N � k of them are very productive (interms of number of publications) and in which k of themare not productive at all; and
(2) we have a group of N � k very productive researchers.
The above arguments lead to the following four samplingtypes (many more methods will be explained in the se-quel—we do not go into this now).
(1) sampling in items, keeping zero sources (or in otherterms: the number of sources (N) is fixed),
(2) sampling in items, deleting zero sources (here the num-ber of sources varies),
(3) sampling in sources, keeping not-selected sources aszero sources (again using a fixed number N of sources),
(4) sampling in sources, deleting not-selected sources(again here the number of sources varies).
The ultimate goal of studying the above samplingtypes—besides their proper theoretical interest—is to beable to conjecture some results concerning the concentrationof a real incomplete bibliography, for example, a retrieved
272 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—February 15, 2002
bibliography and to determine how it differs from the con-centration of a (unknown) bibliography.
The next section deals with sampling in items. There weprove, using a very general sampling method (to be dis-cussed there and comprising truncation of a bibliography aswell as systematic samples of the bibliography, called byRousseau, 1993, perfectly stratified sampling—see furtherfor exact definitions) that, if the number of sources is fixedand if we sample from the least productive sources to themost productive ones (the most important case as will beexplained there), the Lorenz curve of the sampled bibliog-raphy is always above the one of the complete bibliography.In all other cases (including the deletion of zero sources) weproduce counterexamples showing that the Lorenz curve ofthe sampled bibliography is not always above or below theone of the complete bibliography.
The third section deals with sampling in sources. Alsohere we prove that, if the number of sources is fixed and ifwe keep the most productive sources (see further for anexact definition), the Lorenz curve of the sampled bibliog-raphy is always above the one of the complete bibliography.
In summary, the most “natural” sampling types lead to anincrease of the inequality (in production of the sources) inthe sampled bibliography. Hence, this leads to a systematicoverestimation of the inequality (concentration) of the com-plete bibliography.
Sampling Items
We will first introduce two important item samplingmethods that will turn out to be two extreme cases of thegeneral sampling method that we will discuss in this section.
Systematic Sample
As always, a bibliography is represented by a productionvector
X � �x1, . . . , xN�
where xi���{0}, for all i � 1, . . . , N. We order X decreas-ingly and sample in the items, using the least productivesources first. Let ��Q� (the positive rational numbers) besuch that 0 � � � 1 and such that � is a multiple of
1
�j�1
N
xj
.
Otherwise stated,
�j�1
N
xj � N,
the total number of items, is a1
�multiple (this is used to
have no rounding-off errors that would disturb the results aswe will see in the sequel). A systematic sample startsreplacing xN by [�xN] ([x] denotes the largest entire number,smaller than or equal to x). This will be the Nth coordinates(xN) of the sampled vector, denoted by s(X). The “rest”1
�(�xN � [�xN]) is then added to xN�1 and, for the (N�1)th
coordinate of s(X) we take
s�xN�1� � ���xN�1 �1
���xN � ��xN���� (3)
The rest is added to xN�2 and so on. The notation is gettingrather complicated, but there is a simple way to express thecumulative number
�j�i
N
s�xj�
for every i � 1, . . . , N. It is nothing else than
�j�i
N
s�xj� � ��� �j�i
N
xj�� (4)
which makes life much more easy and allows us not to use(3) further on: (4) will suffice.
Another way to explain systematic sampling is as fol-
lows. Let � �1
n(n��). Replace each xj by a unit row vector
of length xj, so that the production vector (x1, . . . , xN) isreplaced by [(1, . . .,1)(1, . . .,1)(. . .)(1, . . .,1)]. Ignoring theinner brackets we essentially have a unit vector of length
�j�1
N xj. Choosing our1
nsample is then just a matter of going
along this vector (from right to left), ignoring the innerbrackets and highlight each nth entry. Then s(xj) is just thenumber of highlighted entries in the bracket correspondingto the jth source.
The same sort of construction can be described when �
�p
q(p,q��, p q). Here, the production vector (x1, . . .,xN)
is replaced by replacing each xj by a unit matrix with p rowsand xj columns so that overall we have a unit matrix with p
rows and �j�1
N xj columns. To execute thep
qsample we run
down the columns successively highlighting ever qth entryand then count up the number of highlighted entries withineach matrix.
Some examples will show the simplicity of the method.
(1) X � (4,1,1) � (x1,x2,x3) and � � 1
2. Because 1
2 x3 1
we take s(x3) � 0, and we shift x3 � 1 to the second and
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—February 15, 2002 273
add it. Now 2 �1
2� 1 yielding s(x2) � 1. Finally, 4 �
1
2� 2 yielding s(x1) � 2. Hence, s(X) � (2,1,0). Note that
�j�1
3 xj � 6 is a1
�� 2 multiple. In the description with
the unit vectors we rewrite X as [(1,1,1,1),(1),(1)], andhighlight every second 1 from the right. So we have[(1,1,1,1),(1),(1)], yielding s(X) � (2,1,0) again. In thesame way the following exercices can be executed.
(2) X � (4,2,1,1,1), � � 1
3. Note that �
j�1
5 xj � 9 � 3
multiple. Applying the same method now yields (verify)
s�X� � �2, 0, 1, 0, 0�
but because s(X) is not decreasing we define [order s(X)decreasingly]
s�X� � �2,1,0,0,0�
� can be any number in Q��[0,1] such that �i�j
N xi is a1
��
multiple and not just a number of the form � �1
n, n��.
Example:
(3) X � (1,1,1,1,1), � � 3
5. Note that �
j�1
5 xj � 5, a 5
3
multiple. Verify that
s�X� � �1,1,0,1,0�
and, hence, that
s�X� � �1,1,1,0,0�.
Systematic sampling is one of the most importantsampling methods, which is also applied in real life tohave a fast method that resembles (or approximates)random sampling (see, e.g., Carpenter & Storey Vasu,1978; Clarke & Cooke, 1992; Egghe & Rousseau, 1990,2001a). Only in cases of production units in factories,where a production error might occur in every nth objectsay, systematic sampling might give sampling results thatdiffer from random sampling. As described in the men-tioned references, there is very little chance that we havethis problem in sampling in bibliographies. In our inter-pretation we think it resembles the “making” of incom-plete bibliographies very well. Yet, systematic samplingwill be generalized in this section, where source-variable� will be allowed (see further).
Truncation
Let the bibliography be represented by X � (x1,. . . , xN),ordered decreasingly. Let i�{1, . . . , N}. The i-truncation ofthis bibliography is obtained by keeping the i most produc-tive sources and putting the sources on rank i � 1, . . . , N onzero production. Hence, the i-truncation of the bibliography
is represented by
N � i
s�X� � �x1, . . . , xi, 0, . . . , 0� (5)
The i-truncation can be considered as the bibliographyconsisting of the i “core” sources of the original bibliogra-phy (see Egghe & Rousseau, 2001b, for a treatment of coresof a bibliography).
General Model for Sampling in Items
The above item sampling methods can be generalized asfollows. Let X � (x1,. . . , xN) represent our bibliography.We suppose X to be decreasing. The philosophy of thisgeneral method is allowing for variable sample fractions �i
(i � 1, . . .,N) dependent on the source i. Most naturally werequire (�i)i�1,. . .,N to be decreasing (including a constantsequence): in this sampling method, items in low productivesources (high i) have a lower chance to be picked for thesample than items have in sources with low i (highlyproductive sources). In exact mathematical terms, s(X)� [s(x1), . . .,s(xN)], the representation of the sampled bib-liography, is obtained as follows:
s�xN� � ��NxN� (6)
of course sampling first in the low productive sources. Thedecimal rest, �NxN �[�NxN] is then added to the (N � 1)thsource (the same as shown earlier). Then
s�xN�1� � ��N�1xN�1 � �NxN � ��NxN�� (7)
and so on. The simplest way to describe this samplingmodel is as follows: for every i � 1, . . . , N
�j�i
N
s�xj� � � �j�i
N
�jxj� (8)
From this it also follows that, for every i � 1, . . . , N
s�xi� � � �j�i
N
�jxj � �j�i�1
N
s�xj�� (9)
because �j�i�1
N s(xj)��. Formula (9) is easily obtained and
certainly easier than deriving it from a complete inductionargument based on (6) and (7) (which, however, is alsopossible). The vector representing the sampled bibliographyis denoted s(X) � [s(x1), . . .,s(xN)]. Its decreasing version isdenoted s(X) � [s(x1), . . .,s(xN)]. Denote by LX, Ls(X),Ls(X), the Lorenz curves derived from X, s(X), s(X). Wehave the following general result.
274 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—February 15, 2002
Theorem I
If (�i)i�1,. . . .,N is decreasing and if �j�1
N�jxj��, then
Ls�x� � Ls�X� � LX (10)
Hence, the sampled bibliography is more concentrated thanthe original one.
Because the proof is a bit lengthy, we give it in Appendix1.
Note that the sampling methods described earlier are aspecial case of this
(1) Systematic sampling is obtained by taking
�1 � �2 � . . . � �N � � (11)
Note that in this case, the condition that �j�1
N xj must be an
1
�multiple is the same as the requirement in Theorem I
above:
�j�1
N
�jxj � � (12)
This is a very natural requirement where we only look atbibliographies in which “entire” items are sampled. Besides,below, we will give a counterexample to Theorem I in case(12) is not satisfied. So, because systematic sampling isincluded in Theorem I we have here also that the sampledbibliography is more concentrated than the original one.
(2) Truncation is obtained by taking
� �1 � . . . � �i � 1�i�1 � . . . � �N � 0 (13)
Note that here (12) is always valid and that s(X) � s(X).Because Theorem I applies to truncation, we have a gener-alization of the result obtained in Egghe and Rousseau(2001b), which has, as mentioned above, an application inthe determination of the core of a bibliography.
Example Showing That the Requirement (12) Cannot beDropped in Theorem I
Even in the case of a systematic sampling we can give a
counterexample. Take X � (3,1,1), � � 0.5, hence �j�1
3 xj is
not a1
�� 2 multiple. It is easily seen that s(X) � (1,1,0),
hence
AX � �3
5,
1
5,
1
5�As�X� � �1
2,
1
2, 0�
showing that LX and Ls(X) cross.If we sample items, using the most productive sources
first, Theorem I is not true.
If the Sampling Method is Done in the Reverse Way, i.e.,by Starting with the Most Productive Sources, ThenTheorem I Is Not True, Nor Do We Have an OppositeResult
This sampling method replaces (6)–(8) by
s�x1� � ��1x1�
s�x2� � ��2x2 � �1x1 � ��1x1��
�j�1
i
s�xj� � � �j�1
i
�jxj�Even in the PSS case (�1 � . . . � �N � �) the analog ofTheorem I, nor the opposite result (Ls(X) � LX) are generallytrue. Indeed, take X � (5,1,1,1,1), � � 1
3. Then this sampling
method yields s(X) � (1,1,0,0,1), and hence, s(X)� (1,1,1,0,0). But
AX � �5
9,
1
9,
1
9,
1
9,
1
9� ,
As�X� � �1
3,
1
3,
1
3, 0, 0�
showing that Ls(X) and LX cross, and hence also Ls(X) LX.
If the Sampling Method Is Applied But with Dropping theZero Sources, Then Theorem I Is Not True, Nor Do WeHave an Opposite Result
Take X � (12,4,2,1,1), � � 0.25. Then here s(X)� (3,1,1) and so
AX � �12
20,
4
20,
2
20,
1
20,
1
20� ,
As�X� � �3
5,
1
5,
1
5�It is trivial to see that Ls(X) LX. Of course, any examplewhere no zero sources occur yields Ls(X) � LX, by Theorem
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—February 15, 2002 275
I, because this sampling method and the one describedabove then coincide. The next is an example where Ls(X) andLX cross.
Take X � (3,3,2,2,1,1), � � 0.5. Then s(X) � (2,1,1,1,1),
AX � � 3
12,
3
12,
2
12,
2
12,
1
12,
1
12� ,
As�X� � �2
6,
1
6,
1
6,
1
6,
1
6� .
The equation of the line connecting (0,0) with �26, 612
� is y �32x. For x � 1
5this yields Lx�
15� �
310
, while Ls(X)(15)�2
6� 3
10.
But LX (25)�LX(2
6)�1
2�Ls(X)(
25). This shows that LX and Ls(X)
cross.This completes the study of all cases of sampling on
items. We now proceed with the case of sampling onsources.
Sampling Sources
Also here, the most important case of sampling insources, namely systematic sampling, will yield a result asin Theorem I, namely Ls(X) � LX. It is the case where wekeep the nonpicked sources as zero sources and where wesample in such order that the largest sources are kept in thesample. This will be described now.
Description of the Model of Sampling in Sources
Again, we represent the bibliography by X � (x1, . . .,xN)(decreasing), and we let � Q��]0,1[ such that is a
multiple of1
N. Otherwise stated, N is an
1
multiple. A
systematic sample in sources, giving priority to higher pro-ductive sources yields the bibliography represented by thevector
s�X� � �x1, . . . , x1/�1, 0,
x1/�1, . . . , x2/�1, 0, . . . , xN�1, 0� (14)
The decreasing order version of this vector is then
K
s�X� � �x1, . . . , x1/�1, x1/�1, . . . ,
x2/�1, . . . , . . . , xN�1, 0, . . . , 0� (15)
where
N �K
, K � �. (16)
We have the following result.
Theorem IIThe above sampling method in sources yields
Ls�X� � LX. (17)
Hence, the sampled bibliography is more concentrated thanthe original one.
Because the proof is rather technical we give it in Ap-pendix 2.
The Above Theorem Is Not Valid for the (Nondecreasing)s(X)
This is clear, because so many zeroes occur. An exampleTake X � (4,3,2,1), � 0.5. Then s(X) � (4,0,2,0). Hence
AX � � 4
10,
3
10,
2
10,
1
10�and
As�X� � �4
6, 0,
2
6, 0�
So
LX�1
4� �4
10� Ls�X��1
4� �4
6
but
LX�1
2� �7
10� Ls�X��1
2� �4
6.
The Above Theorem Is Not Valid for Not PerfectlyStratified Samples
Indeed, even the simplest case of replacing one source bya zero source, does not yield the result. Take X� (3,3,1,1,1,1) and s(X) � (3,1,1,1,1,0) (hence replacingone source with 3 items by a zero source). Now
AX � � 3
10,
3
10,
1
10,
1
10,
1
10,
1
10�and
As�X� � �3
7,
1
7,
1
7,
1
7,
1
7, 0�
Note that
LX�1
6� �3
10� Ls�X��1
6� �3
7
276 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—February 15, 2002
but
LX�2
6� �6
10� Ls�X��2
6� �4
7.
Note that LX and Ls(X) intersect twice because
LX�5
6� � Ls�X��5
6� � 1.
The Above Theorem Is Not Valid If We Sample in theReverse Way
Here, instead of replacing each
xi/�i � 1, . . . ,K�
by 0 as in (14) we replace each
XN�i/�1 by 0.
Example: X � (10,1,1,1), � 0.5, s(X) � (1,1,0,0). But
AX � �10
13,
1
13,
1
13,
1
13�As�X� � �1
2,
1
2, 0, 0�
Hence LX and Ls(X) cross.
The Above Theorem Is Not Valid If We Delete Sources(instead of Replacing Them by 0)
Indeed, take X � (4,3,2,1), � 0.5 (and sample as in thetheorem, except that we delete nonpicked sources). Heres(X) � (4,2) and, hence
AX � � 4
10,
3
10,
2
10,
1
10�As�X� � �4
6,
2
6�Hence
LX�1
2� �7
10� Ls�X��1
2� �4
6.
Also, an opposite example exists: X � (10,1,1,1), � 0.5.Then s(X) � (10,1), hence
AX � �10
13,
1
13,
1
13,
1
13�
and
As�X� � �10
11, 1�
So LX Ls(X) here.
Conclusions
In this article we studied different sampling techniques inbibliographies, comprising sampling in items and samplingin sources.
When sampling in items we allow the probability for anitem to be picked to be increasing with the number of itemsin the sources. In this general setting we prove that, if westart sampling in the least productive sources (hereby keep-ing zero sources), the Lorenz curve of the sampled bibliog-raphy is above the Lorenz curve of the original one. In otherwords, the sampled bibliography is more concentrated thanthe original one. The model and result applies to the case ofsystematic sampling (often used as an approximation forrandom sampling) as well as to truncation of bibliographies[i.e., only using the “core” of the bibliography (see Egghe &Rousseau, (2001b)].
When sampling in sources, a similar result is proven.Here we show that a systematic sample in sources, keepingthe most productive sources and replacing nonpickedsources by a zero source, yields a sampled bibliography forwhich the Lorenz curve is above the one of the originalbibliography.
We also show that all variants of the above methods(reversing the order, deleting zero sources, etc.) do not yieldsuch (or another) result.
So, in the two major sampling methods, we have that thesampled bibliography is more concentrated than the originalone. This gives information about the concentration of bib-liographies (as we receive them, e.g., as the result of an IRaction) compared to the (unknown) complete one. In allcases we can say that the observed concentration is higherthan the concentration of the complete bibliography.
Note
As remarked by one of the referees, the requirement that
�j�1
N
�jxj � �
(in the case of sampling in items Theorem I) and the
requirement that N is an1
multiple (in the case of sampling
in sources—see earlier) is not always valid in practice. Verysimply, if N is prime, the sampling in sources is not evenpossible.
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—February 15, 2002 277
What is behind these requirements is that (1) for exactresults, we need them (because we show by example thatwithout them the results are wrong); (2) for large N (whichis always the case) one can drop the requirements, herebyonly making a mistake in the last item (or source) sampled.This yields an alteration of the Lorenz curve which, how-ever, goes to zero for N3 �. We estimate that in this case(large N) the found increase in concentration will be there.In short, we indicate that, in practical cases, concentrationincreases.
Problem
We leave it as an open problem (for further study) tomake explicite calculations of the difference of concentra-tion between a bibliography and its sampled version. Itwould yield information on the concentration values of acomplete (unknown) bibliography.
Acknowledgment
The author is grateful to Dr. R. Rousseau for interestingdiscussions on the topic of this article.
References
Carpenter, R.L., & Storey Vasu, E. (1978). Statistical methods for librar-ians. American Library Association, Chicago, USA.
Clarke, G.M., & Cooke, D. (1992). A basic course in statistics (3rd ed.).London: Edward Arnold.
Egghe, L. (1989). The duality of informetric systems with applications tothe empirical Laws. Ph. D. Thesis. The City University, London, UK.
Egghe, L. (1990). The duality of informetric systems with applications tothe empirical laws. Journal of Information Science, 16(1), 17–27.
Egghe, L., & Rousseau, R. (1990). Introduction to informetrics. Quantita-tive methods in library, documentation and information science. Am-sterdam: Elsevier.
Egghe, L., & Rousseau, R. (2001a). Elementary statistics for effectivelibrary and Information service management. London: Aslib.
Egghe, L., & Rousseau, R. (2001b). The core of a scientific subject: Anexact definition using concentration theory and fuzzy set theory. In:Proceedings of the 8th International Conference on Scientometrics andInformetrics. Volume 1, Syndney, Australia. The University of NewSouth Wales, 2001, 147–156.
Gini, C. (1909). Il diverso accrescimento delle classi sociali e la concen-trazione della richezza. Giornale degli Economisti, 11, 37.
Lorenz, M.O. (1905). Methods of measuring concentration of wealth.Journal of the American Statistical Association, 9, 209–219.
Rousseau, R. (1992). Concentration and diversity measures in informetricresearch. Doctorate Dissertation, University of Antwerp.
Rousseau, R. (1993). Measuring concentration: Sampling design issues, asillustrated by the case of perfectly stratified samples. Scientometrics,28(1), 3–14.
Appendix 1
Proof of Theorem I
That Ls(X) � Ls(X) is trivial because s(X) is the decreas-ing reorder of s(X) and, hence, for every i � 1, . . . , N
�j�1
i
s�xj� � �j�1
i
s�xj� (A1)
and note that
�j�1
N
s�xj� � �j�1
N
s�xj�.
Hence, it suffices to prove that Ls(X) � LX. Denote by T thetotal number of items,
�j�1
N
xj.
LX is obtained by linearly connecting the points
�0, 0�, � 1
N,
x1
T � , . . . , � i
N,
�j�1
i
xj
T � , . . . , �1, 1� (A2)
Ls(X) is obtained by linearly connecting the points
�0, 0�, �1
N,
s�x1�
�j�1
N
�jxj� , . . . , � i
N,
�j�1
i
s�xj�
�j�1
N
�jxj� , . . . , �1, 1�
(A3)
Now
�j�1
i
s�xj�
�j�1
N
�jxj
�
�j�1
N
s�xj� � �j�i�1
N
s�xj�
�j�1
N
�jxj
�
�j�1
N
�jxj � � �j�i�1
N
�jxj��j�1
N
�jxj
So, if the above number is larger than or equal to
�j�1
i
xj
T
for all i, we have shown that Ls(X) � LX. We can exclude thecase xi�1�. . .�xN � 0 because then the assertion is trivial.We have to show that
278 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—February 15, 2002
�j�1
i
�jxj � �j�i�1
N
�jxj
�j�1
i
xj � �j�i�1
N
xj
�
�i�1
N
�jxj
T�
� �j�i�1
N
�jxj��
j�i�1
N
xj
Because for all a,b,c,d���
a
b�
c
df
a � c
b � d�
c
d�
�c�
d
it suffices to show that
�j�1
i
�jxj
�j�1
i
xj
�
�j�i�1
N
�jxj
�j�i�1
N
xj
(A4)
From the lemma below we have that
�j�1
i
�jxj
�j�1
i
xj
� i � ��i, �1� (A5)
unless x1 � . . . � xi � 0 but then, because X decreases, Xis the zero vector, which we exclude and (applied to �i�1,. . . ,�N,xi�1, . . .,xN)
�j�i�1
N
�jxj
�j�i�1
N
xj
� �i � ��N, �i�1� (A6)
(because xi�1 � . . . � xN � 0 is excluded).Because �i�1 � �i, for all i, we have that (A5) and (A6)
imply (A4), completing the proof of the theorem.
Lemma
�j�1
i
�jxj � i� �j�1
i
xj� (A7)
where i � [�i,�1].
Proof
Because all �j � [0,1] we have that
�j�1
i
�jxj � �0, �j�1
i
xj�and hence,
�j�1
i
�jxj � i� �j�1
i
xj� (A8)
for a certain i � [0,1]. Suppose now that i �i. Hence, i
�j for all j � 1, . . .,i, and hence,
i� �j�1
i
xj� � �j�1
i
ixj � �j�1
i
�jxj
unless x1 � . . . � xi � 0 in which case (A7) is trivial forany i. Hence, (A9) contradicts (A8). In the same way,suppose i � �1. Hence, i � �j for all j � 1, . . . ,N, hencefor all j � 1, . . . ,i. So,
i��j�1
i
xj� � �j�1
i
ixj � �j�1
i
�jxj (A9)
contradicting (A8) again. Consequently, i � [�i,�1].
Appendix 2
Proof of Theorem II
We have to compare the Lorenz curves LX of X � (x1, . . .,xN) and
K
s�X� � �x1, . . . , x1/�1, x1/�1, . . . , x2/�1, x2/�1, . . . ,
x�K�1�/�1, . . . , x�K/��1, 0, . . . , 0� (A10)
To prove that Ls(X) � LX we have to show (denote M�1
)
that
x1 � . . . � xM�1 � xM�1 � . . . � x2M�1 � . . .� xiM�1 � xiM�1 � . . . � xiM�j
T � ���1
K
x�M
�x1 � . . . � xiM�j�i
T(A11)
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—February 15, 2002 279
where i � 1, . . .,K�1 ; j � 1,. . . .,M � 1 (note that j � M� 1 denotes the case where we have x(i�1)M�1 as last termin the nominator of the left-hand side and note that the casei � 0 yields (A11) trivially, for all j � 1, . . . ,M�1). (A11)gives
x1T � . . . � xM�1T � xM�1T � . . . � x2M�1T � . . .
� xiM�1T � xiM�1T � . . . � xiM�jT � x1T � . . .
� xiM�j�iT � xM�x1 � . . . � xiM�j�i� � x2M�x1 � . . .
� xiM�j�i� � . . . � xKM�x1 � . . . � xiM�j�i�
where KM � N.In the right-hand side, among the sum x1T� . . .
�xiM�j�iT there are exactly
� � iM � j � i
M �multiples of M, which remain after cancellation with theleft-hand side. This yields the equivalent condition
xiM�j�i�1T � . . . � xiM�jT � xMT � x2MT � . . . � x MT
� xM�x1 � . . . � xiM�j�i� � x2M�x1 � . . . � xiM�j�i� � . . .
� xKM�x1 � . . . � xiM�j�i�
Hence
xiM�j�i�1T � . . . � xiM�jT � x� �1�M�x1 � . . . � xiM�j�i�
� x� �2�M�x1 � . . . � xiM�j�i� � . . . � xKM�x1 � . . .
� xiM�j�i� � xM�xiM�j�i�1 � . . . � xN� � x2M�xiM�j�i�1
� . . . � xN� � . . . � x M�xiM�j�i�1 � . . . � xN�
� � �p�1
xpM�� �t�iM�j�i�1
iM�j
xt � �s�iM�j�1
N
xs� (A12)
Hence, we have the condition
� �t�iM�j�i�1
iM�j
xt��T � �p�1
xpM� � � �q� �1
K
xqM�� �r�1
iM�j�i
xr�� � �
p�1
xpM�� �s�iM�j�1
N
xs� (A13)
Note that
T � �p�1
xpM � �M � 1� �p�1
xpM
because X decreases, and, again because X decreases,
�t�iM�j�i�1
iM�j
xt � �m�iM�j�1
iM�j�i
xm � �m�iM�j�i�1
iM�j�2i
xm � . . .
� �m�iM�j��M�2�i�1
iM�j��M�1�i
xm
Hence
� �t�iM�j�i�1
iM�j
xt��T � �p�1
xpM� � � �p�1
xpM�� �m�iM�j�1
iM�j�i
xm
� �m�iM�j�i�1
iM�j�2i
xm � . . . � �m�iM�j��M�2�i�1
iM�j��M�1�i
xm�(M � 1 terms in the last factor).
� � �p�1
xpM�� �m�iM�j�1
iM�j��M�1�i
xm�Hence, it suffices to show, by (A13), that
� �p�1
xpM�� �m�iM�j�1
iM�j��M�1�i
xm� � � �q� �1
K
xqM�� �r�1
iM�j�i
xr�� � �
p�1
xpM�� �s�iM�j�1
N
xs�which reduces to
� �q� �1
K
xqM�� �r�1
iM�j�i
xr� � � �p�1
xpM�� �s�iM�j��M�1�i�1
N
xs�(A14)
or
�p�1
xpM
�q� �1
K
xqM
�
�r�1
iM�j�i
xr
�s�2Mi�j�i�1
N
xs
(A15)
280 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—February 15, 2002
But for all p � 1, . . . ,
xpM �1
M�x�p�1�M�1 � . . . � xpM�
and for all q � � 1, . . . , K � 1
xqM �1
M�xqM�1 � . . . � x�q�1�M�
because X decreases, yielding, because N � KM,
�p�1
xpM
�q� �1
K
xqM
�
1
M�x1 � . . . � x M�
1
M�x� �1�M�1 � . . . � xN� � xN
�x1 � . . . � x M
x� �1�M�1 � . . . � xN�
x1 � . . . � xiM�j�i
x� �1�M�1 � . . . � xN(A16)
because M � [iM�j�i] � iM � j � i.Furthermore,
� � 1�M � 1 � M � M � 1 � iM
� j � i � M � 1 � 2iM � j � i � 1
because i � 1. Consequently, (A16) gives
�p�1
xpM
�q� �1
K
xqM
�
�r�1
iM�j�i
xr
�s�2iM�j�i�1
N
xs
which is (A15) and, hence, (A11). Consequently, Ls(X)
� LX in all cases.
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—February 15, 2002 281