

Sampling and Concentration Values of Incomplete Bibliographies

L. Egghe*
LUC, Universitaire Campus, B-3590 Diepenbeek, Belgium and UIA, Universiteitsplein 1, B-2610 Wilrijk, Belgium. E-mail: [email protected]

This article studies concentration aspects of bibliographies. More particularly, we study the impact of incompleteness of such a bibliography on its concentration values (i.e., the degree of inequality of production of its sources). Incompleteness is modeled by sampling in the complete bibliography. The model is general enough to comprise truncation of a bibliography as well as a systematic sample on sources or items. In all cases we prove that the sampled (or incomplete) bibliography has a higher concentration value than the complete one. These models, hence, shed some light on the measurement of production inequality in incomplete bibliographies.

Introduction

Bibliographies [or their generalization: Informetric Production Processes (IPPs)] are formally described, for example, in Egghe (1989, 1990). A bibliography consists of a set of sources (e.g., journals), a set of items (e.g., articles), and a function that points out which items belong to which source (i.e., pointing out, for each article, in which journal it is published). Other interpretations of IPPs (even beyond the information sciences, e.g., in econometrics, biometrics, etc.) exist (see, e.g., Egghe and Rousseau, 1990), but we will not use them here. We will henceforth use the terminology "bibliography."

Typical for all bibliographies is the large inequality that exists between the production of the sources. Intuitively speaking, few sources have many items and many sources have few items. One talks in this connection also about "elitism" or "elitarism" of bibliographies, which is also found, for example, in econometric models relating to richness and poverty. In informetrics, the formal way to express inequality is given by the classical informetric laws such as the ones of Lotka, Zipf, Mandelbrot, Bradford, Leimkuhler, and so on (see, e.g., Egghe and Rousseau, 1990). The measurement of inequality can be performed using concentration theory (see Egghe and Rousseau, 1990; Rousseau, 1992), but its invention goes back to the beginning of the 20th century (see, e.g., Gini, 1909; Lorenz, 1905). In its most elementary form it goes as follows. We start from a vector

$$X = (x_1, \ldots, x_N)$$

where, for each $i = 1, \ldots, N$, $x_i \ge 0$ denotes the number of items in (more generally, the production of) the $i$th source in the bibliography. This sequence is ordered in a monotone way; here we will order it decreasingly. We transform $X$ into its corresponding vector of relative values:

$$A_X = (a_1, \ldots, a_N)$$

where

$$a_i = \frac{x_i}{\sum_{j=1}^{N} x_j} \tag{1}$$

for every $i = 1, \ldots, N$. The Lorenz curve $L_X$ of $X$ is then formed by linearly interconnecting the points in the unit square

$$(0, 0),\ \left(\frac{1}{N}, a_1\right),\ \left(\frac{2}{N}, a_1 + a_2\right),\ \ldots,\ \left(\frac{i}{N}, \sum_{j=1}^{i} a_j\right),\ \ldots,\ (1, 1)$$

Note that indeed

$$\sum_{j=1}^{N} a_j = 1.$$
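The construction above can be sketched in a few lines of Python. This is only an illustration under our own naming (the function `lorenz_points` and the sample vector are not part of the article); exact fractions are used so that the vertices match the definition in (1) without rounding.

```python
from fractions import Fraction

def lorenz_points(x):
    """Return the vertices (i/N, a_1 + ... + a_i) of the Lorenz curve of x.

    x is a production vector (ordered decreasingly in the article's setting)
    with a positive total.
    """
    total = sum(x)
    n = len(x)
    points = [(Fraction(0), Fraction(0))]
    running = 0
    for i, xi in enumerate(x, start=1):
        running += xi
        points.append((Fraction(i, n), Fraction(running, total)))
    return points

# Illustrative production vector (hypothetical data).
X = [5, 4, 3, 2, 1]
for abscissa, ordinate in lorenz_points(X):
    print(abscissa, ordinate)
```

The Lorenz curve itself is the polygonal line through these vertices.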

* Permanent address

© 2002 Wiley Periodicals, Inc. Published online 10 January 2002 ● DOI: 10.1002/asi.10033

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 53(4):271–281, 2002


Because $X$ is decreasing, the Lorenz curve $L_X$ is concavely increasing between (0,0) and (1,1). Let

$$Y = (y_1, \ldots, y_N)$$

be a second vector as above. We say that $Y$ represents a more concentrated situation than $X$ if

$$L_Y \ge L_X \tag{2}$$

where the higher concentration is strict if $L_Y > L_X$ at least in some points of the graphs.

One can then proceed with this classical concentration theory by defining good concentration measures $M$ as functions on such vectors $X$, $Y$ such that $L_Y \ge L_X$ implies $M(Y) \ge M(X)$. Classical examples of good concentration measures are: $V$ (the coefficient of variation), $G$ (the Gini index), $P$ (Pratt's measure), $Th$ (Theil's measure) (see Egghe & Rousseau, 1990), but we will not be using these measures here. It is well known that comparing the Lorenz curves is the perfect solution to comparing inequality.
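For two vectors with the same number of sources $N$, both Lorenz curves are polygonal lines with the same abscissae $i/N$, so comparing them at these vertices is enough. The following sketch illustrates this; the function names and the example vectors are our own illustrative choices, not taken from the article.

```python
from fractions import Fraction
from itertools import accumulate

def lorenz_ordinates(x):
    """Ordinates of the Lorenz curve of x at the abscissae 1/N, 2/N, ..., 1."""
    total = sum(x)
    return [Fraction(partial, total) for partial in accumulate(x)]

def compare_lorenz(y, x):
    """Compare L_Y with L_X for two vectors of the same length N."""
    if len(x) != len(y):
        raise ValueError("both vectors must have the same number of sources")
    ly, lx = lorenz_ordinates(y), lorenz_ordinates(x)
    y_above = all(a >= b for a, b in zip(ly, lx))
    x_above = all(b >= a for a, b in zip(ly, lx))
    if y_above and x_above:
        return "equal"
    if y_above:
        return "Y more concentrated"
    if x_above:
        return "X more concentrated"
    return "curves cross"

# Hypothetical example vectors, ordered decreasingly.
print(compare_lorenz([6, 2, 1, 1], [4, 3, 2, 1]))
print(compare_lorenz([4, 4, 1, 1], [5, 2, 2, 1]))
```

When the curves cross, neither vector can be called more concentrated in the sense of (2).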

The above method applies to any bibliography: here, in $X = (x_1, \ldots, x_N)$, each $x_i$ represents the number of articles in the $i$th journal in the bibliography [where journals are ordered in decreasing order of the number of articles they have (on a certain subject)]. In practice, however, we are faced with the problem of incomplete bibliographies: we never know the complete bibliography, for example, due to the imperfection of information retrieval systems. This incompleteness can also be interpreted in the following natural case: suppose we want to keep track of a bibliography in time. Then the cumulative set of journals and articles up to, say, 1999 is an incomplete version (or extract or sample) of the one up to, say, 2000. Interpreted in this way, incomplete bibliographies have their application in the study of (even complete) bibliographies as a function of time. Hence, the term "incomplete" does not only have to be interpreted in its narrowest sense, i.e., the one in which we only retrieve a part of what we want: the complete bibliography.

An incomplete version of a bibliography can be interpreted in several ways. In every case we will consider the incomplete version of a bibliography as a bibliography that is obtained as a sample in the original one. Of course, there are many ways to execute a sample. Because of the dual character of bibliographies, at least two "major" types of samples are possible: a sample in the items and a sample in the sources. We refer to Rousseau (1993) for a first attempt to study the effect of sampling on the concentration properties of a bibliography, based on elaborated (theoretical) examples. In addition to these two types, in case we end up with a source with zero items (in case of an items sample) or with a nonpicked source (in case of a source sample), we can: (1) allow this source but with zero items ($x_i = 0$), or (2) delete this source, i.e., consider it as nonexistent.

These two approaches are considerably different. Without going into the different sampling techniques discussed in the sequel, let us illustrate this by a simple example. Suppose $X = (5,4,3,2,1)$ is our "complete" situation; hence, a bibliography with five sources having 5, 4, 3, 2, and 1 items, respectively. Deleting the items in the sources with one and two items leaves us two possibilities, as described above.

(1) The sampled bibliography is represented by the vector, denoted by $s(X)$: $s(X) = (5,4,3,0,0)$;

(2) The sampled bibliography is represented by the vector, denoted by $\sigma(X)$: $\sigma(X) = (5,4,3)$.

These two cases are, conceptually, very different. Denoted in a general way, the difference between

$$(x_1, \ldots, x_{N-k}, \underbrace{0, \ldots, 0}_{k})$$

and

$$(x_1, \ldots, x_{N-k})$$

($x_1, \ldots, x_{N-k} > 0$) can be the difference between (econometric interpretation):

(1) a group of $N$ persons where the first $N - k$ ones have a good salary, while the last $k$ persons do not earn any money; and

(2) a group of $N - k$ persons, all having a good salary.

In an informetric interpretation, we have the above difference:

(1) we have a group of $N$ researchers (in a scientific domain) in which $N - k$ of them are very productive (in terms of number of publications) and in which $k$ of them are not productive at all; and

(2) we have a group of $N - k$ very productive researchers.

The above arguments lead to the following four sampling types (many more methods will be explained in the sequel; we do not go into this now).

(1) sampling in items, keeping zero sources (or in other terms: the number of sources ($N$) is fixed),

(2) sampling in items, deleting zero sources (here the number of sources varies),

(3) sampling in sources, keeping not-selected sources as zero sources (again using a fixed number $N$ of sources),

(4) sampling in sources, deleting not-selected sources (again here the number of sources varies).

The ultimate goal of studying the above sampling types, besides their proper theoretical interest, is to be able to conjecture some results concerning the concentration of a real incomplete bibliography (for example, a retrieved bibliography) and to determine how it differs from the concentration of the (unknown) complete bibliography.

The next section deals with sampling in items. There we prove, using a very general sampling method (to be discussed there and comprising truncation of a bibliography as well as systematic samples of the bibliography, called perfectly stratified sampling by Rousseau, 1993; see further for exact definitions), that, if the number of sources is fixed and if we sample from the least productive sources to the most productive ones (the most important case, as will be explained there), the Lorenz curve of the sampled bibliography is always above the one of the complete bibliography. In all other cases (including the deletion of zero sources) we produce counterexamples showing that the Lorenz curve of the sampled bibliography is not always above or below the one of the complete bibliography.

The third section deals with sampling in sources. Also here we prove that, if the number of sources is fixed and if we keep the most productive sources (see further for an exact definition), the Lorenz curve of the sampled bibliography is always above the one of the complete bibliography.

In summary, the most "natural" sampling types lead to an increase of the inequality (in production of the sources) in the sampled bibliography. Hence, this leads to a systematic overestimation of the inequality (concentration) of the complete bibliography.

Sampling Items

We will first introduce two important item sampling methods that will turn out to be two extreme cases of the general sampling method that we will discuss in this section.

Systematic Sample

As always, a bibliography is represented by a production vector

$$X = (x_1, \ldots, x_N)$$

where $x_i \in \mathbb{N} \cup \{0\}$ for all $i = 1, \ldots, N$. We order $X$ decreasingly and sample in the items, using the least productive sources first. Let $\theta \in \mathbb{Q}^+$ (the positive rational numbers) be such that $0 \le \theta \le 1$ and such that $\theta$ is a multiple of

$$\frac{1}{\sum_{j=1}^{N} x_j}.$$

Otherwise stated, $\sum_{j=1}^{N} x_j \in \mathbb{N}$, the total number of items, is a $\frac{1}{\theta}$ multiple (this is used to have no rounding-off errors that would disturb the results, as we will see in the sequel). A systematic sample starts by replacing $x_N$ by $[\theta x_N]$ ($[x]$ denotes the largest integer smaller than or equal to $x$). This will be the $N$th coordinate $s(x_N)$ of the sampled vector, denoted by $s(X)$. The "rest" $\frac{1}{\theta}(\theta x_N - [\theta x_N])$ is then added to $x_{N-1}$ and, for the $(N-1)$th coordinate of $s(X)$, we take

$$s(x_{N-1}) = \left[\theta\left(x_{N-1} + \frac{1}{\theta}\left(\theta x_N - [\theta x_N]\right)\right)\right] \tag{3}$$

The rest is added to $x_{N-2}$ and so on. The notation is getting rather complicated, but there is a simple way to express the cumulative number

$$\sum_{j=i}^{N} s(x_j)$$

for every $i = 1, \ldots, N$. It is nothing else than

$$\sum_{j=i}^{N} s(x_j) = \left[\theta\left(\sum_{j=i}^{N} x_j\right)\right] \tag{4}$$

which makes life much easier and allows us not to use (3) further on: (4) will suffice.

Another way to explain systematic sampling is as follows. Let $\theta = \frac{1}{n}$ ($n \in \mathbb{N}$). Replace each $x_j$ by a row vector of ones of length $x_j$, so that the production vector $(x_1, \ldots, x_N)$ is replaced by $[(1, \ldots, 1)(1, \ldots, 1)(\ldots)(1, \ldots, 1)]$. Ignoring the inner brackets we essentially have a vector of ones of length $\sum_{j=1}^{N} x_j$. Choosing our $\frac{1}{n}$ sample is then just a matter of going along this vector (from right to left), ignoring the inner brackets, and highlighting each $n$th entry. Then $s(x_j)$ is just the number of highlighted entries in the bracket corresponding to the $j$th source.

The same sort of construction can be described when $\theta = \frac{p}{q}$ ($p, q \in \mathbb{N}$, $p < q$). Here, the production vector $(x_1, \ldots, x_N)$ is replaced by replacing each $x_j$ by a matrix of ones with $p$ rows and $x_j$ columns, so that overall we have a matrix of ones with $p$ rows and $\sum_{j=1}^{N} x_j$ columns. To execute the $\frac{p}{q}$ sample we run down the columns successively, highlighting every $q$th entry, and then count up the number of highlighted entries within each matrix.
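The highlighting construction can be simulated directly. The sketch below is illustrative only: the function name is ours, and we assume that the columns are traversed from the rightmost (least productive) source to the leftmost one, as in the $\frac{1}{n}$ description.

```python
def highlight_sample(x, p, q):
    """Systematic p/q sample by the 'highlight every q-th entry' construction.

    Each source x[j] is thought of as a block of ones with p rows and x[j]
    columns. Assumption: blocks are traversed from right (least productive)
    to left, column by column, and every q-th visited entry is highlighted.
    """
    counts = [0] * len(x)
    position = 0
    for j in range(len(x) - 1, -1, -1):
        for _column in range(x[j]):
            for _row in range(p):
                position += 1
                if position % q == 0:
                    counts[j] += 1  # one highlighted entry in source j
    return counts

print(highlight_sample([4, 1, 1], 1, 2))
print(highlight_sample([1, 1, 1, 1, 1], 3, 5))
```

With $p = 1$ the blocks reduce to the row vectors of the first description.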

Some examples will show the simplicity of the method.

(1) $X = (4,1,1) = (x_1, x_2, x_3)$ and $\theta = \frac{1}{2}$. Because $\frac{1}{2} x_3 < 1$ we take $s(x_3) = 0$, and we shift $x_3 = 1$ to the second source and add it. Now $2 \cdot \frac{1}{2} = 1$, yielding $s(x_2) = 1$. Finally, $4 \cdot \frac{1}{2} = 2$, yielding $s(x_1) = 2$. Hence, $s(X) = (2,1,0)$. Note that $\sum_{j=1}^{3} x_j = 6$ is a $\frac{1}{\theta} = 2$ multiple. In the description with the vectors of ones we rewrite $X$ as $[(1,1,1,1),(1),(1)]$ and highlight every second 1 from the right. So we have $[(\underline{1},1,\underline{1},1),(\underline{1}),(1)]$, yielding $s(X) = (2,1,0)$ again. In the same way the following exercises can be executed.

(2) $X = (4,2,1,1,1)$, $\theta = \frac{1}{3}$. Note that $\sum_{j=1}^{5} x_j = 9$ is a $\frac{1}{\theta} = 3$ multiple. Applying the same method now yields (verify)

$$s(X) = (2, 0, 1, 0, 0)$$

but because $s(X)$ is not decreasing we define [order $s(X)$ decreasingly]

$$\bar{s}(X) = (2, 1, 0, 0, 0)$$

$\theta$ can be any number in $\mathbb{Q}^+ \cap [0,1]$ such that $\sum_{i=1}^{N} x_i$ is a $\frac{1}{\theta}$ multiple, and not just a number of the form $\theta = \frac{1}{n}$, $n \in \mathbb{N}$. Example:

(3) $X = (1,1,1,1,1)$, $\theta = \frac{3}{5}$. Note that $\sum_{j=1}^{5} x_j = 5$ is a $\frac{5}{3}$ multiple. Verify that

$$s(X) = (1, 1, 0, 1, 0)$$

and, hence, that

$$\bar{s}(X) = (1, 1, 1, 0, 0).$$
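The three examples can be checked mechanically with formula (4): the tail sums of the sample are the integer parts of $\theta$ times the tail sums of $X$. The following is a minimal sketch under our own naming (`systematic_item_sample` is a hypothetical helper); exact rational arithmetic avoids floating-point rounding in the integer part.

```python
from fractions import Fraction
from math import floor

def systematic_item_sample(x, theta):
    """Compute s(X) from formula (4), working from the last source upward."""
    theta = Fraction(theta)
    sample = [0] * len(x)
    tail_x = 0  # x_i + ... + x_N
    tail_s = 0  # s(x_i) + ... + s(x_N)
    for i in range(len(x) - 1, -1, -1):
        tail_x += x[i]
        new_tail_s = floor(theta * tail_x)  # right-hand side of (4)
        sample[i] = new_tail_s - tail_s
        tail_s = new_tail_s
    return sample

s = systematic_item_sample([4, 2, 1, 1, 1], Fraction(1, 3))
print(s)                        # sample in the original source order
print(sorted(s, reverse=True))  # its decreasingly ordered version
```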

Systematic sampling is one of the most important sampling methods, which is also applied in real life to have a fast method that resembles (or approximates) random sampling (see, e.g., Carpenter & Storey Vasu, 1978; Clarke & Cooke, 1992; Egghe & Rousseau, 1990, 2001a). Only in cases of production units in factories, where a production error might occur in every $n$th object, say, systematic sampling might give sampling results that differ from random sampling. As described in the mentioned references, there is very little chance that we have this problem in sampling in bibliographies. In our interpretation we think it resembles the "making" of incomplete bibliographies very well. Yet, systematic sampling will be generalized in this section, where a source-variable $\theta$ will be allowed (see further).

Truncation

Let the bibliography be represented by $X = (x_1, \ldots, x_N)$, ordered decreasingly. Let $i \in \{1, \ldots, N\}$. The $i$-truncation of this bibliography is obtained by keeping the $i$ most productive sources and putting the sources on rank $i + 1, \ldots, N$ on zero production. Hence, the $i$-truncation of the bibliography is represented by

$$s(X) = (x_1, \ldots, x_i, \underbrace{0, \ldots, 0}_{N-i}) \tag{5}$$

The $i$-truncation can be considered as the bibliography consisting of the $i$ "core" sources of the original bibliography (see Egghe & Rousseau, 2001b, for a treatment of cores of a bibliography).

General Model for Sampling in Items

The above item sampling methods can be generalized as follows. Let $X = (x_1, \ldots, x_N)$ represent our bibliography. We suppose $X$ to be decreasing. The philosophy of this general method is to allow for variable sample fractions $\theta_i$ ($i = 1, \ldots, N$) dependent on the source $i$. Most naturally we require $(\theta_i)_{i=1,\ldots,N}$ to be decreasing (including a constant sequence): in this sampling method, items in low productive sources (high $i$) have a lower chance to be picked for the sample than items in sources with low $i$ (highly productive sources). In exact mathematical terms, $s(X) = [s(x_1), \ldots, s(x_N)]$, the representation of the sampled bibliography, is obtained as follows:

$$s(x_N) = [\theta_N x_N] \tag{6}$$

of course sampling first in the low productive sources. The decimal rest, $\theta_N x_N - [\theta_N x_N]$, is then added to the $(N-1)$th source (the same as shown earlier). Then

$$s(x_{N-1}) = \left[\theta_{N-1} x_{N-1} + \theta_N x_N - [\theta_N x_N]\right] \tag{7}$$

and so on. The simplest way to describe this sampling model is as follows: for every $i = 1, \ldots, N$

$$\sum_{j=i}^{N} s(x_j) = \left[\sum_{j=i}^{N} \theta_j x_j\right] \tag{8}$$

From this it also follows that, for every $i = 1, \ldots, N$

$$s(x_i) = \left[\sum_{j=i}^{N} \theta_j x_j - \sum_{j=i+1}^{N} s(x_j)\right] \tag{9}$$

because $\sum_{j=i+1}^{N} s(x_j) \in \mathbb{N}$. Formula (9) is easily obtained and certainly easier than deriving it from a complete induction argument based on (6) and (7) (which, however, is also possible). The vector representing the sampled bibliography is denoted $s(X) = [s(x_1), \ldots, s(x_N)]$. Its decreasing version is denoted $\bar{s}(X) = [\bar{s}(x_1), \ldots, \bar{s}(x_N)]$. Denote by $L_X$, $L_{s(X)}$, $L_{\bar{s}(X)}$ the Lorenz curves derived from $X$, $s(X)$, $\bar{s}(X)$. We have the following general result.


Theorem I

If $(\theta_i)_{i=1,\ldots,N}$ is decreasing and if $\sum_{j=1}^{N} \theta_j x_j \in \mathbb{N}$, then

$$L_{\bar{s}(X)} \ge L_{s(X)} \ge L_X \tag{10}$$

Hence, the sampled bibliography is more concentrated than the original one.

Because the proof is a bit lengthy, we give it in Appendix 1.

Note that the sampling methods described earlier are special cases of this model.

(1) Systematic sampling is obtained by taking

$$\theta_1 = \theta_2 = \ldots = \theta_N = \theta \tag{11}$$

Note that in this case, the condition that $\sum_{j=1}^{N} x_j$ must be a $\frac{1}{\theta}$ multiple is the same as the requirement in Theorem I above:

$$\sum_{j=1}^{N} \theta_j x_j \in \mathbb{N} \tag{12}$$

This is a very natural requirement where we only look at bibliographies in which "entire" items are sampled. Besides, below, we will give a counterexample to Theorem I in case (12) is not satisfied. So, because systematic sampling is included in Theorem I, we have here also that the sampled bibliography is more concentrated than the original one.

(2) Truncation is obtained by taking

$$\theta_1 = \ldots = \theta_i = 1, \qquad \theta_{i+1} = \ldots = \theta_N = 0 \tag{13}$$

Note that here (12) is always valid and that $s(X) = \bar{s}(X)$. Because Theorem I applies to truncation, we have a generalization of the result obtained in Egghe and Rousseau (2001b), which has, as mentioned above, an application in the determination of the core of a bibliography.
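As an illustration of the general model, the sketch below implements formula (8) with source-dependent fractions and checks the conclusion of Theorem I on the two special cases. All names (`general_item_sample`, `lies_above`) and the example vector are our own assumptions for illustration; the check is numerical evidence on one example, not a proof.

```python
from fractions import Fraction
from itertools import accumulate
from math import floor

def general_item_sample(x, thetas):
    """s(X) for source-dependent fractions theta_i, using formula (8)."""
    sample = [0] * len(x)
    weighted_tail = Fraction(0)  # theta_i*x_i + ... + theta_N*x_N
    sampled_tail = 0             # s(x_i) + ... + s(x_N)
    for i in reversed(range(len(x))):
        weighted_tail += Fraction(thetas[i]) * x[i]
        new_tail = floor(weighted_tail)
        sample[i] = new_tail - sampled_tail
        sampled_tail = new_tail
    return sample

def lorenz_ordinates(v):
    total = sum(v)
    return [Fraction(c, total) for c in accumulate(v)]

def lies_above(upper, lower):
    """True if the Lorenz curve of `upper` is nowhere below that of `lower`."""
    return all(u >= l for u, l in zip(lorenz_ordinates(upper), lorenz_ordinates(lower)))

X = [5, 4, 3, 2, 1]                                   # hypothetical bibliography
systematic = general_item_sample(X, [Fraction(1, 3)] * 5)  # constant theta, as in (11)
truncated = general_item_sample(X, [1, 1, 1, 0, 0])        # 3-truncation, as in (13)
print(systematic, lies_above(systematic, X))
print(truncated, lies_above(truncated, X))
```

In both cases the sum of the $\theta_j x_j$ is an integer (5 and 12, respectively), so requirement (12) holds.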

Example Showing That the Requirement (12) Cannot Be Dropped in Theorem I

Even in the case of systematic sampling we can give a counterexample. Take $X = (3,1,1)$, $\theta = 0.5$; hence $\sum_{j=1}^{3} x_j = 5$ is not a $\frac{1}{\theta} = 2$ multiple. It is easily seen that $s(X) = (1,1,0)$, hence

$$A_X = \left(\frac{3}{5}, \frac{1}{5}, \frac{1}{5}\right), \qquad A_{s(X)} = \left(\frac{1}{2}, \frac{1}{2}, 0\right)$$

showing that $L_X$ and $L_{s(X)}$ cross.

If we sample items using the most productive sources first, Theorem I is not true.

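If the Sampling Method Is Done in the Reverse Way, i.e., by Starting with the Most Productive Sources, Then Theorem I Is Not True, Nor Do We Have an Opposite Result

This sampling method replaces (6)–(8) by

$$s(x_1) = [\theta_1 x_1]$$

$$s(x_2) = \left[\theta_2 x_2 + \theta_1 x_1 - [\theta_1 x_1]\right]$$

$$\sum_{j=1}^{i} s(x_j) = \left[\sum_{j=1}^{i} \theta_j x_j\right]$$

Even in the perfectly stratified sampling (PSS) case ($\theta_1 = \ldots = \theta_N = \theta$), neither the analog of Theorem I nor the opposite result ($L_{s(X)} \le L_X$) is generally true. Indeed, take $X = (5,1,1,1,1)$, $\theta = \frac{1}{3}$. Then this sampling method yields $s(X) = (1,1,0,0,1)$, and hence, $\bar{s}(X) = (1,1,1,0,0)$. But

$$A_X = \left(\frac{5}{9}, \frac{1}{9}, \frac{1}{9}, \frac{1}{9}, \frac{1}{9}\right), \qquad A_{\bar{s}(X)} = \left(\frac{1}{3}, \frac{1}{3}, \frac{1}{3}, 0, 0\right)$$

showing that $L_{\bar{s}(X)}$ and $L_X$ cross, and hence also $L_{s(X)} \not\ge L_X$.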
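The reverse procedure differs from the earlier one only in using head sums instead of tail sums. A small sketch (function names are ours, chosen for illustration) reproduces the example and prints the Lorenz ordinates so that the crossing is visible.

```python
from fractions import Fraction
from itertools import accumulate
from math import floor

def reverse_item_sample(x, theta):
    """Systematic item sample starting from the most productive source.

    Head sums of the sample are the integer parts of theta times the
    head sums of x.
    """
    theta = Fraction(theta)
    sample = []
    head_x = 0  # x_1 + ... + x_i
    head_s = 0  # s(x_1) + ... + s(x_i)
    for xi in x:
        head_x += xi
        new_head = floor(theta * head_x)
        sample.append(new_head - head_s)
        head_s = new_head
    return sample

def lorenz_ordinates(v):
    total = sum(v)
    return [Fraction(c, total) for c in accumulate(v)]

X = [5, 1, 1, 1, 1]
s = reverse_item_sample(X, Fraction(1, 3))
s_sorted = sorted(s, reverse=True)
print(s, s_sorted)
print(lorenz_ordinates(X))
print(lorenz_ordinates(s_sorted))
```

At abscissa $\frac{1}{5}$ the curve of $X$ is higher, whereas at $\frac{3}{5}$ the curve of the reordered sample is higher.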

If the Sampling Method Is Applied But with Dropping the Zero Sources, Then Theorem I Is Not True, Nor Do We Have an Opposite Result

Take $X = (12,4,2,1,1)$, $\theta = 0.25$. Then here $s(X) = (3,1,1)$ and so

$$A_X = \left(\frac{12}{20}, \frac{4}{20}, \frac{2}{20}, \frac{1}{20}, \frac{1}{20}\right), \qquad A_{s(X)} = \left(\frac{3}{5}, \frac{1}{5}, \frac{1}{5}\right)$$

It is trivial to see that $L_{s(X)} < L_X$. Of course, any example where no zero sources occur yields $L_{s(X)} \ge L_X$, by Theorem I, because this sampling method and the one described above then coincide. The next is an example where $L_{s(X)}$ and $L_X$ cross.

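Take $X = (3,3,2,2,1,1)$, $\theta = 0.5$. Then $s(X) = (2,1,1,1,1)$,

$$A_X = \left(\frac{3}{12}, \frac{3}{12}, \frac{2}{12}, \frac{2}{12}, \frac{1}{12}, \frac{1}{12}\right), \qquad A_{s(X)} = \left(\frac{2}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}, \frac{1}{6}\right).$$

The equation of the line connecting $(0,0)$ with $\left(\frac{2}{6}, \frac{6}{12}\right)$ is $y = \frac{3}{2}x$. For $x = \frac{1}{5}$ this yields $L_X\left(\frac{1}{5}\right) = \frac{3}{10}$, while $L_{s(X)}\left(\frac{1}{5}\right) = \frac{2}{6} > \frac{3}{10}$. But $L_X\left(\frac{2}{5}\right) > L_X\left(\frac{2}{6}\right) = \frac{1}{2} = L_{s(X)}\left(\frac{2}{5}\right)$. This shows that $L_X$ and $L_{s(X)}$ cross.

This completes the study of all cases of sampling on items. We now proceed with the case of sampling on sources.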
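When zero sources are dropped, the two vectors have different lengths, so their Lorenz curves no longer share abscissae and must be compared by linear interpolation, as done by hand above. The helper below (`lorenz_at`, a name of our own choosing) evaluates a Lorenz curve at an arbitrary abscissa and reproduces the values of the last example.

```python
from fractions import Fraction

def lorenz_at(x, t):
    """Value at abscissa t (0 <= t <= 1) of the polygonal Lorenz curve of x."""
    n = len(x)
    total = sum(x)
    scaled = Fraction(t) * n
    k = int(scaled)  # number of complete sources to the left of t
    if k >= n:
        return Fraction(1)
    # Cumulative production of the first k sources plus the interpolated
    # share of source k + 1.
    partial = sum(x[:k]) + (scaled - k) * x[k]
    return partial / total

X = [3, 3, 2, 2, 1, 1]
S = [2, 1, 1, 1, 1]  # sampled bibliography with the zero source dropped
for t in (Fraction(1, 5), Fraction(2, 5)):
    print(t, lorenz_at(X, t), lorenz_at(S, t))
```

The sign of the difference changes between the two abscissae, which is exactly the crossing described in the text.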

Sampling Sources

Also here, the most important case of sampling in sources, namely systematic sampling, will yield a result as in Theorem I, namely $L_{\bar{s}(X)} \ge L_X$. It is the case where we keep the nonpicked sources as zero sources and where we sample in such an order that the largest sources are kept in the sample. This will be described now.

Description of the Model of Sampling in Sources

Again, we represent the bibliography by $X = (x_1, \ldots, x_N)$ (decreasing), and we let $\varphi \in \mathbb{Q}^+ \cap\ ]0,1[$ be such that $\varphi$ is a multiple of $\frac{1}{N}$. Otherwise stated, $N$ is a $\frac{1}{\varphi}$ multiple. A systematic sample in sources, giving priority to higher productive sources, yields the bibliography represented by the vector

$$s(X) = (x_1, \ldots, x_{1/\varphi - 1}, 0, x_{1/\varphi + 1}, \ldots, x_{2/\varphi - 1}, 0, \ldots, x_{N-1}, 0) \tag{14}$$

The decreasing order version of this vector is then

$$\bar{s}(X) = (x_1, \ldots, x_{1/\varphi - 1}, x_{1/\varphi + 1}, \ldots, x_{2/\varphi - 1}, \ldots, x_{N-1}, \underbrace{0, \ldots, 0}_{K}) \tag{15}$$

where

$$N = \frac{K}{\varphi}, \qquad K \in \mathbb{N}. \tag{16}$$

We have the following result.

Theorem II

The above sampling method in sources yields

$$L_{\bar{s}(X)} \ge L_X. \tag{17}$$

Hence, the sampled bibliography is more concentrated than the original one.

Because the proof is rather technical we give it in Appendix 2.

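The Above Theorem Is Not Valid for the (Nondecreasing) s(X)

This is clear, because so many zeroes occur. An example: take $X = (4,3,2,1)$, $\varphi = 0.5$. Then $s(X) = (4,0,2,0)$. Hence

$$A_X = \left(\frac{4}{10}, \frac{3}{10}, \frac{2}{10}, \frac{1}{10}\right)$$

and

$$A_{s(X)} = \left(\frac{4}{6}, 0, \frac{2}{6}, 0\right)$$

So

$$L_X\left(\frac{1}{4}\right) = \frac{4}{10} < L_{s(X)}\left(\frac{1}{4}\right) = \frac{4}{6}$$

but

$$L_X\left(\frac{1}{2}\right) = \frac{7}{10} > L_{s(X)}\left(\frac{1}{2}\right) = \frac{4}{6}.$$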

The Above Theorem Is Not Valid for Not Perfectly Stratified Samples

Indeed, even the simplest case of replacing one source by a zero source does not yield the result. Take $X = (3,3,1,1,1,1)$ and $s(X) = (3,1,1,1,1,0)$ (hence replacing one source with 3 items by a zero source). Now

$$A_X = \left(\frac{3}{10}, \frac{3}{10}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}, \frac{1}{10}\right)$$

and

$$A_{s(X)} = \left(\frac{3}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7}, 0\right)$$

Note that

$$L_X\left(\frac{1}{6}\right) = \frac{3}{10} < L_{s(X)}\left(\frac{1}{6}\right) = \frac{3}{7}$$

but

$$L_X\left(\frac{2}{6}\right) = \frac{6}{10} > L_{s(X)}\left(\frac{2}{6}\right) = \frac{4}{7}.$$

Note that $L_X$ and $L_{s(X)}$ intersect twice because

$$L_X\left(\frac{5}{6}\right) < L_{s(X)}\left(\frac{5}{6}\right) = 1.$$

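The Above Theorem Is Not Valid If We Sample in the Reverse Way

Here, instead of replacing each $x_{i/\varphi}$ ($i = 1, \ldots, K$) by 0 as in (14), we replace each $x_{N - i/\varphi + 1}$ by 0.

Example: $X = (10,1,1,1)$, $\varphi = 0.5$, $\bar{s}(X) = (1,1,0,0)$. But

$$A_X = \left(\frac{10}{13}, \frac{1}{13}, \frac{1}{13}, \frac{1}{13}\right), \qquad A_{\bar{s}(X)} = \left(\frac{1}{2}, \frac{1}{2}, 0, 0\right)$$

Hence $L_X$ and $L_{\bar{s}(X)}$ cross.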

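The Above Theorem Is Not Valid If We Delete Sources (Instead of Replacing Them by 0)

Indeed, take $X = (4,3,2,1)$, $\varphi = 0.5$ (and sample as in the theorem, except that we delete nonpicked sources). Here $s(X) = (4,2)$ and, hence

$$A_X = \left(\frac{4}{10}, \frac{3}{10}, \frac{2}{10}, \frac{1}{10}\right), \qquad A_{s(X)} = \left(\frac{4}{6}, \frac{2}{6}\right)$$

Hence

$$L_X\left(\frac{1}{2}\right) = \frac{7}{10} > L_{s(X)}\left(\frac{1}{2}\right) = \frac{4}{6}.$$

Also, an opposite example exists: $X = (10,1,1,1)$, $\varphi = 0.5$. Then $s(X) = (10,1)$, hence

$$A_X = \left(\frac{10}{13}, \frac{1}{13}, \frac{1}{13}, \frac{1}{13}\right)$$

and

$$A_{s(X)} = \left(\frac{10}{11}, \frac{1}{11}\right)$$

So $L_X\left(\frac{1}{2}\right) = \frac{11}{13} < L_{s(X)}\left(\frac{1}{2}\right) = \frac{10}{11}$ here.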

Conclusions

In this article we studied different sampling techniques in bibliographies, comprising sampling in items and sampling in sources.

When sampling in items we allow the probability for an item to be picked to be increasing with the number of items in the sources. In this general setting we prove that, if we start sampling in the least productive sources (hereby keeping zero sources), the Lorenz curve of the sampled bibliography is above the Lorenz curve of the original one. In other words, the sampled bibliography is more concentrated than the original one. The model and result apply to the case of systematic sampling (often used as an approximation for random sampling) as well as to truncation of bibliographies [i.e., only using the "core" of the bibliography (see Egghe & Rousseau, 2001b)].

When sampling in sources, a similar result is proven. Here we show that a systematic sample in sources, keeping the most productive sources and replacing nonpicked sources by a zero source, yields a sampled bibliography for which the Lorenz curve is above the one of the original bibliography.

We also show that all variants of the above methods (reversing the order, deleting zero sources, etc.) do not yield such (or another) result.

So, in the two major sampling methods, we have that the sampled bibliography is more concentrated than the original one. This gives information about the concentration of bibliographies (as we receive them, e.g., as the result of an IR action) compared to the (unknown) complete one. In all cases we can say that the observed concentration is higher than the concentration of the complete bibliography.

Note

As remarked by one of the referees, the requirement that

$$\sum_{j=1}^{N} \theta_j x_j \in \mathbb{N}$$

(in the case of sampling in items, Theorem I) and the requirement that $N$ is a $\frac{1}{\varphi}$ multiple (in the case of sampling in sources; see earlier) are not always valid in practice. Very simply, if $N$ is prime, the sampling in sources is not even possible.

What is behind these requirements is that (1) for exact results, we need them (because we show by example that without them the results are wrong); (2) for large $N$ (which is always the case) one can drop the requirements, hereby only making a mistake in the last item (or source) sampled. This yields an alteration of the Lorenz curve which, however, goes to zero for $N \to \infty$. We estimate that in this case (large $N$) the found increase in concentration will be there. In short, we indicate that, in practical cases, concentration increases.

Problem

We leave it as an open problem (for further study) to make explicit calculations of the difference in concentration between a bibliography and its sampled version. It would yield information on the concentration values of a complete (unknown) bibliography.

Acknowledgment

The author is grateful to Dr. R. Rousseau for interesting discussions on the topic of this article.

References

Carpenter, R.L., & Storey Vasu, E. (1978). Statistical methods for librarians. Chicago: American Library Association.

Clarke, G.M., & Cooke, D. (1992). A basic course in statistics (3rd ed.). London: Edward Arnold.

Egghe, L. (1989). The duality of informetric systems with applications to the empirical laws. Ph.D. thesis, The City University, London, UK.

Egghe, L. (1990). The duality of informetric systems with applications to the empirical laws. Journal of Information Science, 16(1), 17–27.

Egghe, L., & Rousseau, R. (1990). Introduction to informetrics. Quantitative methods in library, documentation and information science. Amsterdam: Elsevier.

Egghe, L., & Rousseau, R. (2001a). Elementary statistics for effective library and information service management. London: Aslib.

Egghe, L., & Rousseau, R. (2001b). The core of a scientific subject: An exact definition using concentration theory and fuzzy set theory. In: Proceedings of the 8th International Conference on Scientometrics and Informetrics, Volume 1, Sydney, Australia. The University of New South Wales, 2001, 147–156.

Gini, C. (1909). Il diverso accrescimento delle classi sociali e la concentrazione della ricchezza. Giornale degli Economisti, 11, 37.

Lorenz, M.O. (1905). Methods of measuring concentration of wealth. Journal of the American Statistical Association, 9, 209–219.

Rousseau, R. (1992). Concentration and diversity measures in informetric research. Doctoral dissertation, University of Antwerp.

Rousseau, R. (1993). Measuring concentration: Sampling design issues, as illustrated by the case of perfectly stratified samples. Scientometrics, 28(1), 3–14.

Appendix 1

Proof of Theorem I

That Ls(X) � Ls(X) is trivial because s(X) is the decreas-ing reorder of s(X) and, hence, for every i � 1, . . . , N

�j�1

i

s�xj� � �j�1

i

s�xj� (A1)

and note that

�j�1

N

s�xj� � �j�1

N

s�xj�.

Hence, it suffices to prove that Ls(X) � LX. Denote by T thetotal number of items,

�j�1

N

xj.

LX is obtained by linearly connecting the points

�0, 0�, � 1

N,

x1

T � , . . . , � i

N,

�j�1

i

xj

T � , . . . , �1, 1� (A2)

Ls(X) is obtained by linearly connecting the points

�0, 0�, �1

N,

s�x1�

�j�1

N

�jxj� , . . . , � i

N,

�j�1

i

s�xj�

�j�1

N

�jxj� , . . . , �1, 1�

(A3)

Now

�j�1

i

s�xj�

�j�1

N

�jxj

�j�1

N

s�xj� � �j�i�1

N

s�xj�

�j�1

N

�jxj

�j�1

N

�jxj � � �j�i�1

N

�jxj��j�1

N

�jxj

So, if the above number is larger than or equal to

$$\frac{\sum_{j=1}^{i} x_j}{T}$$

for all $i$, we have shown that $L_{s(X)} \ge L_X$. We can exclude the case $x_{i+1} = \ldots = x_N = 0$ because then the assertion is trivial. We have to show that


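$$\frac{\sum_{j=1}^{i} \varphi_j x_j + \sum_{j=i+1}^{N} \varphi_j x_j}{\sum_{j=1}^{i} x_j + \sum_{j=i+1}^{N} x_j} = \frac{\sum_{j=1}^{N} \varphi_j x_j}{T} \ge \frac{\sum_{j=i+1}^{N} \varphi_j x_j}{\sum_{j=i+1}^{N} x_j}.$$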

Because for all $a, b, c, d \in \mathbb{R}^{+}$

$$\frac{a}{b} \ge \frac{c}{d} \iff \frac{a + c}{b + d} \ge \frac{c}{d},$$

it suffices to show that

$$\frac{\sum_{j=1}^{i} \varphi_j x_j}{\sum_{j=1}^{i} x_j} \ge \frac{\sum_{j=i+1}^{N} \varphi_j x_j}{\sum_{j=i+1}^{N} x_j} \tag{A4}$$
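The equivalence between the two fraction inequalities invoked just before (A4) can be checked by cross-multiplication. The following short derivation is a sketch that assumes $b, d > 0$ (an assumption stated here so that multiplying through preserves the direction of each inequality):

```latex
\begin{align*}
\frac{a}{b} \ge \frac{c}{d}
  &\iff ad \ge bc \\
  &\iff ad + cd \ge bc + cd \\
  &\iff (a + c)\,d \ge c\,(b + d) \\
  &\iff \frac{a + c}{b + d} \ge \frac{c}{d}.
\end{align*}
```

In the proof, $a$ and $b$ play the roles of the sampled and unsampled partial sums over the first $i$ sources, and $c$ and $d$ those over the remaining $N - i$ sources.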

From the lemma below we have that

$$\frac{\sum_{j=1}^{i} \varphi_j x_j}{\sum_{j=1}^{i} x_j} = \theta_i \in [\varphi_i, \varphi_1] \tag{A5}$$

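unless $x_1 = \ldots = x_i = 0$, but then, because $X$ decreases, $X$ is the zero vector, which we exclude, and (applied to $\varphi_{i+1}, \ldots, \varphi_N, x_{i+1}, \ldots, x_N$)

$$\frac{\sum_{j=i+1}^{N} \varphi_j x_j}{\sum_{j=i+1}^{N} x_j} = \theta'_i \in [\varphi_N, \varphi_{i+1}] \tag{A6}$$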

(because $x_{i+1} = \ldots = x_N = 0$ is excluded). Because $\varphi_{i+1} \le \varphi_i$ for all $i$, we have that (A5) and (A6) imply (A4), completing the proof of the theorem.
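The statement just proved can be illustrated numerically. The sketch below is not part of the original proof: the production vector `x` and the decreasing sampling fractions `phi` are hypothetical values chosen for illustration, and the helper names `lorenz_ordinates` and `dominates` are introduced here. It computes the ordinates of both Lorenz curves at the abscissae $i/N$ with exact rational arithmetic and checks that the sampled curve lies on or above the complete one.

```python
from fractions import Fraction


def lorenz_ordinates(values):
    """Cumulative relative shares of `values`, taken in the given order."""
    total = sum(values)
    if total == 0:
        raise ValueError("Lorenz curve is undefined for the zero vector")
    running = Fraction(0)
    ordinates = []
    for v in values:
        running += Fraction(v)
        ordinates.append(running / total)
    return ordinates


def dominates(upper, lower):
    """True if `upper` is pointwise >= `lower` at every abscissa i/N."""
    return all(u >= l for u, l in zip(upper, lower))


# Hypothetical decreasing production vector (items per source).
x = [40, 25, 10, 5, 3, 1]
# Hypothetical decreasing sampling fractions in [0, 1], one per source.
phi = [Fraction(9, 10), Fraction(3, 4), Fraction(1, 2),
       Fraction(1, 2), Fraction(1, 4), Fraction(0)]

# Sampled vector s(X), with s(x_j) = phi_j * x_j.
s_x = [p * v for p, v in zip(phi, x)]

L_x = lorenz_ordinates(x)
L_sx = lorenz_ordinates(s_x)
theorem_holds = dominates(L_sx, L_x)
print(theorem_holds)
```

Exact fractions are used instead of floats so that the comparison at each point is not affected by rounding.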

Lemma

$$\sum_{j=1}^{i} \varphi_j x_j = \theta_i \left(\sum_{j=1}^{i} x_j\right) \tag{A7}$$

where $\theta_i \in [\varphi_i, \varphi_1]$.

Proof

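Because all $\varphi_j \in [0, 1]$ we have that

$$\sum_{j=1}^{i} \varphi_j x_j \in \left[0, \sum_{j=1}^{i} x_j\right]$$

and hence,

$$\sum_{j=1}^{i} \varphi_j x_j = \theta_i \left(\sum_{j=1}^{i} x_j\right) \tag{A8}$$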

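for a certain $\theta_i \in [0, 1]$. Suppose now that $\theta_i < \varphi_i$. Hence, $\theta_i < \varphi_j$ for all $j = 1, \ldots, i$, and hence,

$$\theta_i \left(\sum_{j=1}^{i} x_j\right) = \sum_{j=1}^{i} \theta_i x_j < \sum_{j=1}^{i} \varphi_j x_j$$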

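unless $x_1 = \ldots = x_i = 0$, in which case (A7) is trivial for any $\theta_i$. Hence, this inequality contradicts (A8). In the same way, suppose $\theta_i > \varphi_1$. Hence, $\theta_i > \varphi_j$ for all $j = 1, \ldots, N$, hence for all $j = 1, \ldots, i$. So,

$$\theta_i \left(\sum_{j=1}^{i} x_j\right) = \sum_{j=1}^{i} \theta_i x_j > \sum_{j=1}^{i} \varphi_j x_j \tag{A9}$$

contradicting (A8) again. Consequently, $\theta_i \in [\varphi_i, \varphi_1]$.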
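The lemma says that the ratio of the sampled to the unsampled partial sum over the first $i$ sources is a weighted mean of the first $i$ sampling fractions, so it cannot leave the interval they span. A minimal numerical check follows, assuming hypothetical data; the names `x`, `phi` and `weighted_mean_fraction` are introduced here for illustration only.

```python
from fractions import Fraction


def weighted_mean_fraction(phi, x, i):
    """Ratio of sum(phi_j * x_j) to sum(x_j) over the first i sources (1-based i)."""
    denominator = sum(x[:i])
    if denominator == 0:
        raise ValueError("ratio undefined when x_1 = ... = x_i = 0")
    numerator = sum(p * v for p, v in zip(phi[:i], x[:i]))
    return numerator / Fraction(denominator)


# Hypothetical decreasing production vector and decreasing sampling fractions.
x = [40, 25, 10, 5, 3, 1]
phi = [Fraction(9, 10), Fraction(3, 4), Fraction(1, 2),
       Fraction(1, 2), Fraction(1, 4), Fraction(0)]

thetas = [weighted_mean_fraction(phi, x, i) for i in range(1, len(x) + 1)]

# For each i, the ratio should lie between phi_i (smallest of the first i
# fractions) and phi_1 (largest of them).
in_interval = all(phi[i - 1] <= thetas[i - 1] <= phi[0]
                  for i in range(1, len(x) + 1))
print(in_interval)
```
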

Appendix 2

Proof of Theorem II

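We have to compare the Lorenz curves $L_X$ of $X = (x_1, \ldots, x_N)$ and of

$$s(X) = \Big(x_1, \ldots, x_{\frac{1}{\varphi} - 1}, x_{\frac{1}{\varphi} + 1}, \ldots, x_{\frac{2}{\varphi} - 1}, x_{\frac{2}{\varphi} + 1}, \ldots, x_{\frac{K-1}{\varphi} + 1}, \ldots, x_{\frac{K}{\varphi} - 1}, \overbrace{0, \ldots, 0}^{K}\Big) \tag{A10}$$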

To prove that $L_{s(X)} \ge L_X$ we have to show (denote $M = 1/\varphi$) that

$$\frac{x_1 + \ldots + x_{M-1} + x_{M+1} + \ldots + x_{2M-1} + \ldots + x_{iM-1} + x_{iM+1} + \ldots + x_{iM+j}}{T - \sum_{\ell=1}^{K} x_{\ell M}} \ge \frac{x_1 + \ldots + x_{iM+j-i}}{T} \tag{A11}$$


where $i = 1, \ldots, K-1$; $j = 1, \ldots, M-1$ (note that $j = M-1$ denotes the case where we have $x_{(i+1)M-1}$ as last term in the numerator of the left-hand side, and note that the case $i = 0$ yields (A11) trivially, for all $j = 1, \ldots, M-1$). (A11) gives

$$\begin{aligned}
& x_1 T + \ldots + x_{M-1} T + x_{M+1} T + \ldots + x_{2M-1} T + \ldots + x_{iM-1} T + x_{iM+1} T + \ldots + x_{iM+j} T \\
&\quad \ge x_1 T + \ldots + x_{iM+j-i} T - x_M (x_1 + \ldots + x_{iM+j-i}) - x_{2M} (x_1 + \ldots + x_{iM+j-i}) - \ldots - x_{KM} (x_1 + \ldots + x_{iM+j-i})
\end{aligned}$$

where $KM = N$. In the right-hand side, among the sum $x_1 T + \ldots + x_{iM+j-i} T$ there are exactly

$$\alpha = \left\lfloor \frac{iM + j - i}{M} \right\rfloor$$

multiples of $M$, which remain after cancellation with the left-hand side. This yields the equivalent condition

$$\begin{aligned}
& x_{iM+j-i+1} T + \ldots + x_{iM+j} T \ge x_M T + x_{2M} T + \ldots + x_{\alpha M} T \\
&\quad - x_M (x_1 + \ldots + x_{iM+j-i}) - x_{2M} (x_1 + \ldots + x_{iM+j-i}) - \ldots - x_{KM} (x_1 + \ldots + x_{iM+j-i})
\end{aligned}$$

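Hence

$$\begin{aligned}
& x_{iM+j-i+1} T + \ldots + x_{iM+j} T + x_{(\alpha+1)M} (x_1 + \ldots + x_{iM+j-i}) + x_{(\alpha+2)M} (x_1 + \ldots + x_{iM+j-i}) + \ldots + x_{KM} (x_1 + \ldots + x_{iM+j-i}) \\
&\quad \ge x_M (x_{iM+j-i+1} + \ldots + x_N) + x_{2M} (x_{iM+j-i+1} + \ldots + x_N) + \ldots + x_{\alpha M} (x_{iM+j-i+1} + \ldots + x_N) \\
&\quad = \left(\sum_{p=1}^{\alpha} x_{pM}\right) \left(\sum_{t=iM+j-i+1}^{iM+j} x_t + \sum_{s=iM+j+1}^{N} x_s\right)
\end{aligned} \tag{A12}$$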

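Hence, we have the condition

$$\left(\sum_{t=iM+j-i+1}^{iM+j} x_t\right) \left(T - \sum_{p=1}^{\alpha} x_{pM}\right) + \left(\sum_{q=\alpha+1}^{K} x_{qM}\right) \left(\sum_{r=1}^{iM+j-i} x_r\right) \ge \left(\sum_{p=1}^{\alpha} x_{pM}\right) \left(\sum_{s=iM+j+1}^{N} x_s\right) \tag{A13}$$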

Note that

$$T - \sum_{p=1}^{\alpha} x_{pM} \ge (M - 1) \sum_{p=1}^{\alpha} x_{pM}$$

because $X$ decreases, and, again because $X$ decreases,

$$\sum_{t=iM+j-i+1}^{iM+j} x_t \ge \sum_{m=iM+j+1}^{iM+j+i} x_m \ge \sum_{m=iM+j+i+1}^{iM+j+2i} x_m \ge \ldots \ge \sum_{m=iM+j+(M-2)i+1}^{iM+j+(M-1)i} x_m$$

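Hence

$$\begin{aligned}
\left(\sum_{t=iM+j-i+1}^{iM+j} x_t\right) \left(T - \sum_{p=1}^{\alpha} x_{pM}\right)
&\ge \left(\sum_{p=1}^{\alpha} x_{pM}\right) \left(\sum_{m=iM+j+1}^{iM+j+i} x_m + \sum_{m=iM+j+i+1}^{iM+j+2i} x_m + \ldots + \sum_{m=iM+j+(M-2)i+1}^{iM+j+(M-1)i} x_m\right) \\
&= \left(\sum_{p=1}^{\alpha} x_{pM}\right) \left(\sum_{m=iM+j+1}^{iM+j+(M-1)i} x_m\right)
\end{aligned}$$

($M - 1$ terms in the last factor of the first line). Hence, it suffices to show, by (A13), that

$$\left(\sum_{p=1}^{\alpha} x_{pM}\right) \left(\sum_{m=iM+j+1}^{iM+j+(M-1)i} x_m\right) + \left(\sum_{q=\alpha+1}^{K} x_{qM}\right) \left(\sum_{r=1}^{iM+j-i} x_r\right) \ge \left(\sum_{p=1}^{\alpha} x_{pM}\right) \left(\sum_{s=iM+j+1}^{N} x_s\right)$$

which reduces to

$$\left(\sum_{q=\alpha+1}^{K} x_{qM}\right) \left(\sum_{r=1}^{iM+j-i} x_r\right) \ge \left(\sum_{p=1}^{\alpha} x_{pM}\right) \left(\sum_{s=iM+j+(M-1)i+1}^{N} x_s\right) \tag{A14}$$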

or

$$\frac{\sum_{p=1}^{\alpha} x_{pM}}{\sum_{q=\alpha+1}^{K} x_{qM}} \le \frac{\sum_{r=1}^{iM+j-i} x_r}{\sum_{s=2Mi+j-i+1}^{N} x_s} \tag{A15}$$


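But for all $p = 1, \ldots, \alpha$

$$x_{pM} \le \frac{1}{M} \left(x_{(p-1)M+1} + \ldots + x_{pM}\right)$$

and for all $q = \alpha + 1, \ldots, K - 1$

$$x_{qM} \ge \frac{1}{M} \left(x_{qM+1} + \ldots + x_{(q+1)M}\right)$$

because $X$ decreases, yielding, because $N = KM$,

$$\frac{\sum_{p=1}^{\alpha} x_{pM}}{\sum_{q=\alpha+1}^{K} x_{qM}} \le \frac{\frac{1}{M} \left(x_1 + \ldots + x_{\alpha M}\right)}{\frac{1}{M} \left(x_{(\alpha+1)M+1} + \ldots + x_N\right) + x_N} \le \frac{x_1 + \ldots + x_{\alpha M}}{x_{(\alpha+1)M+1} + \ldots + x_N} \le \frac{x_1 + \ldots + x_{iM+j-i}}{x_{(\alpha+1)M+1} + \ldots + x_N} \tag{A16}$$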

because $\alpha M \le iM + j - i$. Furthermore,

$$(\alpha + 1)M + 1 = \alpha M + M + 1 \le iM + j - i + M + 1 \le 2iM + j - i + 1$$

because $i \ge 1$. Consequently, (A16) gives

$$\frac{\sum_{p=1}^{\alpha} x_{pM}}{\sum_{q=\alpha+1}^{K} x_{qM}} \le \frac{\sum_{r=1}^{iM+j-i} x_r}{\sum_{s=2iM+j-i+1}^{N} x_s}$$

which is (A15) and, hence, (A11). Consequently, $L_{s(X)} \ge L_X$ in all cases.
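The conclusion of this appendix can also be checked on a small example. The sketch below is an illustration under stated assumptions, not the author's procedure: the vector `x` and the value of `M` are hypothetical, and the helper `systematic_source_sample` is a name introduced here. It builds the sampled vector of (A10) by dropping every $M$th source and appending $K = N/M$ zeros, then compares the two Lorenz curves at the abscissae $i/N$.

```python
from fractions import Fraction


def lorenz_ordinates(values):
    """Cumulative relative shares of `values`, taken in the given order."""
    total = sum(values)
    if total == 0:
        raise ValueError("Lorenz curve is undefined for the zero vector")
    running = Fraction(0)
    ordinates = []
    for v in values:
        running += Fraction(v)
        ordinates.append(running / total)
    return ordinates


def systematic_source_sample(x, M):
    """Drop x_M, x_2M, ..., x_KM (1-based) and pad with K zeros, as in (A10)."""
    if len(x) % M != 0:
        raise ValueError("this sketch assumes N = K * M")
    kept = [v for index, v in enumerate(x, start=1) if index % M != 0]
    return kept + [0] * (len(x) // M)


# Hypothetical decreasing production vector with N = 12 sources.
x = [30, 20, 12, 9, 7, 5, 4, 3, 2, 2, 1, 1]
M = 3  # every third source is removed, so K = 4

s_x = systematic_source_sample(x, M)
L_x = lorenz_ordinates(x)
L_sx = lorenz_ordinates(s_x)

# Pointwise comparison of the two curves at i/N, i = 1, ..., N.
sample_dominates = all(u >= l for u, l in zip(L_sx, L_x))
print(s_x)
print(sample_dominates)
```

Padding with zeros keeps both vectors at length $N$, so the two curves are compared at the same abscissae.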
