An Investigation of the Coauthor Graph

Elisabeth L. Logan School of Library and Information Studies, Florida State University, Tallahassee, FL 32306

W. M. Shaw, Jr. School of Library Science, University of North Carolina, Chapel Hill, NC 27514

The structure of coauthor graphs and the statistical validity of the associated author partitions are investigated as a function of productivity and collaborative thresholds. The productivity threshold determines the number of authors (points) in a coauthor graph, and the collaborative threshold determines the number of coauthor pairs (lines) in the graph. The statistical validity of author partitions is determined by the random-graph hypothesis. The results show that for “small” databases, statistically preferred partitions occur when all authors and coauthor pairs appear in the graph. For “large” databases, statistically preferred partitions occur when authors and coauthor pairs who publish only one article are excluded from the graph. Unlike other bibliometric relationships, the highly selective nature of the collaborative relationship produces a wide range of threshold values for which the associated partitions are statistically valid. It remains to be shown how the statistical validity of partitions is related to the empirical significance of the same partitions.

Introduction

Communication networks that connect members of an academic discipline or subject specialty have been investigated by a number of authors. Among these, Price and Beaver [1], Crawford [2], Crane [3], Griffith and Mullins [4], and Mullins [5] have examined complex structures based on a variety of social relationships that influence the process of informal communication. In each of these studies, the coauthor relationship contributes to the structure of the communication network.

Studies have also indicated that in many scientific disciplines there has been a significant increase in multiple authorship, although individual disciplines vary in both the rate of increase and the frequency of coauthorship [6-11]. Price documented an increase in contributions by multiple authors cited in Chemical Abstracts from less than 20% in 1900 to more than 60% in 1960 [7]. More recently, Stefaniak reported that of those articles published by Polish authors in 1978 and cited in the 1980 edition of Chemical Abstracts Condensates, 70% represented collaborative efforts [12]. On the basis of an extensive investigation of scientific collaboration, Beaver and Rosen concluded that “During the twentieth century, the increasing propensity of scientists to collaborate has been a significant trend within the overall pattern of a mushrooming scientific establishment” [13].

Given the importance of collaboration in certain disciplines, it is reasonable to assume that the coauthor relationship plays a significant role in the communication structure produced by a broader class of sociological relationships. Indeed, in some areas it might be possible to derive essentially the same information from an analysis of coauthor structures that can be derived from an analysis of more complex structures. Because of the potential importance of the coauthor relationship in the study of informal communication, it is desirable to understand certain formal properties of the relationship.

Received January 2, 1986; accepted July 18, 1986

© 1987 by John Wiley and Sons, Inc.

Structures produced by the coauthor relationship, like those produced by other bibliometric relationships, are sensitive to certain threshold choices that heretofore have been made in an essentially arbitrary manner. The purposes of this article are to show how the coauthor structure is influenced by “productivity” and “collaborative” thresholds, and to show how the statistical validity of coauthor partitions is influenced by the same threshold choices. The results are compared and contrasted to similar results produced by the cocitation relationship and by an indexing representation.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 38(4):262-268, 1987 CCC 0002-8231/87/040262-07$04.00


Coauthor Graph

The structure imposed on a set of authors by the coauthor relationship can be formally defined as a graph. A coauthor multigraph is composed of a set of authors, referred to as the point set, and pairs of distinct authors in the point set, referred to as lines, who have collaborated one or more times. The structure of a coauthor (threshold) graph is a function of two threshold values: the number of articles an author has published and the number of articles a pair of authors has published together. Hereafter, a coauthor graph will be understood to mean a graph in which the points are authors who have published at least some threshold number of papers, denoted by $t_p$, and the lines are pairs of authors who have collaborated at least some threshold number of times, denoted by $t_c$. Many combinations of threshold values will cause the authors to be grouped into disjoint clusters or components. The result is a disconnected graph in which the points are distributed among the components. In this case, a partition is produced such that

$$\sum_{i=1}^{s} p_i = p, \qquad (1)$$

where $s$ is the total number of components in the graph, $p$ is the total number of points in the graph, and $p_i$ is the number of points in the $i$th component, referred to as the order of that component. Components of order one ($p_i = 1$) are referred to as isolated points, and a graph composed of one component ($s = 1$) is said to be connected. The number of lines in a graph is denoted by $q$ [14,15]. Figure 1 illustrates a disconnected coauthor graph with seven points ($p = 7$), five lines ($q = 5$), and three components ($s = 3$).
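The partition in Eq. (1) can be illustrated with a short sketch. The Python function below is not part of the original study: the name `component_orders` and the seven-author example are hypothetical, chosen only to match the counts quoted for Figure 1 (seven points, five lines, three components), not the figure's actual layout.

```python
def component_orders(points, lines):
    """Return the orders p_i of the components of a graph, largest first."""
    parent = {v: v for v in points}  # union-find forest, one tree per component

    def find(v):
        # Follow parent links to the root, halving the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in lines:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    orders = {}
    for v in points:
        root = find(v)
        orders[root] = orders.get(root, 0) + 1
    return sorted(orders.values(), reverse=True)


# Hypothetical graph: seven authors (points) and five coauthor pairs (lines).
points = ["a", "b", "c", "d", "e", "f", "g"]
lines = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f")]
orders = component_orders(points, lines)
print(orders, len(orders), sum(orders))
```

Here the component orders are 3, 3, and 1 (one isolated point), so the number of components is three and the orders sum to the seven points, as Eq. (1) requires.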

The partition of a disconnected graph can be expressed in terms of the component-order distribution, which specifies the number of components of a given order [16]. In the case of a completely enumerated population, the partition can also be represented by the “relative statistical disorder” (the “relative entropy” or the “relative diversity”) of the associated distribution [17]. The relative statistical disorder of the partition is denoted by $I_r$ and is given by

$$I_r = \frac{I}{I_{\max}}, \qquad (2)$$

where

$$I = \log_2\!\left(\frac{p!}{p_1!\, p_2! \cdots p_i! \cdots p_s!}\right), \qquad (3)$$

and

$$I_{\max} = \log_2[p!]. \qquad (4)$$

The statistical disorder ($I$) and the maximum statistical disorder ($I_{\max}$) of a partition are expressed in “bits” [18]. The argument of the logarithmic function in Eq. (3) is the multinomial coefficient, which gives the number of ways $p$ distinguishable objects (authors in this application) can be distributed among $s$ mutually exclusive categories (the components of a disconnected coauthor graph in this application) such that the number of objects in each category remains the same. The magnitude of $I$ is an explicit function of the partition and is sensitive to changes in the associated component-order distribution. The relative statistical disorder of the partition varies from zero, where the graph is composed of one component, to one, where all components are of order one.
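As a minimal sketch of this measure, the function below computes the relative statistical disorder as the ratio of $I$ to $I_{\max}$, which is the ratio implied by the limits of zero and one just described. The function names are hypothetical, and the use of the log-gamma function to evaluate $\log_2(n!)$ without forming large factorials is an implementation choice of this sketch, not something taken from the original work.

```python
from math import lgamma, log


def log2_factorial(n):
    """log2(n!) computed through the log-gamma function to avoid huge integers."""
    return lgamma(n + 1) / log(2)


def relative_disorder(orders):
    """Relative statistical disorder I / I_max for component orders p_1..p_s."""
    p = sum(orders)
    i_max = log2_factorial(p)  # maximum disorder, log2(p!)
    if i_max == 0:
        return 0.0  # zero or one point: only one partition is possible
    # Disorder of the partition: log2 of the multinomial coefficient.
    i = i_max - sum(log2_factorial(k) for k in orders)
    return i / i_max


print(relative_disorder([7]))        # a connected graph on seven points
print(relative_disorder([1] * 7))    # seven isolated points
print(round(relative_disorder([3, 3, 1]), 2))
```

The first two calls reproduce the limiting values of zero and one; the third, for a hypothetical partition of seven points into components of order 3, 3, and 1, gives about 0.58.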

The structure of a coauthor graph and the partition of the set of authors are dictated by the magnitudes of $t_p$ and $t_c$ [19]. The value of $t_p$ determines the number of points ($p$) in the graph, and the magnitude of $t_c$ determines the number of lines ($q$) in the graph.
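The role of the two thresholds can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the software used in the study: the function name `coauthor_graph`, the input format (one list of author names per article), and the four-article example are hypothetical, and the sketch assumes that a line is kept only when both of its authors also satisfy the productivity threshold.

```python
from collections import Counter
from itertools import combinations


def coauthor_graph(articles, t_p=1, t_c=1):
    """Return (points, lines) of the coauthor threshold graph.

    articles: iterable of author-name lists, one list per article.
    t_p: minimum number of articles an author must have published.
    t_c: minimum number of articles a pair of authors must share.
    """
    papers_per_author = Counter()
    papers_per_pair = Counter()
    for authors in articles:
        unique = sorted(set(authors))  # ignore duplicate names within an article
        papers_per_author.update(unique)
        papers_per_pair.update(combinations(unique, 2))

    points = {a for a, n in papers_per_author.items() if n >= t_p}
    # Assumption: a line survives only if both endpoints are retained points.
    lines = {
        pair
        for pair, n in papers_per_pair.items()
        if n >= t_c and pair[0] in points and pair[1] in points
    }
    return points, lines


# Hypothetical bibliography of four articles.
articles = [["A", "B"], ["A", "B", "C"], ["C", "D"], ["E"]]
for t_p, t_c in [(1, 1), (2, 1), (2, 2)]:
    points, lines = coauthor_graph(articles, t_p, t_c)
    print(t_p, t_c, len(points), len(lines))
```

In this toy example, raising $t_p$ from 1 to 2 removes the two single-article authors, and raising $t_c$ from 1 to 2 leaves only the one pair that collaborated twice.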

Random-Graph Hypothesis

It is assumed under the random-graph hypothesis (RGH) that the lines of a graph are selected randomly from the set of all possible lines. The assumption underlying the RGH constitutes a null hypothesis that defines a set of points and lines for which there is no “clustering structure.” No meaningful alternative hypothesis has been formulated; that is, no practical and mathematically usable definition of “clustering structure” has been presented. Given the above definition of no clustering structure, techniques associated with random-graph theory can be used to predict expected properties of random graphs [20, 21].

In the present application, partitions of authors produced by the coauthor relationship are investigated. For any graph with $p$ points, the total number of possible lines, denoted by $Q$, is given by $Q = p(p - 1)/2$. An algorithm, implemented by a program called RANGRF, is used to select $q$ lines without replacement from a set of $Q$ possible lines, construct the graph, and compute the relative statistical disorder of the associated partition [22]. The process continues until the most likely partition, the expected partition under the RGH, can be estimated. If the relative statistical disorder of the experimentally observed partition of $p$ authors (points) with $q$ coauthor pairs (lines), denoted by $I_{ro}$, is not significantly different from that of the expected partition, denoted by $I_{re}$, the RGH must be accepted. Acceptance of the RGH provides strong evidence that no clustering structure exists in the data, and that the observed partition is merely an artifact of the clustering technique [23]. Partitions for which the RGH is accepted will be referred to as “statistically invalid,” and partitions for which the RGH is rejected will be referred to as “statistically valid.” In the present application, acceptance of the RGH implies that all pairs of authors in some population of investigators are equally likely to collaborate.
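The sampling procedure just described can be approximated by a small Monte Carlo sketch. It is not RANGRF, whose internals are not given here: the function names, the number of trials, the fixed seed, and the choice to summarize the simulated partitions by the mean and standard deviation of their relative statistical disorder (rather than by the most likely partition) are all assumptions of this illustration.

```python
import random
from itertools import combinations
from math import lgamma, log
from statistics import mean, stdev


def relative_disorder(orders):
    """Relative statistical disorder I / I_max for a list of component orders."""
    i_max = lgamma(sum(orders) + 1) / log(2)
    if i_max == 0:
        return 0.0
    i = i_max - sum(lgamma(k + 1) for k in orders) / log(2)
    return i / i_max


def component_orders(p, lines):
    """Orders of the components of a graph on the points 0..p-1."""
    parent = list(range(p))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in lines:
        parent[find(a)] = find(b)  # union of the two components
    counts = {}
    for v in range(p):
        root = find(v)
        counts[root] = counts.get(root, 0) + 1
    return list(counts.values())


def expected_disorder(p, q, trials=200, seed=0):
    """Estimate the mean and standard deviation of I_r under the RGH."""
    rng = random.Random(seed)
    all_lines = list(combinations(range(p), 2))  # the Q = p(p - 1)/2 possible lines
    samples = []
    for _ in range(trials):
        lines = rng.sample(all_lines, q)  # q lines drawn without replacement
        samples.append(relative_disorder(component_orders(p, lines)))
    return mean(samples), stdev(samples)


# Seven points and five lines, the counts quoted for Figure 1.
i_re, spread = expected_disorder(7, 5)
print(round(i_re, 3), round(spread, 3))
```

Listing all $Q$ possible lines is practical for graphs of the size considered here (a graph with 1007 points has about half a million possible lines); a much larger graph would call for sampling pairs directly.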

FIG. 1. An example of a disconnected coauthor graph.

It has been shown recently that the statistical validity of partitions produced by the cocitation relationship varies as a function of the cocitation strength (an integer-valued similarity threshold conceptually equivalent to $t_c$) [24]. It has also been shown that the statistical validity of partitions produced by an indexing representation varies as a function of a similarity threshold [25]. For both the cocitation relationship and the indexing representation, small values of the associated similarity threshold produce statistically invalid partitions. That is, for these threshold values, the absolute value of the difference between the relative statistical disorder of observed and expected partitions, denoted by $|I_{ro} - I_{re}|$, is not significant. For these similarity values, the RGH is accepted, and it is concluded that there is no inherent clustering structure in the data. In both cases, as the associated similarity threshold is increased, the magnitude of $|I_{ro} - I_{re}|$ increases to a maximum value and then decreases to a negligible difference, as illustrated in Figure 2. For intermediate values of the similarity thresholds, there is a significant difference between the observed and expected (on the basis of chance) partitions. As shown in Figure 2, critical similarity thresholds can be defined. Between and including these critical thresholds all partitions are statistically valid. A “preferred” partition can also be defined. The observed partition for which $|I_{ro} - I_{re}|$ is maximized will be referred to as the preferred partition. That is, the observed partition with the greatest absolute deviation from the associated expected partition is preferred on the basis of statistical considerations. It is of interest to determine if the highly selective process of collaboration has an influence on the range of statistical validity, and if a preferred partition can be detected as a function of the productivity and collaborative thresholds, $t_p$ and $t_c$, respectively.
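The definition of the preferred partition amounts to locating the threshold at which the absolute deviation is largest. A minimal sketch follows; the function name `preferred_threshold` is hypothetical and the numbers are made up for illustration only, so they should not be read as results from this or any other study.

```python
def preferred_threshold(observed, expected):
    """Return (threshold, deviation) where |I_ro - I_re| is largest.

    observed, expected: dicts mapping a threshold value to the relative
    statistical disorder of the observed and expected partitions.
    """
    deviations = {t: abs(observed[t] - expected[t]) for t in observed}
    best = max(deviations, key=deviations.get)
    return best, deviations[best]


# Made-up values that rise to a maximum and then fall, as in Figure 2.
observed = {1: 0.90, 2: 0.70, 3: 0.95, 4: 0.99}
expected = {1: 0.88, 2: 0.95, 3: 0.99, 4: 0.99}
threshold, deviation = preferred_threshold(observed, expected)
print(threshold, round(deviation, 2))
```

With these illustrative values the preferred partition falls at a threshold of 2. Deciding which thresholds bound the region of statistical validity additionally requires a significance test on each deviation, which this sketch does not attempt.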

FIG. 2. An illustration of the absolute difference between the relative statistical disorder of observed and expected partitions ($|I_{ro} - I_{re}|$) as a function of the associated similarity threshold for the cocitation relationship and an indexing representation. [Plot not reproduced: vertical axis $|I_{ro} - I_{re}|$ from 0.0 to 1.0; horizontal axis similarity threshold, with the region of statistical validity marked.]

Results

Databases

The results presented here are based on bibliographies of three tropical diseases: Leishmaniasis, Trypanosomiasis, and Filariasis. The database for each disease is composed of journal articles published in the six-year interval 1977-1982 that appear in the MEDLINE database. The total number of articles, authors, and coauthor pairs is given in Table 1.

Threshold Choices and the Structure of Coauthor Graphs

The importance of investigating the validity of author partitions as a function of $t_p$ and $t_c$ is related to the strong influence these thresholds have on the structure of the underlying coauthor graph [26].

Because author productivity is highly skewed, the magnitude of $t_p$ can be expected to have a significant influence on the number of points in the graph [27]. Table 2 gives the number of points ($p$) in coauthor graphs as a function of the productivity threshold ($t_p$) for the Leishmaniasis, Trypanosomiasis, and Filariasis databases. It can be seen that the magnitude of $t_p$ has a strong influence on the number of points in coauthor graphs. For example, in the case of Leishmaniasis, approximately 79% of the authors are excluded at $t_p = 2$, and in the case of Filariasis, approximately 72% of the authors are excluded at $t_p = 2$. For all three databases, the coauthor graphs are dominated by authors who have published only one article. The influence these authors have on the statistical validity of coauthor partitions is examined in the next section.

The productivity of coauthor pairs is also highly skewed, and the magnitude of $t_c$ can be expected to have a significant impact on the number of lines in the graph. Table 3 gives the number of lines ($q$) in coauthor graphs as a function of the collaborative threshold ($t_c$) for the three databases. It can be seen that the magnitude of $t_c$ has a stronger influence on $q$ than the magnitude of $t_p$ has on $p$. For example, in the case of Leishmaniasis, approximately 88% of the lines are eliminated at $t_c = 2$, and in the case of Filariasis, approximately 80% of the lines are excluded at $t_c = 2$. In all three cases, the structures of coauthor graphs are dictated by pairs of authors who have collaborated on only one article.
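The exclusion percentages quoted in the two preceding paragraphs follow directly from the counts reported in Tables 2 and 3. The short check below recomputes them; the helper name `percent_excluded` is hypothetical, and only the Leishmaniasis and Filariasis counts at thresholds 1 and 2 are used.

```python
def percent_excluded(total, retained):
    """Percentage of points or lines removed when a threshold is raised."""
    return 100.0 * (total - retained) / total


# (count at threshold 1, count at threshold 2), as reported in Tables 2 and 3.
points = {"Leishmaniasis": (371, 78), "Filariasis": (1007, 279)}
lines = {"Leishmaniasis": (550, 63), "Filariasis": (2083, 402)}

for name in points:
    lost_points = percent_excluded(*points[name])
    lost_lines = percent_excluded(*lines[name])
    print(f"{name}: {lost_points:.1f}% of points, {lost_lines:.1f}% of lines excluded")
```

In both databases the fraction of lines removed at a threshold of 2 exceeds the fraction of points removed, which is the sense in which $t_c$ has the stronger influence.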

Because it can be demonstrated that the threshold values of $t_p$ and $t_c$ have a significant impact on the structure of a coauthor graph and the partition of the associated set of authors, and because it is customary to investigate bibliometric structures at one combination of threshold values, important questions arise. Do all threshold combinations produce equally meaningful structures and partitions? What is meant by a meaningful partition? The significance of coauthor partitions can be addressed formally by means of the RGH.

TABLE 1. Total number of articles, authors, and coauthor pairs.

| Database | Number of Articles | Number of Authors ($p$) | Number of Coauthor Pairs ($q$) |
|---|---|---|---|
| Leishmaniasis | 191 | 371 | 550 |
| Trypanosomiasis | 337 | 669 | 1314 |
| Filariasis | 617 | 1007 | 2083 |

Threshold Choices and the Statistical Validity of Partitions

Presented below is an investigation of the relationship among $|I_{ro} - I_{re}|$, $t_c$, and $t_p$ for each of the three databases. The results are described in terms of the range of statistically valid partitions and the location of statistically preferred partitions.

Results for the Leishmaniasis database are presented in Figure 3. In the figure, $|I_{ro} - I_{re}|$ is plotted against $t_c$ for fixed values of $t_p$ in the range $1 \le t_p \le 4$. It can be seen that for all values of $t_p$, the greatest value of $|I_{ro} - I_{re}|$ occurs at $t_c = 1$; the maximum value of $|I_{ro} - I_{re}|$ occurs at $t_p = 1$ and $t_c = 1$. Thus, for all values of $t_p$, the statistically preferred partitions occur at $t_c = 1$, and the optimal partition (from a statistical perspective) occurs at $t_p = 1$ and $t_c = 1$.

It might be inferred from these initial results that a coauthor graph, which includes all contributing authors and all collaborating pairs of authors, produces the “natural” structure and the statistically preferred partition, and that higher values of $t_p$ or $t_c$, which severely restrict the number of points and lines in the graph, produce “artificial” structures and less meaningful partitions. Subsequent results will show that the validity of this speculation varies as a function of the size of the graph.

FIG. 3. The absolute difference between the relative statistical disorder of observed and expected partitions ($|I_{ro} - I_{re}|$) as a function of the collaborative threshold ($t_c$) for fixed values of the productivity threshold ($t_p$ = 1, 2, 3, 4) and the Leishmaniasis database. [Plot not reproduced.]

TABLE 2. The number of points ($p$) in coauthor graphs as a function of the productivity threshold ($t_p$) for the Leishmaniasis, Trypanosomiasis, and Filariasis databases.

| Productivity Threshold ($t_p$) | Leishmaniasis: Number of Points $p$ (Percent) | Trypanosomiasis: Number of Points $p$ (Percent) | Filariasis: Number of Points $p$ (Percent) |
|---|---|---|---|
| 1 | 371 (100.0) | 669 (100.0) | 1007 (100.0) |
| 2 | 78 (21.0) | 172 (25.7) | 279 (27.7) |
| 3 | 26 (7.0) | 81 (12.1) | 157 (15.6) |
| 4 | 16 (4.3) | 40 (6.0) | 96 (9.5) |
| 5 | 7 (1.9) | 16 (2.4) | 71 (7.1) |
| 6 | 4 (1.1) | 11 (1.6) | 50 (5.0) |
| 7 | | 8 (1.2) | 31 (3.1) |
| 8 | | 4 (0.6) | 25 (2.5) |

TABLE 3. The number of lines ($q$) in coauthor graphs as a function of the collaborative threshold ($t_c$) for the Leishmaniasis, Trypanosomiasis, and Filariasis databases.

| Collaborative Threshold ($t_c$) | Leishmaniasis: Number of Lines $q$ (Percent) | Trypanosomiasis: Number of Lines $q$ (Percent) | Filariasis: Number of Lines $q$ (Percent) |
|---|---|---|---|
| 1 | 550 (100.0) | 1314 (100.0) | 2083 (100.0) |
| 2 | 63 (11.5) | 202 (15.4) | 402 (19.3) |
| 3 | 18 (3.3) | 68 (5.2) | 165 (7.9) |
| 4 | 4 (0.7) | 17 (1.3) | 70 (3.4) |
| 5 | 1 (0.2) | 8 (0.6) | 40 (1.9) |
| 6 | | 5 (0.4) | 15 (0.7) |
| 7 | | 4 (0.3) | 10 (0.5) |
| 8 | | | 9 (0.4) |


The t test can be used to determine if there is a statistically significant difference between observed and expected partitions [16]. For the Leishmaniasis database, all observed partitions are statistically valid, although the magnitude of $|I_{ro} - I_{re}|$ decreases as $t_c$ is increased from one to three. Only the lower of the two critical thresholds can be identified with certainty, and unlike other bibliometric structures, the lower critical threshold occurs at the smallest similarity threshold; $t_c = 1$ in this case.
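The exact form of the t test applied in the study is not spelled out here, so the following is only one plausible reading, offered as a sketch: a one-sample t statistic that compares the mean relative statistical disorder of simulated random-graph partitions with the single observed value. The function name `t_statistic` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(simulated, observed):
    """One-sample t statistic for H0: mean of the simulated I_r equals observed.

    simulated: relative statistical disorder of partitions drawn under the RGH.
    observed: relative statistical disorder of the observed partition.
    The simulated values must not all be identical (nonzero standard deviation).
    """
    n = len(simulated)
    standard_error = stdev(simulated) / sqrt(n)
    return (mean(simulated) - observed) / standard_error


# Made-up simulated values and a made-up observed value, for illustration only.
t = t_statistic([0.5, 0.6, 0.7], 0.5)
print(round(t, 3))
```

The statistic would then be compared with a critical value of the t distribution with $n - 1$ degrees of freedom; a large absolute value leads to rejection of the RGH, that is, to a statistically valid partition in the terminology used above.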

The above results are quite different from those associated with the cocitation relationship and the indexing representation that have been investigated. Firstly, there is no peak in the relationship between $|I_{ro} - I_{re}|$ and the similarity threshold ($t_c$) for any fixed value of the productivity threshold ($t_p$); higher values of $t_c$ for any value of $t_p$ produce smaller values of $|I_{ro} - I_{re}|$. Secondly, the range of statistically valid partitions produced by the coauthor relationship is greater than the range produced by the other bibliometric relationships that have been investigated. These results suggest, and subsequent results confirm, that coauthorship is much more likely than cocitation or an indexing representation to produce statistically valid partitions. The highly selective nature of collaborative relationships ensures that the resulting structures and partitions are statistically meaningful.

Results for the Trypanosomiasis database are presented in Figure 4. In the figure, $|I_{ro} - I_{re}|$ is plotted against $t_c$ for fixed values of $t_p$ in the range $1 \le t_p \le 6$. For this database, not all statistically preferred partitions occur at the lowest threshold values. At both $t_p = 2$ and $t_p = 3$, the maximum values of $|I_{ro} - I_{re}|$ and the preferred partitions occur at $t_c = 2$. In both cases, there is evidence of an emerging regularity similar to that observed in other bibliometric relationships and illustrated in Figure 2. As in the case of the Leishmaniasis database, however, the maximum value of $|I_{ro} - I_{re}|$ for any value of $t_p$ and the optimal partition occur at $t_p = 1$ and $t_c = 1$. As before, all observed partitions are statistically valid, and the upper critical threshold is not detected even though the magnitude of $|I_{ro} - I_{re}|$ is small at $t_c = 6$.

The results for the Filariasis database are presented in Figure 5. In the figure, $|I_{ro} - I_{re}|$ is plotted against $t_c$ for fixed values of $t_p$ in the range $1 \le t_p \le 9$. The results show that for several values of $t_p$ ($t_p$ = 2, 3, 4, and 5), the magnitude of $|I_{ro} - I_{re}|$ is relatively “small” when $t_c = 1$. For these values of $t_p$, the magnitude of $|I_{ro} - I_{re}|$ increases to a maximum value and then decreases as $t_c$ is increased from one to ten. For the values of $t_p$ in the range $2 \le t_p \le 5$, the maximum values of $|I_{ro} - I_{re}|$ and the preferred partitions occur at intermediate values of $t_c$. More important, the maximum value of $|I_{ro} - I_{re}|$ for any value of $t_p$ and the optimal partition occur at $t_p = 2$ and $t_c = 2$. As before, virtually all partitions are statistically significant so that only the lower critical threshold at $t_c = 1$ can be detected. That the range of statistical significance cannot be defined precisely over such a wide range of $t_c$ values is indicative of the highly selective (nonchance) factors that influence collaboration. The results for Filariasis reveal the emergence of a regularity in the magnitude of $|I_{ro} - I_{re}|$ as a function of $t_c$ that has been observed in other bibliometric structures.

Discussion

In Table 1, it can be seen that the three databases are ordered by the number of articles, authors, and pairs of collaborating authors. The Filariasis database includes almost three times as many authors and almost four times as many coauthor pairs as the Leishmaniasis database. The results suggest that above some critical number of authors, statistically preferred partitions include only those authors and pairs of authors who have published two or more articles. In an active field that involves many investigators, the authors who publish only one article may contribute a degree of randomness to the structure of the coauthor graph and the partition of the associated set of authors. At $t_p = 2$ and $t_c = 1$, the ratio of lines to points is so great that relatively few distinct partitions can occur. In this case, the relative statistical disorder of observed and expected partitions can be similar. Thus, $t_p = 2$ and $t_c = 2$ can be the first combination of threshold values that produces statistically preferred partitions in a field which relies heavily on collaboration.

FIG. 4. The absolute difference between the relative statistical disorder of observed and expected partitions ($|I_{ro} - I_{re}|$) as a function of the collaborative threshold ($t_c$) for fixed values of the productivity threshold ($t_p$) and the Trypanosomiasis database. [Plot not reproduced.]

FIG. 5. The absolute difference between the relative statistical disorder of observed and expected partitions ($|I_{ro} - I_{re}|$) as a function of the collaborative threshold ($t_c$) for fixed values of the productivity threshold ($t_p$) and the Filariasis database. [Plot not reproduced.]

Conclusion

In this article the structure of coauthor graphs and the statistical validity of the associated author partitions are investigated as a function of productivity and collaborative thresholds. The productivity threshold controls the number of authors (points) in a coauthor graph, and the collaborative threshold controls the number of coauthor pairs (lines) in the graph. The results are based on three databases representing the tropical diseases Leishmaniasis, Trypanosomiasis, and Filariasis. The Leishmaniasis database includes the smallest number of authors and coauthor pairs, and the Filariasis database includes the largest number of authors and coauthor pairs.

The results show that the structure of a coauthor graph is strongly influenced by the productivity and collaborative thresholds. The number of points in a coauthor graph decreases rapidly as the productivity threshold is increased, and the number of lines decreases rapidly as the collaborative threshold is increased. The former result is expected because of the highly skewed nature of author productivity. The latter result is consistent with that produced by the cocitation relationship.

The assumption underlying the random-graph hypothesis defines a set of points and lines for which there is no clustering structure and provides a formal criterion for evaluating coauthor partitions. The absolute difference between the relative statistical disorder of observed and expected (on the basis of chance) partitions is indicative of the influence of nonchance factors and provides a measure of “statistical preference.” For the Leishmaniasis and Trypanosomiasis databases, the statistically preferred partitions occur at the lowest threshold values. The preferred partitions occur when all points and lines appear in the coauthor graph. For most productivity thresholds, the difference between the relative statistical disorder of observed and expected partitions decreases as the collaborative threshold is increased. For the Filariasis database, the difference between observed and expected partitions is relatively small at the lowest threshold values. For several productivity threshold values, the difference increases to a maximum and then decreases as the collaborative threshold is increased. A regularity emerges which is similar to that observed for the cocitation relationship and an indexing representation. In this case, the statistically preferred partition includes only those authors and pairs of authors who have published two or more articles. Unlike the cocitation relationship and an indexing representation, the highly selective nature of the collaborative relationship produces a wide range of threshold values for which there is a statistically significant difference between observed and expected partitions.

Statistical considerations suggest that the threshold values that dictate the structure of a coauthor graph and the partition of the author set cannot be chosen arbitrarily. For “large” databases it may be necessary to exclude authors and pairs of authors who have published only one article. Because authors and pairs of authors who publish only one article are not likely to occupy important positions in a communication structure, it is empirically reasonable to include only those who have published two or more articles. It remains to be shown how the least productive authors and coauthor pairs contribute to statistically meaningful structures and partitions in the case of “small” databases and contribute a degree of randomness to large databases. It also remains to be shown how statistically preferred partitions are related to the empirical significance of the same partitions [29]. A tentative inference suggests that imposing a variety of sociological relationships on a set of investigators, in an effort to produce a comprehensive communication structure, may increase the likelihood of producing statistically invalid partitions. The numerous pairwise relationships associated with the comprehensive structure might make all pairwise relationships appear to be equally likely, especially at low threshold values.

References

1. Price, D. deS.; Beaver, D. deB. “Collaboration in an Invisible College.” American Psychologist. 21:1011-1018; 1966.
2. Crawford, S. “Informal Communication Among Scientists in Sleep Research.” Journal of the American Society for Information Science. 22:301-310; 1971.
3. Crane, D. Invisible Colleges: Diffusion of Knowledge in Scientific Communities. Chicago: University of Chicago Press; 1972.
4. Griffith, B. C.; Mullins, N. C. “Coherent Social Groups in Scientific Change.” Science. 177:957-964; 1972.
5. Mullins, N. C. Theories and Theory Groups in Contemporary American Sociology. New York: Harper and Row; 1973.
6. Pao, M. L. “Co-authorship and Productivity.” Proceedings of the American Society for Information Science. 19:279-281; 1980.
7. Price, D. deS. Little Science, Big Science. New York: Columbia University Press; 1968.
8. Gorden, M. D. “A Critical Assessment of Inferred Relations Between Multiple Authorship, Scientific Collaboration, the Production of Papers and Their Acceptance for Publication.” Scientometrics. 2:198-201; 1980.
9. Heffner, A. G. “Funded Research, Multiple Authorship [,and] Collaboration in Four Disciplines.” Scientometrics. 3:5-12; 1981.
10. Clarke, B. L. “Multiple Authorship Trends in Scientific Papers.” Science. 143:822-824; 1964.
11. Smith, M. “The Trend Toward Multiple Authorship in Psychology.” American Psychologist. 13:576-577; 1950.
12. Stefaniak, B. “Individual and Multiple Authorship of Papers in Chemistry and Physics.” Scientometrics. 4:331-337; 1982.
13. Beaver, D. deB.; Rosen, R. “The Professional Origins of Scientific Co-authorship: Studies in Scientific Collaboration Part III.” Scientometrics. 1:231-245; 1977.
14. Shaw, W. M., Jr. “Statistical Disorder and the Analysis of a Communication Graph.” Journal of the American Society for Information Science. 34(2):146-149; 1983.
15. Harary, F. Graph Theory. Reading, MA: Addison Wesley; 1969.
16. Frank, O. “A Review of Statistical Methods for Graph Analysis.” In: S. Leinhardt, Ed. Sociological Methodology. San Francisco: Jossey-Bass; 1981.
17. Pielou, E. C. An Introduction to Mathematical Ecology. New York: Wiley-Interscience; 1969.
18. Brillouin, L. Science and Information Theory. New York: Academic Press; 1962.
19. Logan, E. L.; Shaw, W. M., Jr. “On the Statistical Validity of Coauthor Partitions.” Proceedings of the American Society for Information Science. 21:208-211; 1984.
20. Dubes, R.; Jain, A. K. “Clustering Methodologies in Exploratory Data Analysis.” In: M. C. Yovits, Ed. Advances in Computers. 19:113-228; 1980.
21. Karonski, M. “A Review of Random Graphs.” Journal of Graph Theory. 6:349-389; 1982.
22. Shaw, W. M., Jr. “A Bibliometric Software Package.” Paper presented at the Annual Conference of the American Association of Library Schools, Denver, CO; January, 1982.
23. Ling, R. F. “Probability Theory of Cluster Analysis.” Journal of the American Statistical Association. 68(34):159-164; 1973.
24. Shaw, W. M., Jr. “Critical Thresholds in Co-Citation Graphs.” Journal of the American Society for Information Science. 36(1):38-43; 1985.
25. Shaw, W. M., Jr. “On the Statistical Validity of Document Partitions.” Proceedings of the American Society for Information Science. 22:116-119; 1985.
26. Logan, E. L. “An Investigation of the Statistical Significance of Coauthor Clustering.” Ph.D. dissertation, Case Western Reserve University; 1984.
27. Lotka, A. J. “The Frequency Distribution of Scientific Productivity.” Journal of the Washington Academy of Sciences. 16(12):317-323; 1926.
28. Ott, L. Statistics: A Tool for the Social Sciences. Delmont, CA: Wadsworth Press; 1970.
29. Logan, E. L. “Subject Specificity of Co-author Clusters.” Proceedings of the American Society for Information Science. 22:124-126; 1985.
