Optimal Grouping of Income Distribution Data

Optimal Grouping of Income Distribution DataAuthor(s): B. B. Aghevli and F. MehranSource: Journal of the American Statistical Association, Vol. 76, No. 373 (Mar., 1981), pp. 22-26Published by: American Statistical AssociationStable URL: http://www.jstor.org/stable/2287035 .

Accessed: 15/06/2014 19:07

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at .http://www.jstor.org/page/info/about/policies/terms.jsp

.JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range ofcontent in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new formsof scholarship. For more information about JSTOR, please contact [email protected].

.

American Statistical Association is collaborating with JSTOR to digitize, preserve and extend access to Journalof the American Statistical Association.

http://www.jstor.org

This content downloaded from 195.34.78.237 on Sun, 15 Jun 2014 19:07:59 PMAll use subject to JSTOR Terms and Conditions

http://www.jstor.org/action/showPublisher?publisherCode=astata

http://www.jstor.org/stable/2287035?origin=JSTOR-pdf

http://www.jstor.org/page/info/about/policies/terms.jsp


Optimal Grouping of Income Distribution Data B. B. AGHEVLI and F. MEHRAN*

We consider the problem of grouping income distribution data into a given number of groups such that the concealed income differences due to grouping are minimized. When income differences are measured by Gini's pairwise differences, this means minimizing the area between the grouped and ungrouped Lorenz curves. In this case, the necessary condition for optimal grouping is that each group limit be equal to the average income in its two adjacent groups. An iterative procedure for computation and applications including fractile groupings are dis- cussed. Alternative optimal groupings based on other measures of income differences are also considered.

KEY WORDS: Grouped data; Optimal grouping; Income groups; Lorenz curve; Gini index; Fractile grouping.

1. INTRODUCTION For practical reasons, almost all income distribution

data are tabulated in grouped form. As grouping generally results in concealing differences, the question arises as how the data should be best grouped so that the concealed differences are minimized.

One broad formulation of the grouping problem is as follows: given data on a distribution of income (expen- diture, wealth, etc.) group the data into a specified number of groups in such a way that income differences are minimized within the groups and maximized between the groups. In any particular circumstance, the optimal grouping depends on the manner in which differences are measured, which in turn depends on the ultimate use of the data. For example, if the grouped data are to be used for standard estimation of the Gini index of income inequality, one would require a grouping that minimizes the corresponding estimation error, a criterion that amounts to measuring income differences by Gini's pairwise difference in our formulation. Alternatively, if the income data are to be used for stratification in a sample survey, the appropriate criterion for measuring income differences may be the variance (Fisher 1958). The variance has also been used as a basis for optimal grouping for other purposes (Dalenius 1950, Cox 1957, Bofinger 1975).

In Section 2, we consider the Gini's pairwise difference as the measure of income difference and derive the condition that the group limits must satisfy for optimal group-

* B.B. Aghevli is Assistant Professor, College of Statistics and In- formatics, Tehran, Iran. F. Mehran is Senior Statistician, International Labour Office, Geneva, Switzerland. The International Labour Office is not responsible for any views expressed here. The authors are grateful to the referees for their valuable comments, especially for suggesting the name of the average condition.

ing. The optimal group limits may generally be computed on the basis of an iterative procedure described in Section 3. Applications of the optimal grouping procedure are presented in Section 4. Finally, in Section 5 we consider alternative measures of income differences and state the corresponding optimal grouping conditions.

2. THE AVERAGE CONDITION Let the nonnegative values x, y, . . . represent the in-

comes of members of a population with income distribution function F, with continuous and differentiable den- sity f and finite mean R. Suppose we want to group the data into k income intervals (ao - a,), (a, - a2), .

(ak-I - ak), where ao < a < . . .< ak, such that the concealed differences due to grouping are minimized. If differences are measured by Gini's absolute pairwise difference, that is, I x - y j, then the problem of optimal grouping is to derive group limits ao, al,... , ak, which minimizes the sum

k ai a

0 = E fai fa | x - y I dF(x)dF(y). (2.1)

Since01is bounded, 0 2 Ru, a minimum exists and the minimizing group limits may be obtained by setting to zero the partial derivatives of 0 with respect to ai, i

1, .. .,k- 1. This gives

-0 = 2f(ai)[I ai - xIdF(x) dai rai+ I

Ix- ai I dF(x) =0 ai+

Frai+I = 2f(ai) ai dF(x)

ai+ I

i- J xdF(x) = 0,

which, assuming f(ai) ? 0 and F(a1+ 1) ? F(aj_1 ), leads to

ai+ I ai+ I

ai = J xdF(x) J dF(x) ail ai- (2.2)

= E(XIai- ?X<ai+1),

where E is the average (expected) value operator. Thus we conclude that for a grouped income distribution func-

? Journal of the American Statistical Association March 1981, Volume 76, Number 373

Applications Section

22



Aghevli and Mehran: Optimal Grouping 23

tion to be optimal, it must satisfy the "average" condition (2.2); that is, each group limit must be equal to the average income of the population in its two adjacent groups.

This result has a simple interpretation in terms of the Lorenz curve and the Gini index of income inequality. Let yi = F(ai) - F(ai -1 ) and R,i f -a, x dF(x)lyi be the fraction of population and the average income in group i, respectively, i = 1, 2, ... , k. Then we have

k

0 = jJ I x - y I dF(x)dF(y) - p .izj I Fi - IljI ij (2.3)

= 2u (G - Gg),

where G = (2,u) -J I x - y I dF(x)dF(y) is the Gini index of the ungrouped distribution F and Gg = (2j x) y,ji

yj I Ii - Fj I is the Gini index of the grouped distribution (Yntema 1933). Hence, by minimizing0we minimize the error due to grouping in the estimation of the Gini index from grouped data. Note also that because G and ,u are fixed, minimizing0also implies maximizing Gg or, equivalently, the between groups income differences.

In geometric terms, because the Gini index is twice the area between the Lorenz curve and the diagonal line of perfect equality (see, e.g., Kendall and Stuart 1969, Sec. 2.25), it follows that minimizing0implies minimizing the area between the ungrouped Lorenz curve, L(p) = (IL)- l fA F- (t)dt, Where F- ' (t) = inf { x: F(x) - t }, 0 t c 1 (Gastwirth 1971) and the grouped Lorenz curve defined by the convex broken line with k break points at pi F(Ra), i = 0, 1, . . ., k (see Figure 1). A simple

1 ~~~~~~~~~~~~~B

Lp)c

L(p)

A

0 p

Figure 1. Construction of a Break Point for Optimal Grouping Into Two Groups (The tangent at C is parallel to AB. This maximizes the area of triangle ABC or, equivalently, minimizes the shaded area)

geometric argument shows that the area is minimized when the break points are picked such that at each break point the tangent to L(p) is parallel to the line connecting the two adjacent break points, that is,

=( = L(pi+) -L(pi-1) i = 1, 2, ... . k - 1, Pi+1 -Pi-I

(2.4) where L'(.) stands for the derivative of L(.). Note that L'(pi) = ailli and [L(pi+ 1) - L(pi 1)]I(pi+ 1 - Pi- 1) is the fraction of income in the two groups i and i + 1. Hence (2.4) is the geometric equivalent of the average condition (2.2).

Although the average condition is derived for an income distribution function that is continuous and differentiable, it also essentially applies to discrete income data. Consider n ordered numbers xi < x2 Xn . * x", representing the incomes of n persons to be grouped into k groups

Xl X . . .Xml;Xml+i 1 X.M2; . . . ;XXmk+, 1 . * *Xmk;

wherein, = npI, m2 = np2, . . .,mk = npkwithO = Po < Pl !5 -<Pk = 1. In this case the sum of concealed income differences due to grouping 0 is a triple sum- mation obtained by replacing F with the empirical distribution function Fn in (2.1). Similar to the continuous case, 0 is proportional to the area between the grouped and ungrouped Lorenz curves except that now the ungrouped Lorenz curve L(p) is not smooth as shown in Figure 1 but a broken line with break point for each x,, x2, . . . ,xn; that is, L(p) = (xl + X2 + . * . + Xlnp])I (xI + x2 + . . . + xn), where [np] stands for the greatest integer less than or equal to np and 0 c p c 1. In the discrete case, therefore, the tangents to L(p) at the break points do not exist: the derivative L' in (2.4) must be replaced with left and right derivatives as follows:

L(pi+) - L(pi) L(pi+l) - L(p_ 1) Pi -Pi Pi+ I -Pi- I

L(pi) - L(pi-) L(p+'l) - L(pi-1) -~~~~~~~~~~~~~~~~~ pi- Pi Pi+I - Pi-I

where Pi' and pi- are defined by [npi+] = mi + 1 and [npi-] = mi- 1, respectively, which simplifies to

xn cXni- l

+ l j+. Xtni+ (2.5) Mi+ I- Mi-I

Because x,"n is the largest income in group i and x,n,+, is the smallest income in group i + 1, it follows from (2.5) that (x,n-1+1 + ** + x,ni+)I(mi+l - mi_1) delimitate group i from group i + 1. Let

(Xmi I+I + + X,n I) ai = , i = 1,..., k - 1, (mi+ I - mi-)

(2.6)

and note that a, in (2.6) is equal to the average income in the two adjacent groups i and i + 1 and, therefore,



24 Journal of the American Statistical Association, March 1981

(2.6) is equivalent to (2.2) in which the expectation is taken over the empirical distribution function.

3. ITERATIVE PROCEDURE

Consider a set of n observations from an income distribution function F to be grouped into k groups. As shown in the preceding section, for optimal grouping we must chose group limits at that satisfy the average condition, that is, the system of equations given by (2.2) or (2.6). When a closed-form solution does not exist, the form of these equations suggests the use of an iterative procedure as now described. Starting with an initial grouping based on a set of values a/0), i = 1, 2, ... I k - 1, a new set of values aP', i 1,2, ... , k - I, are uniquely obtained by computing for each i the average income in the two initial groups i and i + 1. This new set a/') gives a new grouping of the data, which is then used to compute a net set ai/2), and so on. In general, at stage s of the iteration, each new ai(s+ ) is derived from the previous set a(s) by

a(s+1) = E(x aii (s) x < ai?i(s)) (3.1)

The sufficient condition for convergence is given by |aai+ i + aai-l | . + , < 1

3aa aai

(Scarborough 1966, Sec. 8.2), which simplifies to -ai ai) (aii - ai- ) I

-X+ -a < fa (3.2) XA+ l A,_ I f(ai)

where A' = yij 1 + -yi and yi is the fraction of population in group i.

Because the average condition is a necessary but not sufficient condition for optimal grouping, when the iterative procedure converges, the limiting values ai may not be the optimal group limits in the sense that they may only correspond to a relative minimum of **

4. APPLICATIONS

Applications of various aspects of the optimal grouping method follow.

4.1 Grouping in Two Groups When the income data are to be grouped into two

groups, the optimal group limit is the average income of the population,

a, E(x I ? = ao < x < a2 = ) = (4.1)

When grouping into three or more groups, the optimal grouping depends on the underlying distribution function.

4.2 Uniform Distribution and Population Fractiles If the underlying distribution function is uniform over

the whole range of incomes (ao, ak), then the average

condition reduces to

a1 = 1(a1?1 + a 1), i = 1, 2, k - 1, (4.2) which implies that ai = ao + (Ilk) (ak - ao) and pi - ilk. Hence, for the uniform distribution optimal grouping is achieved by grouping the data such that equal number of persons fall into each group. This result in turn implies that the standard and recommended practice of using population fractiles for grouping income data (United Nations 1977, p. 29) is not an optimal choice for income distributions, which typically are skewed to the right (Gastwirth 1972, p. 310).

4.3 Pareto Distribution and Income Fractiles For the Pareto distribution, F(x) = I - (xIao)-',

x ? ao and a > 0, the average condition is expressed by ao ai- i - ai+ i l-0+

a - 1 ai-i - ai+i (4.3)

i= 1,2,...,k- 1,

for which the solution must in general be obtained by the iterative procedure described earlier. However, when a = 2, (4.3) simplifies to 1lai = (/lai- I + 11ai 1)/2 and it may be solved directly to obtain ai = ao(klk - i). In terms of the Lorenz curve, we have Pi = 1 - (1 - ilk)2 and L(pi) = 1 - (1 - pi)"/2 = ilk. Hence for the Pareto distribution with shape parameter a = 2, optimal grouping is achieved if an equal amount of income falls into each group. This method of grouping may be called the income fractile grouping, in contrast with the population fractile grouping in the case of the uniform distribution.

The difference between the two optimal groupings may be best viewed in terms of the Lorenz curve diagram. While the population fractile grouping implies the division of the horizontal axis into equal distances, the income fractile grouping implies equal distance division along the vertical axis. The vertical division emphasizes the upper income range more than the horizontal division does, thus making it a more suitable grouping procedure for typical income data.

Table 1 shows the numerical values for the two fractile groupings in 10 groups (the deciles) for the Pareto distribution with a = 2. The corresponding Gini indices for 5, 10, and 20 groups are compared with the theoretical Gini index in the following tabulation:

k G(population) G(income) G(theoretical) 5 .2995 .3200 .3333 10 .3210 .3300 .3333 20 .3289 .3325 .3333

Note that the income fractile grouping results in closer estimates of the Gini index then the population fractile grouping. As the number of groups increases, the difference between the two groupings diminishes.

4.4 Grouping by Income Intervals Given in the Appendix is a set of income data obtained

from the results of a household survey conducted jointly



Aghevil and Mehron: Optimal Grouping 25

Table 1. Population and Income Deciles for Pareto Distribution (ot = 2)

a. Population Deciles b. Income Deciles

Cumulative Cumulative Cumulative Cumulative Population Income Income Population

10 5 10 19 20 10 20 36 30 16 30 51 40 22 40 64 50 29 50 75 60 37 60 84 70 45 70 91 80 54 80 96 90 68 90 99

100 100 100 100

by the International Labour Office and the University of Nairobi (Knowles and Anker 1977). The data refer to the annual earnings in shillings of a random subsample of size 200 from 566 urban employees in the original sample. Applying the iterative procedure (3.1) to the data with an initial grouping into five groups of equal size, we obtain convergence after 21 iterations, and the resulting optimal group limits are 4,507, 7,363, 12,916, and 23,648 shillings, which we round to convenient figures for the final tabulation shown in Table 2.

The Gini index based on the grouped data is Gg = .4125 and based on the ungrouped data is G = .4348. Thus the concealed differences due to grouping is0 = (2,u) (G - Gg) = 420.5 shillings. This value would have been 540.8 shillings or almost 30 percent higher had we tabulated the data into five intervals with group limits 2,500, 5,000, 10,000, and 20,000 shillings, that is, intervals that cover the same span in logarithmic terms as suggested in United Nations (1977).

5. ALTERNATIVE OPTIMAL GROUPINGS

The average condition was based on the absolute pairwise difference, I x - y 1, as the measure of income difference. This choice was made because it corresponds to the commonly used Gini index of income inequality. One could, however, use other measures of income inequality and derive alternative optimal grouping condi-

Table 2. Kenyan Earnings Data in Five Groups

Earnings Group Number of Average (shillings) Persons Earnings

Less than 5,000 75 3,508.8 5,000-7,499 45 6,234.4 7,500-12,499 38 9,521.0 12,500-24,999 29 17,936.6 25,000 and over 13 35,537.6

Total 200 9,438.3

Source: Knowles and Anker (1977). Data listed in the Appendix.

tions. For example, the squared mean deviation (x - Vti)', where t,i is the mean income in group i, a measure which corresponds to the variance, leads to the optimal grouping condition

ai= A(i + Pti+I), i = 1,2, . . .2 , k - 1, (5.1) (Dalenius 1950, Fisher 1958). Another example is given by the absolute median deviation, I x - ,i 1, where fRi is the median income in group i. This gives ai = (iii + fi+ 1)/ 2, i = 1, 2, . . . , k - 1.

For a more general example, consider the family of measures of income inequality defined by Mh =

[f h(x)dF(x) - h(pf)]lh([i), where h(x) is a convex function with h(O) = 0, (Gastwirth 1975). The optimal grouping condition for this family is given by

1 ai (h= h') [(pii+1h+j - tiih,) - (hi+, -hi),

i= 1,...,k- 1, (5.2)

where hi = h(pRi) and hi is the derivative of h evaluated at Ri. For the particular case h(x) = x2, Mh is the square of the coefficient of variation and (5.2) reduces to (5.1). For h(x) = x log(x), Mh corresponds to the Theil index of income inequality (Theil 1967), and (5.2) becomes ai = (>i+ - VLi)/(logli+ I - logRi), i = 1, 2, . . . , k - 1. Similarly, one may use the family of measures of in-

come inequality introduced by Atkinson (1970) and Meh- ran (1976) to derive alternative optimal grouping conditions.

Note that in any particular application different optimal grouping conditions give different numerical results. The choice must be made with reference to the ultimate use of the data. For example, consider a problem raised by Hansen, Hurwitz, and Madow (1953, p. 235) based on the frequency distribution of 13,641 Atlanta families in 1933 grouped in 10 income classes. The problem was in the context of stratified sampling and essentially consisted of combining the ten classes into three larger strata such that the estimate of mean income for all families, based on a stratified sample, had a small variance. The solution, in effect given by Fisher (1958), is to use (5.1) as the optimal grouping condition and combine the ten classes in three strata such that 69 percent of the families are in stratum 1, 29 percent in stratum 2, and 2 percent in stratum 3. Had we used, for instance, the average condition, which minimizes Gini's mean difference and not the variance, we would have obtained a different distribution of families, namely, 57 percent in stratum 1, 28 percent in stratum 2, and 15 percent in stratum 3, and the resulting estimated sample. standard deviation, $663, would have been higher than the estimate, $558, based on the grouping that minimizes the variance.

APPENDIX

The 200 Kenyan annual earnings data (in shillings) (Knowles and Anker 1977) used in the text are listed on the next page.



26 Journal of the American Statistical Association, March 1981

600 3,323 4,200 5,095 6,461 8,307 10,800 19,200 629 3,323 4,200 5,316 6,507 8,400 10,800 19,200 923 3,600 4,246 5,400 6,507 8,640 10,800 19,938

1,061 3,600 4,320 5,400 6,507 8,640 10,800 20,169 1,107 3,600 4,320 5,400 6,600 8,676 10,800 21,692

1,200 3,600 4,320 5,400 6,600 8,861 10,800 21,912 1,661 3,600 4,338 5,400 6,646 8,861 10,800 22,000 1,800 3,600 4,338 5,400 6,646 8,861 12,096 22,153 1,800 3,600 4,338 5,427 6,646 8,861 12,600 22,153 2,160 3,600 4,430 5,538 6,840 9,000 12,960 22,153

2,215 3,634 4,430 5,538 6,960 9,072 12,960 22,375 2,282 3,692 4,555 5,538 7,020 9,470 14,399 24,000 2,307 3,715 4,615 5,760 7,199 9,540 14,400 26,400 2,326 3,738 4,723 5,760 7,200 9,600 14,400 26,400 2,400 3,840 4,776 5,793 7,200 9,600 14,400 26,584

2,520 3,876 4,800 5,815 7,200 9,600 14,400 27,692 2,653 4,000 4,800 6,000 7,200 9,600 15,507 30,000 2,968 4,004 4,800 6,000 7,200 9,761 16,269 31,200 2,990 4,020 4,800 6,000 7,200 9,761 16,615 32,400 3,000 4,020 4,800 6,000 7,430 9,761 16,615 33,230

3,000 4,080 4,800 6,000 7,753 9,761 16,800 33,960 3,000 4,140 4,800 6,100 7,753 9,865 16,800 34,200 3,156 4,153 4,800 6,230 8,160 9,969 16,984 42,000 3,240 4,200 4,800 6,230 8,238 10,680 18,000 50,520 3,253 4,200 4,800 6,240 8,307 10,744 19,107 67,403 [Received February 1979. Revised September 1980.]

REFERENCES

ATKINSON, A.B. (1970), "On the Measurement of Inequality," Jour- nal of Economic Theory, 2, 244-263.

BOFINGER, E. (1975), "Optimal Condensation of Distributions and Optimal Spacing of Order Statistics," Journal of the American Sta- tistical Association, 70, 15 1-154.

COX, D.R. (1957), "Note on Grouping," Journal of the American Sta- tistical Association, 52, 543-547.

DALENIUS, T. (1950), "The Problem of Optimal Stratification," Skan- dinavisk Aktuarietidskrift, 33, 203-213.

FISHER, W.D. (1958), "On Grouping for Maximum Homogeneity," Journal of the American Statistical Association, 53, 789-798.

GASTWIRTH, J.L. (1971), "A General Definition of the Lorenz yurve," Econometrica, 39, 1037-1039.

(1972), "The Estimation of the Lorenz Curve and The Gini Index," Review of Economics and Statistics, 54, 306-316.

(1975), "The Estimation of a Family of Measures of Economic Inequality," Journal of Econometrics, 3, 61-70.

HANSEN, M.H., HURWITZ, W.N., and MADOW, W.G. (1953), Sam- ple Survey Methods and Theory, Vol. I, New York: John Wiley.

KENDALL, M.G., and STUART, A. (1969), The Advanced Theory of Statistics, Vol. 1, New York: Hafner.

KNOWLES, J.C., and ANKER, R. (1977), "An Analysis of Income Transfers in a Developing Country: The Case of Kenya," Working Paper WEP 2-21/WP.59, Geneva: International Labour Office.

MEHRAN, F. (1976), "Linear Measures of Income Inequality," Econ- ometrica, 44, 805-809.

SCARBOROUGH, J.B. (1966), Numerical Mathematical Analysis (6th ed.), Baltimore: Johns Hopkins Press.

THEIL, H. (1967), Economics and Information Theory, Amsterdam: North-Holland.

UNITED NATIONS (1977), Provisional Guidelines on Statistics of the Distribution of Income, Consumption and Accumulation of House- holds, Studies in Methods, Series M, No. 61, New York: United Nations Publications.

YNTEMA, D. (1933), "Measures of the Inequality in the Personal Dis- tribution of Wealth or Income," Journal of the American Statistical Association, 28, 423-433.



Documents

Optimal Grouping of Income Distribution Data