
Methods of ecological inference for disaggregation problems in operations research

Rogério Silva de Mattos [1]

Universidade Federal de Juiz de Fora
Faculdade de Economia e Administração
[email protected]

Abstract. Operations research techniques are widely used in business project planning. Usually, a preliminary task of project planners is to assess the potential market for a business project, which involves defining the target population and evaluating its size. As the potential market is part of a larger aggregate of consumers, techniques for disaggregate data estimation (DDE) are often needed, and entropy optimization techniques have long been used for this purpose. However, a line of research on DDE techniques showing promising developments is Ecological Inference (EI). The paper presents a brief review of the recent literature on EI techniques, aiming to disseminate them among operations research analysts, and also a new method proposed by the author to perform EI via the statistical technique known as the EM Algorithm.

Keywords: ecological inference, disaggregation, intensive computing, EM Algorithm.

Resumo. Técnicas de pesquisa operacional são muito usadas no planejamento de projetos empresariais. Uma tarefa preliminar dos planejadores consiste em avaliar o mercado potencial, o que implica em conceituar a população alvo e medir seu tamanho. Sendo o mercado potencial parte de um agregado maior de consumidores, técnicas para estimação de dados desagregados (EDD) se fazem necessárias e, há muito, métodos de otimização da entropia vêm sendo usados com essa finalidade. Entretanto, uma linha de pesquisa em EDD que vem apresentando desenvolvimentos promissores é Inferência Ecológica (IE). Este artigo apresenta uma breve revisão da literatura recente sobre técnicas de IE, visando divulgá-las entre analistas de pesquisa operacional, e também um novo método proposto pelo autor para se fazer IE através da técnica estatística conhecida como Algoritmo EM.

Palavras-chave: inferência ecológica, desagregação, computação intensiva, Algoritmo EM.

1. Introduction

A number of business projects, such as launching a new product in the market, locating an industrial plant or a shopping center, and creating or expanding a distribution network, to give just a few examples, make intensive use of operations research (OR) techniques. However, at the early stages of project planning, a preliminary task, important in determining project feasibility, is to assess the potential market (or target population) to be served by the products/services of the business activity. Since the potential market is part of a larger aggregate of individuals/institutions, techniques for disaggregate data estimation (DDE) are needed to evaluate a variety of attributes of interest, such as market size and mean income by consumer group.

By far, the most widely used technique for market research has been the sample survey, which allows for the assessment of a variety of market characteristics and is firmly grounded in statistical theory. A number of reasons, in particular the high costs of conducting sample surveys, have led the scientific literature to devise alternatives. A major one among these alternatives is entropy optimization (EO), widely used as a DDE tool in urban and transport planning research since the 1960s (Wilson, 1970; Novaes, 1982). However, a line of research on DDE currently showing promising developments is Ecological Inference (EI). Researchers on EI techniques usually say that EI is concerned with inferring individual behavior from aggregate, or "ecological", data (e.g., King, 1997). As such, EI techniques are suitable for tackling DDE problems in various application areas, particularly those where socio-economic data are needed. OR analysts may face DDE problems, for instance, when determining efficient ways to supply a line of products to a stratified market: in this case, disaggregate market information is essential for the success of project planning.

[1] Ph.D. student in Control Theory and Statistics at the Department of Electrical Engineering/PUC-Rio. This paper is based on ongoing research being developed as part of my Ph.D. thesis work, and I gratefully acknowledge fellowship support from PICDT-UFJF/CAPES.

The field of EI emerged from American studies on disaggregate voting behavior in the first decades of the twentieth century, but only a small number of EI techniques had been proposed until recently. A new perspective was opened by a recent book by King (1997), in which the author introduced new approaches based on modern resources of mathematical statistics and intensive computing. His study rekindled the interest of various researchers in developing new EI techniques. In this paper, we review the new developments in this literature, aiming to disseminate them among OR analysts. We focus, though, on King's EI model, because we also present an alternative approach to estimating it, proposed by this paper's author and based on a statistical technique known as the EM Algorithm.

The rest of this paper is organized as follows: in Section 2, we state in more precise terms the type of DDE problem with which EI research is concerned; in Section 3, notation for the EI problem is introduced; in Section 4, the recent EI developments are briefly reviewed; in Section 5, the parametric model for EI proposed by King (1997) is described; and in Section 6, an alternative approach based on the EM Algorithm to estimate King's model is presented, together with the partial results already available. Finally, in Section 8, some remarks are made.

2. Disaggregation problems in OR

A variety of disaggregation problems that demand the use of OR techniques arise in business planning. For instance, the well-known "Transportation Model", presented in standard textbooks on OR (e.g., Whitehouse and Weschler, 1976), is a tool designed to tackle the problem of distributing resources available from various sources to various destinations, with the aim of minimizing the total transportation cost of the overall set of distribution flows. This problem is illustrated by Table 1.

Table 1. Illustration of the standard transportation problem

                      DESTINATIONS
SOURCES     D1      D2      ...     DN
S1          ?       ?       ...     ?
S2          ?       ?       ...     ?
...         ...     ...     ...     ...
SM          ?       ?       ...     ?

Information on the quantities of resources supplied at sources and demanded at destinations, and on the unit cost per travel route, is available to us. Our goal is to determine the resource flows from each source to each destination, that is, the unknown contents of the table's cells, represented by the question marks "?" in Table 1. To attain it, we could use a linear programming technique to find a best solution: a linear objective function representing the total cost of transportation would be minimized subject to a set of linear constraints representing adding-up to row (sources) and column (destinations) totals. More sophisticated approaches based on non-linear programming techniques might also be used.

A particular feature of this problem is that the cells' contents are decision variables under our control. When determining a best solution, what we are doing is deciding, aided by an OR technique, how much from each source we should allocate to each destination. The solution is thus a deterministic one. Another particular feature is that it is a disaggregation problem: what we know is the aggregate information from the row and column totals. The disaggregate information, that is, the unknown contents of the table's cells, has to be sought using an OR technique.

The EI problem is a disaggregation problem quite similar to that in Table 1, but with a few important differences. In EI, we also know the row and column (aggregate) totals of a table (or tables), and likewise we want to determine the (disaggregate) contents of the table's cells. Formally, the two problems may be represented with the same type of adding-up constraints. However, even when we know the aggregate information with certainty, EI solutions are always subject to uncertainty. When performing EI, we do not seek an efficient decision over variables under our control, but a solution that maximizes the chances that variables describing phenomena of interest assume certain values. Formally, therefore, the two problems have different types of objective functions, as they are solved for different purposes.

We mentioned in the Introduction that EO techniques have been widely used to solve OR problems where EI approaches would also be appropriate. However, while EO is a general approach to tackling a variety of disaggregation problems, EI is specific to the estimation of data, usually socio-economic data, unavailable in disaggregate form. Methodologically, EO makes use of pure mathematical programming techniques, which usually provide deterministic solutions; EI, in turn, is committed to statistical methods, where randomness and uncertainty are essential features of the solutions obtained.

3. The EI problem

In technical terms, the EI problem represents a situation where the analyst/planner is interested in cell data for one or more contingency tables (or values tables), but knows only the row and column totals of the table(s). These totals are called the aggregate (or ecological) data. The analyst's goal is to determine the contents of the table's cells. Table 2 depicts this situation for the simplest case, the 2×2 tables problem.

Table 2. Representation of the EI problem for the 2×2 tables case

                                 Variable B
                     1                  2                    Totals
Variable A   1       $\beta_i^{11}$     $1-\beta_i^{11}$     $X_i$
             2       $\beta_i^{21}$     $1-\beta_i^{21}$     $1-X_i$
Totals               $T_i$              $1-T_i$              1

where:

$\beta_i^{11}$ = proportion of 1st category of variable B in 1st category of variable A;

$\beta_i^{21}$ = proportion of 1st category of variable B in 2nd category of variable A;

$X_i$ = proportion of 1st category of variable A in the total of its two categories;

$T_i$ = proportion of 1st category of variable B in the total of its two categories.

Variables in Table 2 are defined as proportions, instead of absolute values, because this allows for a direct interpretation of results. For instance, in voting behavior studies, variable A might be race, e.g. blacks and whites, and variable B might be the partisan candidate, e.g. Republican or Democrat; thus, $\beta_i^{11}$ would represent the proportion of blacks, and $\beta_i^{21}$ the proportion of whites, voting for the Republican candidate. In turn, $X_i$ would be the proportion of blacks, and $T_i$ the proportion of people voting for the Republican candidate, both in the total turnout of the voting age population.

The notation in Table 2 is general and applicable to a variety of contexts: in economics, variable A might be levels of family income, and variable B number of goods purchased; in sociology, variable A might be number of crimes by city region, and variable B number of crimes by type; in transport planning, variable A might be number of residents by residential area, and variable B number of jobs by trade area. Thus, EI techniques are suited to a wide range of DDE problems (including some arising in OR applications). EI research is also concerned with developing techniques for R×C tables problems, though the implementation of such techniques is still limited in this respect (King, 1999).

In Table 2, $T_i$ and $X_i$ represent the known aggregate data. Subscript $i$ indexes the tables, or sample units, ranging from 1 to $P$, the number of tables used in the EI analysis. In turn, $\beta_i^{11}$ and $\beta_i^{21}$ represent the unknown disaggregate data and the target of EI problem solving; for this reason, $\beta_i^{11}$ and $\beta_i^{21}$ are called the quantities of interest. Once they become known, the contents of all cells in all tables also become known. Figure 1 illustrates this distinctive feature of EI techniques as compared to others that allow for just one table at a time.

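Figure 1. The use of various tables

[Figure: a stack of $P$ tables of the form of Table 2, indexed $1, \ldots, i, \ldots, P$. Table $i$ has cells $\beta_i^{11}$, $1-\beta_i^{11}$, $\beta_i^{21}$, $1-\beta_i^{21}$, row totals $X_i$ and $1-X_i$, and column totals $T_i$ and $1-T_i$.]

Source: Adapted from Mattos and Veiga (1999).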

The EI problem is solved in such a way that the cells' contents in all tables are determined simultaneously. The proportions appearing outside the 2×2 tables of Figure 1, that is, the pairs $\{(T_i, X_i): i = 1,\ldots,P\}$, are the known aggregate data used to estimate each quantity of interest in each table. Using various tables allows for more observations (each table is a sample unit) and, in addition, for "borrowing strength" from the information in other tables when estimating the cells' values of a particular table. This latter aspect brings efficiency to estimation if all tables have "something in common" (King, 1997). In practice, such a commonality does not always exist, at least not among all the $P$ tables, and King's model admits extensions that allow different mean patterns for the quantities of interest in different tables.

4. Approaches to solve the EI problem

To solve the EI problem, posed in its 2×2 version, a number of approaches have been suggested. Though the issue is old (Achen and Shively, 1995; King, 1997), a large part of the EI approaches were proposed quite recently. Prior to the work of King (1997), the most widely used technique was Goodman Regression (Goodman, 1953). To understand its basic features, consider the following accounting identity from Table 2:

$$T_i = \beta_i^{11} X_i + \beta_i^{21} (1 - X_i) \tag{4.1}$$

Relation (4.1) is a deterministic fact of any contingency or values table, not an assumption. To estimate the quantities of interest, Goodman (1953) assumed they are fixed for all tables, that is, $\{\beta_i^{11} = \Re^{11},\ \beta_i^{21} = \Re^{21} : i = 1,\ldots,P\}$, where $\Re^{11}$ and $\Re^{21}$ are two constants. Any discrepancy between the right- and left-hand sides of (4.1) should then be due to a random error, namely $\varepsilon_i$, and (4.1) may be rewritten as:

$$T_i = \Re^{11} X_i + \Re^{21} (1 - X_i) + \varepsilon_i \tag{4.2}$$

Expression (4.2) is a linear regression model without a constant, estimable by ordinary least squares. Estimates that are constant across sample units, $\hat{\beta}_i^{11} = \hat{\Re}^{11}$ and $\hat{\beta}_i^{21} = \hat{\Re}^{21}$, may be produced for the quantities of interest, and it is immediate to extend Goodman's model to problems with R×C tables. Yet Goodman Regression is a limited technique, first because parameter constancy across different tables is a strong assumption, easily violated in a number of cases (e.g., Cho, 1997 and 1998), and second because no restriction exists on the values of the parameter estimates, which, as proportions, must lie within the [0,1] interval. As a consequence, this approach may produce estimates higher than 100% or negative proportions (cf. King, 1997, p. 16).
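As an illustration of (4.2), the following minimal Python sketch fits Goodman Regression by solving the two normal equations of the no-constant least squares problem directly. The function name `goodman_regression` and the toy data are assumptions made for this example; they do not come from Goodman (1953) or King (1997).

```python
def goodman_regression(T, X):
    """Least squares fit of T_i = R11*X_i + R21*(1 - X_i) + e_i (no constant).

    T, X: sequences of aggregate proportions, one pair per table.
    Returns the constant estimates (R11_hat, R21_hat).
    """
    W = [1.0 - x for x in X]                      # second regressor, 1 - X_i
    sxx = sum(x * x for x in X)
    sww = sum(w * w for w in W)
    sxw = sum(x * w for x, w in zip(X, W))
    sxt = sum(x * t for x, t in zip(X, T))
    swt = sum(w * t for w, t in zip(W, T))
    det = sxx * sww - sxw * sxw                   # determinant of the 2x2 normal equations
    if abs(det) < 1e-12:
        raise ValueError("X_i must vary across tables for the fit to be identified")
    r11 = (sxt * sww - sxw * swt) / det
    r21 = (sxx * swt - sxw * sxt) / det
    return r11, r21


# Hypothetical aggregate data: every T_i is a valid proportion.
X = [0.2, 0.4, 0.6, 0.8]
T = [0.10, 0.35, 0.60, 0.85]
r11_hat, r21_hat = goodman_regression(T, X)
```

With these illustrative data the fit is exact and gives estimates of about 1.10 and -0.15, that is, a proportion above 100% and a negative one, which is the kind of inadmissible result mentioned above.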

In an innovative work, King (1997) introduced a method for EI that overcomes these limitations of Goodman Regression. His method allows the quantities of interest to vary across tables, never produces estimates outside the [0,1] interval, and incorporates the method of deterministic bounds of Duncan and Davis (1953). King's method also admits extensions that allow for the effects of explanatory variables, and it uses modern intensive computing resources to estimate parameters and disaggregate data reliably. His work had a strong impact on research, reviving interest in EI and prompting a number of studies, some applying and testing his method, others presenting new approaches (Cho, 1997 and 1998; Rivers and Cho, 1997; King, Rosen and Tanner, 1998; Penurbati and Shuessler, 1998; Rivers, 1998; and Lewis, 1998).

Critical works have also been written (Freedman et al., 1998; Cho, 1998; Freedman et al., 1999; and Ferree, 1999). The major limitations pointed out in King's model concern, first, its diagnostics, since formal statistical tests for checking goodness of fit and distributional assumptions are absent, while the graphical diagnostics suggested may be misleading (Cho, 1998); and second, its restriction to 2×2 tables, lacking an implementation for the R×C tables case. Actually, King (1997, Chapter 15) generalized his method to R×C tables, but did not implement it.

King, Rosen and Tanner (1998), KRT for short, presented a more sophisticated approach: the hierarchical binomial-beta model. KRT claim it can address a variety of EI problems, being more flexible than King's model. First, because a wide range of shapes is allowed for the posterior distributions, an attribute of the beta distribution, which makes KRT's model capable of recognizing data generated under the distributional assumptions of King's model and, thus, of providing "data analytic checks" (KRT, 1998, p. 1) for the latter. Second, because the binomial-beta model can be generalized to a number of EI problem types, including R×C tables, owing to the use of modern Markov Chain Monte Carlo (MCMC) methods (Gibbs sampling and nested Metropolis-Hastings algorithms) in the estimation process. The authors warn, however, that this increased flexibility comes at the cost of increased computation, as MCMC methods are intensive consumers of computer time. Moreover, generalizations to other situations bring new research challenges, and KRT conclude that their work opens important paths for future EI research.

In this paper, we focus on the EI method of King (1997), referring the reader to KRT (1998) for more details on the hierarchical binomial-beta model for EI. We have chosen to present and discuss King's method because, as stated in the Introduction, it is the object of research on an alternative estimation approach, which is presented in Section 6.

5. King’s EI normal model

The EI normal model proposed by King (1997) is so called here because it is based on an assumption of bivariate normality. It was designed to solve the simplest 2×2 EI problem of Table 2. Its first feature is the use of two deterministic facts of that problem:

a) the accounting identity $T_i = \beta_i^{11} X_i + \beta_i^{21} (1 - X_i)$, and

b) the method of bounds, introduced by Duncan and Davis (1953).

The importance of the method of bounds is that, given the aggregate data, the quantities of interest may belong to narrower intervals than [0,1], which reduces the uncertainty in their estimation. For more details and examples, see Achen and Shively (1995, pp. 190-193), King (1997) and Mattos and Veiga (1999). Using $L$ and $U$ to denote the lower and upper bounds, respectively, of a quantity of interest, the method of bounds is deterministic because, given the aggregate data $T_i$ and $X_i$, it is always true that $\beta_i^{11} \in [L_i^{11}, U_i^{11}]$ and $\beta_i^{21} \in [L_i^{21}, U_i^{21}]$, where [2]:

$$
\begin{aligned}
L_i^{11} &= \max\left(0,\ \frac{T_i - (1 - X_i)}{X_i}\right), &
U_i^{11} &= \min\left(\frac{T_i}{X_i},\ 1\right), \\
L_i^{21} &= \max\left(0,\ \frac{T_i - X_i}{1 - X_i}\right), &
U_i^{21} &= \min\left(\frac{T_i}{1 - X_i},\ 1\right).
\end{aligned}
\tag{5.1}
$$
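The bounds in (5.1) are simple to compute for any table. The following is a minimal Python sketch; the function name `duncan_davis_bounds` and the numerical values are illustrative assumptions, not part of the original method description.

```python
def duncan_davis_bounds(T, X):
    """Deterministic bounds (5.1) for the two quantities of interest of one table.

    T: proportion in the 1st category of variable B (0 <= T <= 1).
    X: proportion in the 1st category of variable A (0 < X < 1).
    Returns ((L11, U11), (L21, U21)).
    """
    if not 0.0 < X < 1.0:
        raise ValueError("X must lie strictly between 0 and 1")
    if not 0.0 <= T <= 1.0:
        raise ValueError("T must lie between 0 and 1")
    l11 = max(0.0, (T - (1.0 - X)) / X)
    u11 = min(T / X, 1.0)
    l21 = max(0.0, (T - X) / (1.0 - X))
    u21 = min(T / (1.0 - X), 1.0)
    return (l11, u11), (l21, u21)


# Hypothetical table: 80% of units in category 1 of A, 30% in category 1 of B.
bounds_11, bounds_21 = duncan_davis_bounds(T=0.3, X=0.8)
```

For this hypothetical table the first quantity of interest is confined to roughly [0.125, 0.375], while the second remains anywhere in [0, 1]: the aggregate data are informative about the large group and uninformative about the small one.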

A second feature of the EI normal model consists of the following assumptions:

1. The $X_i$ ($i = 1,\ldots,P$) variables are fixed, that is, they are not random variables;

2. The quantities of interest follow a bivariate normal distribution truncated on the unit square $[0,1]\times[0,1] \subset \mathbb{R}^2$, as:

$$
\begin{pmatrix} \beta_i^{11} \\ \beta_i^{21} \end{pmatrix}
\sim TN_{[0,1]\times[0,1]}\left(
\begin{pmatrix} \breve{\Re}^{11} \\ \breve{\Re}^{21} \end{pmatrix};
\begin{pmatrix}
\breve{\sigma}_{11}^{2} & \breve{\rho}\,\breve{\sigma}_{11}\breve{\sigma}_{21} \\
\breve{\rho}\,\breve{\sigma}_{11}\breve{\sigma}_{21} & \breve{\sigma}_{21}^{2}
\end{pmatrix}
\right), \quad i = 1,\ldots,P \tag{5.2}
$$

(The breve symbol "˘" over the parameters in (5.2) is used to indicate that they are from an untruncated distribution which was truncated to produce (5.2));

3. The quantities of interest are mean independent of the regressors:

$$\beta_i^{11} = \Re^{11} + \varepsilon_i^{11} \tag{5.3}$$

$$\beta_i^{21} = \Re^{21} + \varepsilon_i^{21} \tag{5.4}$$

This means that $\beta_i^{11}$ and $\beta_i^{21}$ are random variables with constant means, which are thus independent of $X_i$ and $1 - X_i$; the notations $\Re^{11}$ in (5.3) and $\Re^{21}$ in (5.4) carry no breve symbol to indicate that they are the actual truncated means;

4. The conditional random variable $T_i \mid X_i$ is independent across different tables or sample units (this is usually called the spatial independence assumption).

[2] The formulas for the accounting identity in a) and for the method of bounds in (5.1) can be generalized to R×C tables (King, 1997, Chapter 15).

From the facts and assumptions above, King (1997) builds his EI method in two stages. First, he estimates the "untruncated" parameters of the truncated bivariate normal in (5.2), that is, the elements of the following vector:

$$\breve{\psi} = [\breve{\Re}^{11}, \breve{\Re}^{21}, \breve{\sigma}_{11}^{2}, \breve{\sigma}_{21}^{2}, \breve{\rho}]^{T} \tag{5.5}$$

Second, he uses the parameter estimates to simulate the marginal posterior distributions of $\beta_i^{11}$ and $\beta_i^{21}$, given the aggregate data $T$, and uses the resulting simulated values to compute estimates of these quantities of interest and of their associated standard errors. These stages are detailed next.

1st Stage: Estimating parameters of the truncated bivariate normal

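Assumption 4 of the EI normal model allows the likelihood function to be written as:

$$P(T \mid \breve{\psi}) = \prod_{i=1}^{P} P(T_i \mid \breve{\psi}) \tag{5.6}$$

In (5.6), the vector $T = [T_1, \ldots, T_P]$ is the given sample of aggregate data and $P(T_i \mid \breve{\psi}) = TN(\mu_i, \sigma_i^2)$ represents the truncated normal distribution of each $T_i$, where

$$\mu_i = \breve{\Re}^{11} X_i + \breve{\Re}^{21} (1 - X_i) \quad \text{and} \quad \sigma_i^2 = \breve{\sigma}_{11}^{2} X_i^2 + \breve{\sigma}_{21}^{2} (1 - X_i)^2 + 2 \breve{\rho}\,\breve{\sigma}_{11}\breve{\sigma}_{21} X_i (1 - X_i)$$

are, respectively, the mean and the variance of the untruncated normal version of $P(T_i \mid \breve{\psi})$. Actually, King does not use the likelihood formulation in (5.6), but a reparameterized one based on a one-to-one and onto transformation of the parameter vector, given by $\phi = t(\breve{\psi})$, where $\phi = [\phi_1, \phi_2, \phi_3, \phi_4, \phi_5]^{T}$ is such that:

$$\phi_1 = \frac{\breve{\Re}^{11} - 0.5}{\breve{\sigma}_{11}^{2} + 0.25} \tag{5.7}$$

$$\phi_2 = \frac{\breve{\Re}^{21} - 0.5}{\breve{\sigma}_{21}^{2} + 0.25} \tag{5.8}$$

$$\phi_3 = \ln \breve{\sigma}_{11} \tag{5.9}$$

$$\phi_4 = \ln \breve{\sigma}_{21} \tag{5.10}$$

$$\phi_5 = 0.5 \ln\left(\frac{1 + \breve{\rho}}{1 - \breve{\rho}}\right) \tag{5.11}$$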
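The transformation in (5.7)-(5.11) and its inverse can be sketched in a few lines of Python. This is only an illustration: the function names `to_phi` and `to_psi` are assumptions, and the functions take the untruncated standard deviations as inputs, squaring them where (5.7) and (5.8) require variances.

```python
import math


def to_phi(r11, r21, s11, s21, rho):
    """Map the untruncated parameters to phi, following (5.7)-(5.11).

    r11, r21: untruncated means; s11, s21: untruncated standard deviations (> 0);
    rho: untruncated correlation (-1 < rho < 1).
    """
    phi1 = (r11 - 0.5) / (s11 ** 2 + 0.25)
    phi2 = (r21 - 0.5) / (s21 ** 2 + 0.25)
    phi3 = math.log(s11)
    phi4 = math.log(s21)
    phi5 = 0.5 * math.log((1.0 + rho) / (1.0 - rho))
    return phi1, phi2, phi3, phi4, phi5


def to_psi(phi1, phi2, phi3, phi4, phi5):
    """Inverse map: recover (r11, r21, s11, s21, rho) from phi."""
    s11 = math.exp(phi3)
    s21 = math.exp(phi4)
    r11 = phi1 * (s11 ** 2 + 0.25) + 0.5
    r21 = phi2 * (s21 ** 2 + 0.25) + 0.5
    rho = math.tanh(phi5)          # inverse of 0.5 * ln((1 + rho) / (1 - rho))
    return r11, r21, s11, s21, rho


# Round trip on hypothetical parameter values.
psi = (0.6, 0.4, 0.2, 0.3, 0.5)
phi = to_phi(*psi)
psi_back = to_psi(*phi)
```

One consequence visible in the sketch is that every component of $\phi$ can take any real value, so an optimizer working on $\phi$ never has to enforce positivity of the standard deviations or the (-1, 1) range of the correlation.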

This reparameterization has three purposes: to facilitate numerical optimization; to reduce the correlation between means and variances (i.e., between $\breve{\Re}^{11}$ and $\breve{\sigma}_{11}^{2}$, and between $\breve{\Re}^{21}$ and $\breve{\sigma}_{21}^{2}$); and to allow for a better normal approximation of the likelihood (or posterior) function. The reparameterized likelihood is written:

$$P(T \mid \phi) = \prod_{i=1}^{P} P(T_i \mid \phi) \tag{5.12}$$

Under a Bayesian approach, where priors are used for the parameters in (5.5), we should work with the posterior function:

$$P(\phi \mid T) \propto P(\phi)\, P(T \mid \phi) \tag{5.13}$$

where $P(\phi)$ is the prior distribution for $\phi$. The parameter estimates are obtained by maximizing (5.12) or (5.13) with respect to $\phi$, thus finding:

$$\hat{\phi} = \arg\max_{\phi}\left[P(\phi \mid T)\right] \tag{5.14}$$

King implements this maximization via the Constrained Maximum Likelihood (CML) module, developed by Schoenberg (1995) and available for use with the GAUSS programming language (Aptech Systems, Inc.).

2nd Stage: Estimating the quantities of interest by simulation

The CML/GAUSS module also produces an estimate, namely $V(\hat{\phi})$, of the variance-covariance matrix of the parameters (cf. Schoenberg, 1995). If $P$ is large, meaning that aggregate data are plentiful, then a normal approximation can be adopted for the likelihood or posterior function (Tanner, 1996):

$$\phi \mid T \sim N\left(\hat{\phi};\ V(\hat{\phi})\right) \tag{5.15}$$

The next step involves computing the posterior distributions of the quantities of interest. Considering the case of $\beta_i^{11}$ first, if $\phi$ were known we would just need to compute its marginal posterior $P(\beta_i^{11} \mid T)$; in that case, it would be equivalent to $P(\beta_i^{11} \mid T, \phi)$. However, by (5.15), $\phi$ is a random vector and thus $P(\beta_i^{11} \mid T)$ should be obtained by averaging (integrating) $P(\beta_i^{11}, \phi \mid T)$ over the uncertainty in $\phi$, that is:

$$P(\beta_i^{11} \mid T) \propto \int_{\Theta} P(\beta_i^{11}, \phi \mid T)\, d\phi \tag{5.16}$$

where $\Theta$ is the parameter space for $\phi$. However, solving the multiple (5-tuple) integral in (5.16) analytically is too complex a task. Alternatively, King obtains (5.16) by Monte Carlo simulation: for each $i$-th table, a large number of values of $\beta_i^{11}$ and $\beta_i^{21}$ are randomly drawn and used to summarize their respective posteriors. The overall process is briefly described in Box 1. The reader should note that step 3 in this Box uses the marginal distribution of $\beta_i^{11}$ conditioned on $T_i$ and the parameters, that is:

$$P(\beta_i^{11} \mid T_i, \breve{\psi}) = TN\left(\breve{\Re}^{11} + \frac{\omega_i}{\sigma_i^2}\,\varepsilon_i;\ \breve{\sigma}_{11}^{2} - \frac{\omega_i^2}{\sigma_i^2}\right) \tag{5.17}$$

where $\omega_i = \breve{\sigma}_{11}^{2} X_i + \breve{\rho}\,\breve{\sigma}_{11}\breve{\sigma}_{21} (1 - X_i)$ and $\varepsilon_i = T_i - \breve{\Re}^{11} X_i - \breve{\Re}^{21} (1 - X_i)$.

Box 1. Estimating the disaggregate data by simulation

1. From the multivariate normal $\phi \mid T$ in (5.15), generate a random value $\tilde{\phi}$;

2. Reparameterize the simulated value $\tilde{\phi}$ back to the original parameterization using $\tilde{\breve{\psi}} = t^{-1}(\tilde{\phi})$, the inverse transformation of equations (5.7)-(5.11); then use importance sampling to improve the normal approximation (producing a new $\tilde{\breve{\psi}}$);

3. Use $\tilde{\breve{\psi}}$ in (5.17) and draw from $P(\beta_i^{11} \mid T_i, \tilde{\breve{\psi}})$ a random value $\tilde{\beta}_i^{11}$; if it happens to fall outside $[L_i^{11}, U_i^{11}]$ (see (5.1)), repeat this step until $\tilde{\beta}_i^{11} \in [L_i^{11}, U_i^{11}]$;

4. Compute a simulated value for the second quantity of interest, $\tilde{\beta}_i^{21}$, using the accounting identity (4.1) rearranged as follows:

$$\tilde{\beta}_i^{21} = \frac{T_i}{1 - X_i} - \frac{X_i}{1 - X_i}\,\tilde{\beta}_i^{11} \tag{B1}$$

Note: a) this relation between $\tilde{\beta}_i^{21}$ and $\tilde{\beta}_i^{11}$ is a fixed one when the aggregates $T_i$ and $X_i$ are given; and b) $\tilde{\beta}_i^{21}$ is generated from the simulated $\tilde{\beta}_i^{11}$ and not from $P(\beta_i^{21} \mid T_i, \breve{\psi})$, which is done to preserve model consistency;

5. Repeat steps 1-4 a large number of times until truncated simulated distributions for $P(\beta_i^{11} \mid T_i)$ and $P(\beta_i^{21} \mid T_i)$ are obtained with the desired degree of precision;

6. Repeat steps 1-5 for each table or sample unit (i.e., for $i = 1,\ldots,P$).
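Steps 3 and 4 of Box 1 can be sketched in Python as below. This is a simplified illustration under stated assumptions, not King's implementation: the parameter vector is held fixed at given values (so steps 1 and 2, including the importance sampling, are skipped), truncation is obtained by rejecting draws from the untruncated normal with the moments in (5.17), and the function name `simulate_table`, its arguments and the numbers used are all hypothetical.

```python
import math
import random


def simulate_table(T, X, r11, r21, s11, s21, rho, K=1000, seed=0, max_tries=100_000):
    """Draw K pairs (beta11, beta21) for one table with the parameters held fixed.

    r11, r21: untruncated means; s11, s21: untruncated standard deviations;
    rho: untruncated correlation. Requires 0 < X < 1 and a positive variance of T_i.
    """
    if not 0.0 < X < 1.0:
        raise ValueError("X must lie strictly between 0 and 1")
    rng = random.Random(seed)

    # Untruncated variance of T_i, plus omega_i and epsilon_i as defined for (5.17).
    var_t = (s11 * X) ** 2 + (s21 * (1 - X)) ** 2 + 2 * rho * s11 * s21 * X * (1 - X)
    omega = s11 ** 2 * X + rho * s11 * s21 * (1 - X)
    eps = T - r11 * X - r21 * (1 - X)

    # Conditional mean and standard deviation of beta11 given T_i.
    cond_mean = r11 + omega / var_t * eps
    cond_sd = math.sqrt(max(s11 ** 2 - omega ** 2 / var_t, 0.0))

    # Deterministic bounds from (5.1).
    lower = max(0.0, (T - (1 - X)) / X)
    upper = min(T / X, 1.0)

    draws = []
    for _ in range(K):
        for _attempt in range(max_tries):
            beta11 = rng.gauss(cond_mean, cond_sd)      # step 3: candidate draw
            if lower <= beta11 <= upper:                # keep it only inside the bounds
                break
        else:
            raise RuntimeError("no draw fell inside the bounds")
        beta21 = T / (1 - X) - X / (1 - X) * beta11     # step 4: identity (B1)
        draws.append((beta11, beta21))
    return draws


# Hypothetical table and parameter values.
draws = simulate_table(T=0.3, X=0.8, r11=0.3, r21=0.3, s11=0.1, s21=0.1, rho=0.0, K=200)
```

Because the second value is computed from the first through (B1), every simulated pair satisfies the accounting identity (4.1) exactly, and keeping the first value inside its bounds automatically keeps the second inside [0, 1].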

Figure 2 displays the types of truncated marginal posteriors that may result from such an exercise. Graph (a) shows a situation where truncation is not operative; graph (b) shows a posterior truncated on just one side; and graph (c) shows a posterior truncated on both sides; truncation at L = 0 and U = 1 may also happen.

Figure 2. Examples of marginal posteriors simulated for the quantities of interest

[Figure: three panels, (a), (b) and (c).]

The 2nd stage ends when the means and standard deviations of the marginal posteriors of $\beta_i^{11}$ and $\beta_i^{21}$, respectively, are computed for all $P$ tables, as follows:

$$\hat{\beta}_i^{11} = \frac{1}{K} \sum_{k=1}^{K} \tilde{\beta}_i^{11(k)}, \qquad SE(\hat{\beta}_i^{11}) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left(\tilde{\beta}_i^{11(k)} - \hat{\beta}_i^{11}\right)^2} \tag{5.18}$$

$$\hat{\beta}_i^{21} = \frac{1}{K} \sum_{k=1}^{K} \tilde{\beta}_i^{21(k)}, \qquad SE(\hat{\beta}_i^{21}) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left(\tilde{\beta}_i^{21(k)} - \hat{\beta}_i^{21}\right)^2} \tag{5.19}$$

where $K$ is the number of simulated values for the quantities of interest in each table. Computing confidence intervals at any desired level for $\beta_i^{11}$ and $\beta_i^{21}$ ($i = 1,\ldots,P$) using the expressions in (5.18) and (5.19) is straightforward.
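The summaries in (5.18) and (5.19) amount to a mean and a standard deviation over the simulated values. A minimal Python sketch follows; the function name `summarize_draws` is an assumption, and the interval it returns is a normal-approximation interval (estimate plus or minus a normal quantile times the standard error), which is only one possible reading of the confidence intervals mentioned above.

```python
import math
from statistics import NormalDist


def summarize_draws(draws, level=0.95):
    """Point estimate, standard error and normal-approximation interval.

    draws: list of simulated values of one quantity of interest for one table.
    """
    K = len(draws)
    estimate = sum(draws) / K
    se = math.sqrt(sum((d - estimate) ** 2 for d in draws) / K)   # divisor K
    z = NormalDist().inv_cdf(0.5 + level / 2.0)                   # e.g. about 1.96 for 95%
    return estimate, se, (estimate - z * se, estimate + z * se)


# Hypothetical simulated values for one table.
estimate, se, interval = summarize_draws([0.2, 0.3, 0.4])
```

When the simulated posterior is strongly truncated, as in graphs (b) and (c) of Figure 2, a symmetric interval of this kind may extend beyond the deterministic bounds, so quantiles of the simulated values themselves may be a safer summary in such cases.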

The EI normal model admits an extension where means’ patterns are allowed tovary along the sample unit tables. This extension is introduced in model structure by aslight modification of the means’ assumption 3. Instead of remaining constants, the"untruncated" mean values 11ℜ

(

and 21ℜ(

become functions of vectors of explanatoryvariables 11

iZ and 21iZ , respectively, as:

( )[ ] ( ) 1111112111

11 5.025.0 ασφ T

i ZZ −+++=ℜ (

(

(5.20)

( )[ ] ( ) 2121212212

21 5.025.0 ασφ T

i ZZ −+++=ℜ (

(

(5.21)

These two expressions are derived by inverting (5.7) and (5.8) for $\breve{\mathfrak{R}}^{11}$ and $\breve{\mathfrak{R}}^{21}$, and adding to the right-hand side of each resulting expression the terms $(Z_i^{11} - \bar{Z}^{11})^T\alpha^{11}$ and $(Z_i^{21} - \bar{Z}^{21})^T\alpha^{21}$, respectively. Vectors $\alpha^{11}$ and $\alpha^{21}$ are parameter vectors whose elements are the coefficients of the explanatory variables in $Z_i^{11}$ and $Z_i^{21}$.

The vectors $\bar{Z}^{11}$ and $\bar{Z}^{21}$ contain the respective means of these explanatory variables. According to King (1997, p. 170), the explanatory variables enter as deviations from their means so as not to change the interpretation of the original parameters and so that the vectors $Z_i^{11}$ and $Z_i^{21}$ need not include a constant term. Vectors $\alpha^{11}$ and $\alpha^{21}$ are jointly estimated with the other parameters of the reparameterized vector $\phi$. To this end, the posterior in (5.13) must be rewritten as:

$$P(\phi, \alpha^{11}, \alpha^{21} \mid T) \propto P(\phi, \alpha^{11}, \alpha^{21})\, P(T \mid \phi, \alpha^{11}, \alpha^{21}) \tag{5.22}$$

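and then (by using CML/GAUSS) the following estimates are obtained:

$$[\hat{\phi}, \hat{\alpha}^{11}, \hat{\alpha}^{21}] = \arg\max_{\phi,\,\alpha^{11},\,\alpha^{21}} P(\phi, \alpha^{11}, \alpha^{21} \mid T) \tag{5.23}$$

$$V[\hat{\phi}, \hat{\alpha}^{11}, \hat{\alpha}^{21}] = \widehat{Cov}[\hat{\phi}, \hat{\alpha}^{11}, \hat{\alpha}^{21}] \tag{5.24}$$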

The procedures for estimating the disaggregate data remain the same as those adopted for the case with no explanatory variables. The only difference is that the means of the quantities of interest are now given by (5.20) and (5.21), and their estimates may follow different patterns because of the explanatory variables’ effects.
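The table-specific means in (5.20) and (5.21) can be sketched as a small Python function. The function name `untruncated_mean` and all numerical values are hypothetical, chosen only to show how the deviation-from-mean term shifts the baseline value; this is not code from CML/GAUSS or from King's software.

```python
def untruncated_mean(phi, sigma2, z_i, z_bar, alpha):
    """Untruncated mean for one table, following the form of (5.20)/(5.21)."""
    # Baseline term: the value obtained with no explanatory variables.
    base = phi * (sigma2 + 0.25) + 0.5
    # Covariates enter as deviations from their means, weighted by alpha.
    shift = sum((z - zb) * a for z, zb, a in zip(z_i, z_bar, alpha))
    return base + shift

# Hypothetical inputs: two explanatory variables for one table.
z_bar = [1.0, 2.0]
alpha = [0.3, -0.2]
at_mean = untruncated_mean(0.1, 0.25, [1.0, 2.0], z_bar, alpha)
shifted = untruncated_mean(0.1, 0.25, [2.0, 2.0], z_bar, alpha)
```

When a table's covariates equal their means, the shift vanishes and the original interpretation of the parameters is preserved, which is the point of centering noted above.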

6. An approach based on the EM Algorithm

We now present a different approach to estimating the EI normal model. It is based on a technique called the Expectation Maximization Algorithm, EM Algorithm for short, which is an alternative to the optimization techniques used for maximizing the likelihood in (5.12) or the posteriors in (5.13) and (5.22). Three reasons for using the EM Algorithm for EI are:

a) At the same time as it estimates the model’s parameters, it also estimates the quantities of interest;

b) It can be naturally extended to cope with R×C tables problems; and

c) For certain cases, it can be faster.

The EM Algorithm will be described here in brief (references with detailed material are given ahead). Its application to King’s EI normal model is research being developed by the author of this paper as part of his Ph.D. thesis (see more at the web address http://www3.ufjf.br/~rmattos). The research findings presented here are preliminary, as they are still restricted to the 2×2 tables case without allowance for the effects of explanatory variables. Their only purpose is to illustrate the possibilities of the EM Algorithm technique in the context of EI.

Key concepts of the EM Algorithm formalism are those of complete and incomplete data. In various statistical applications, we might be concerned with a given variable for which observations are not available or measurements are impossible. This covers a variety of situations, like a time series with missing observations in some periods, a survey database with unanswered items, or a set of values of which only the sum is known. The complete data is the sample information we should have to estimate a model’s parameters, but which we cannot actually observe. The incomplete data is an associated sample information that we can observe but whose information content for parameter estimation is lower than that of the complete data.

The two forms of the data are related to each other by a many-to-one mapping: that is, each sample of incomplete data is associated with many possible samples of complete data. In our EI problem, for instance, we observe a sample of aggregate data (row and column totals for a number of tables), but, in a strict sense, we should have the disaggregate data (the contents of the table cells) to estimate the model parameters by traditional methods. We know there may be many, indeed infinitely many, samples of unobserved disaggregate data consistent with our (unique) sample of observed aggregate data.
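The many-to-one relation can be made concrete for a single 2×2 table with the accounting identity T = Xβ¹¹ + (1 − X)β²¹. The Python sketch below lists several disaggregate pairs that reproduce the same aggregate value. The function names and the values T = 0.5, X = 0.4 are hypothetical; the bounds used are the ones implied by the identity together with the requirement that both proportions lie in [0,1], and the code assumes 0 < X < 1.

```python
def aggregate(x, beta11, beta21):
    """Accounting identity for one table: T = X*beta11 + (1 - X)*beta21."""
    return x * beta11 + (1.0 - x) * beta21

def consistent_pairs(t, x, n=5):
    """A few (beta11, beta21) pairs in the unit square that yield the same T."""
    lo = max(0.0, (t - (1.0 - x)) / x)  # smallest admissible beta11
    hi = min(1.0, t / x)                # largest admissible beta11
    pairs = []
    for j in range(n):
        b11 = lo + (hi - lo) * j / (n - 1)
        # beta21 is fixed by the identity once beta11 is chosen.
        b21 = (t - x * b11) / (1.0 - x)
        pairs.append((b11, b21))
    return pairs

# Hypothetical aggregate data for one table.
pairs = consistent_pairs(0.5, 0.4)
```

Every pair returned lies on the same line segment inside the unit square, so the observed aggregate alone cannot distinguish among them.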

Formally speaking, suppose we are interested in a continuous (or a discrete) random variable X, generated by a probabilistic model $f(x|\theta)$, where f is a known density, x an observation from X, and $\theta$ a vector of unknown parameters. Suppose also that we cannot observe X. However, we can write its likelihood function $f(\mathbf{x}|\theta)$, where $\mathbf{x} = [x_1, \ldots, x_P]^T$ is a (non-observable) random sample from X. The vector $\mathbf{x}$ is called the complete data, and $f(\mathbf{x}|\theta)$ the likelihood of the complete data (LCD). However, assume we observe a sample vector $\mathbf{y} = [y_1, \ldots, y_Q]$, which is deterministically related to $\mathbf{x}$, say: $\mathbf{y} = h(\mathbf{x})$, such that Q < P. The vector $\mathbf{y}$ is called the incomplete data. Now, let $\mathcal{X}$ be the sample space for $\mathbf{x}$, and $\mathcal{Y}$ the sample space for $\mathbf{y}$. Since Q < P, $h: \mathcal{X} \to \mathcal{Y}$ is a many-to-one mapping relating $\mathcal{X}$ to $\mathcal{Y}$. A single point $\mathbf{y} \in \mathcal{Y}$ has associated to it a subset of $\mathcal{X}$, namely $\mathcal{X}(\mathbf{y})$, the inverse image of $\mathbf{y}$ under h, containing many points $\mathbf{x} \in \mathcal{X}$. Then, we can write the likelihood of the incomplete data (LID) as:

$$g(\mathbf{y}|\theta) = \int_{\mathcal{X}(\mathbf{y})} f(\mathbf{x}|\theta)\, d\mathbf{x} \tag{6.1}$$

How does the EM Algorithm work? It finds estimates of $\theta$ using the incomplete data $\mathbf{y}$ and the form of the LCD to indirectly maximize the LID. This is not done at once, but in a sequence of iterations, where, in each iteration, two stages are performed: the Expectation Stage (E-Stage) and the Maximization Stage (M-Stage). The general EM scheme is as follows:

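S1. Assume a guess for $\theta$, say $\theta^k$;

S2. E-Stage: Given a guess $\theta^k$, compute:

$$Q(\theta, \theta^k) = E\left[\log f(\mathbf{x}|\theta) \mid \mathbf{y}, \theta^k\right] = \int_{\mathcal{X}(\mathbf{y})} \log f(\mathbf{x}|\theta) \times f(\mathbf{x}|\mathbf{y}, \theta^k)\, d\mathbf{x} \tag{6.2}$$

S3. M-Stage: Maximize $Q(\theta, \theta^k)$ for $\theta$, finding:

$$\hat{\theta} = \arg\max_{\theta} Q(\theta, \theta^k) \tag{6.3}$$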


S4. Set $\theta^{k+1} = \hat{\theta}$;

S5. Repeat steps S1−S4 a number of times until convergence.
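The scheme S1−S5 can be sketched in Python on a small textbook multinomial problem, which is not the EI model and is used here only as an assumption-laden illustration. Four observed counts have cell probabilities (1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4); the first observed cell is the sum of two unobserved cells with probabilities 1/2 and θ/4, so the observed counts are incomplete data related to the complete data by a many-to-one mapping. The counts, function names and starting value are choices made for the example.

```python
import math

# Observed (incomplete) counts; the first is the sum of two unobserved cells.
Y = (125, 18, 20, 34)

def log_lid(theta):
    """Log-likelihood of the incomplete data, up to an additive constant."""
    y1, y2, y3, y4 = Y
    return (y1 * math.log(0.5 + theta / 4.0)
            + (y2 + y3) * math.log((1.0 - theta) / 4.0)
            + y4 * math.log(theta / 4.0))

def em(theta=0.5, tol=1e-10, max_iter=200):
    """Run S1-S5 from the initial guess `theta` (S1)."""
    y1, y2, y3, y4 = Y
    path = [log_lid(theta)]
    for _ in range(max_iter):
        # S2, E-Stage: expected count of the unobserved cell with prob. theta/4.
        x2 = y1 * (theta / 4.0) / (0.5 + theta / 4.0)
        # S3, M-Stage: closed-form maximizer of Q for the complete-data model.
        new_theta = (x2 + y4) / (x2 + y2 + y3 + y4)
        path.append(log_lid(new_theta))
        converged = abs(new_theta - theta) < tol
        # S4: the maximizer becomes the next guess.
        theta = new_theta
        # S5: stop once successive guesses agree.
        if converged:
            break
    return theta, x2, path

theta_hat, x2_hat, lid_path = em()
```

Besides the parameter estimate, the last E-Stage returns an estimate of the unobserved count (`x2_hat`), and `lid_path` records the incomplete-data log-likelihood at each iteration, so that its evolution can be inspected.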

The EM Algorithm scheme S1−S5 has good convergence properties: for instance, the LID never decreases from one iteration to the next, and, under general conditions, the sequence will always converge to a stationary point. For a systematic account of the EM Algorithm theory and various convergence theorems, see the book by McLachlan and Krishnan (1997). The EM Algorithm is appealing for EI because it also produces estimates of the complete data. According to S1−S5, this is done in every E-Stage performed in every iteration. Only the complete data estimates from the last iteration, after convergence has been achieved, are of interest. This property opens a path to using the EM Algorithm for EI. For instance, to put the EI normal model (2×2 tables case) from King (1997) under the EM formalism, we may assume the following:

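a) Complete data = disaggregate data: $\beta = vec(\beta^{11}, \beta^{21})$;

b) Incomplete data = aggregate data: $T = [T_1, \ldots, T_P]^T$;

c) Parameter vector: $\breve{\psi} = [\breve{\mathfrak{R}}^{11}, \breve{\mathfrak{R}}^{21}, \breve{\sigma}_{11}^2, \breve{\sigma}_{21}^2, \breve{\rho}]^T$ or $\phi = [\phi_1, \phi_2, \phi_3, \phi_4, \phi_5]^T$;

d) Many-to-one mapping = accounting identity: $T = \mathbf{X}\beta^{11} + [I_P - \mathbf{X}]\beta^{21}$;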

where $\beta^{11} = [\beta_1^{11}, \ldots, \beta_P^{11}]^T$, $\beta^{21} = [\beta_1^{21}, \ldots, \beta_P^{21}]^T$, $I_P$ is the P×P identity matrix, and $\mathbf{X} = diag([X_1, \ldots, X_P]^T)$. Note that the many-to-one mapping is given by the accounting identity (4.1), presented in a generalized form in d). Now, if we let $B$ be the sample space for $\beta$ and $\Gamma$ the sample space for $T$, then $B(T) \subset B$ is the inverse image of the point $T$ under the mapping $h: B \to \Gamma$. The LID, the LCD and the Q−Function can be defined as:

e) LCD: $P(\beta \mid \breve{\psi})$;

f) LID: $P(T \mid \breve{\psi}) = \int_{B(T)} P(\beta \mid \breve{\psi})\, d\beta$;

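g) Q-Function:

$$Q(\breve{\psi}, \breve{\psi}^k) = E_{\beta}\left[\log P(\beta \mid \breve{\psi}) \mid T, \breve{\psi}^k\right] = \int_{B(T)} \log\left[P(\beta \mid \breve{\psi})\right] \times P(\beta \mid T, \breve{\psi}^k)\, d\beta \tag{6.4}$$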

To make the methodology operational, we must know the forms of the LCD, of the conditional (on T) LCD, and of the LID. The latter we already know from (5.6). The former two are given by:

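− LCD:

$$P(\beta \mid \breve{\psi}) = \prod_{i=1}^{P} P(\beta_i^{11}, \beta_i^{21} \mid \breve{\psi}) = \prod_{i=1}^{P} \frac{N_2(\beta_i^{11}, \beta_i^{21} \mid \breve{\psi})}{R(\breve{\psi})}, \qquad \beta \in B(T) \tag{6.5}$$

− Conditional LCD:

$$P(\beta \mid T, \breve{\psi}) = \frac{P(\beta \mid \breve{\psi})}{P(T \mid \breve{\psi})} = \prod_{i=1}^{P} \frac{N_2(\beta_i^{11}, \beta_i^{21} \mid \breve{\psi})}{N(T_i \mid \breve{\psi})\, S_i(\breve{\psi})}, \qquad \beta \in B(T) \tag{6.6}$$

$N(\cdot \mid \breve{\psi})$ indicates the normal distribution and $S_i(\breve{\psi})$ the truncation factor of the marginal posterior for $\beta_i^{11}$.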

Given definitions a)−g) and (6.4)−(6.6), we are able to use the EM Algorithm to estimate the EI normal model: we just have to follow steps S1−S5. Let us call it the EM−EI methodology. A major difficulty in implementing it is associated with computing the multiple (2P−tuple) integral in (6.4). It is hard to solve in analytical closed form, and thus we have been working on two approaches to compute an approximation: numerical integration, and Monte Carlo simulation of the E−Stage (Tanner, 1996). The procedure adopted here is based on the first approach, and consists of:

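a) Generating, for each quantity of interest, a grid of equally spaced values between their corresponding bounds;

b) Estimating those quantities as means of their truncated marginal posteriors, given a guess $\breve{\psi}^k$, say:

$$\hat{\beta}_i^{11} = E\left[\beta_i^{11} \mid T, \breve{\psi}^k\right] = \int_{L_i^b}^{U_i^b} \beta_i^{11}\, P(\beta_i^{11} \mid T, \breve{\psi}^k)\, d\beta_i^{11}, \qquad i = 1, \ldots, P \tag{6.7}$$

$$\hat{\beta}_i^{21} = E\left[\beta_i^{21} \mid T, \breve{\psi}^k\right] = \frac{T_i}{1 - X_i} - \frac{X_i}{1 - X_i}\, \hat{\beta}_i^{11}, \qquad i = 1, \ldots, P \tag{6.8}$$

(note that (6.7) is easily computed via numerical integration); and

c) Substituting the estimates from (6.7)−(6.8) in the following approximation for the Q-Function:

$$\hat{Q}^*(\breve{\psi}, \breve{\psi}^k) = \log P(\breve{\psi}) + \log \hat{P}(\hat{\beta}_k^{11}, \hat{\beta}_k^{21} \mid \breve{\psi}) \tag{6.9}$$

where:

$$\hat{P}(\hat{\beta}_k^{11}, \hat{\beta}_k^{21} \mid \breve{\psi}) = \left(R(\breve{\psi})\right)^{-\alpha P} \prod_{i=1}^{P} N_2(\hat{\beta}_{ik}^{11}, \hat{\beta}_{ik}^{21} \mid \breve{\psi})^{\theta} \tag{6.10}$$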
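Steps a) and b) can be sketched for a single table as follows. This is a minimal Python sketch, not the EM−EI software: the conditional density of (5.17) is not reproduced here, so `density` is a hypothetical user-supplied function returning the (possibly unnormalized) posterior density of β¹¹ given T and the current guess. The bounds are those implied by the accounting identity with both proportions in [0,1], and 0 < X < 1 is assumed.

```python
import math

def bounds(t, x):
    """Deterministic bounds on beta11 implied by the accounting identity."""
    return max(0.0, (t - (1.0 - x)) / x), min(1.0, t / x)

def e_stage_table(t, x, density, n_points=101):
    """Grid estimates of beta11, as in (6.7), and beta21, as in (6.8)."""
    lo, hi = bounds(t, x)
    if hi - lo < 1e-12:
        b11 = lo  # degenerate interval: the bounds pin the value down
    else:
        step = (hi - lo) / (n_points - 1)
        grid = [lo + j * step for j in range(n_points)]
        # Trapezoid weights: half weight at the two end points.
        w = [density(b) for b in grid]
        w[0] *= 0.5
        w[-1] *= 0.5
        # Dividing by the total weight normalizes the density over the bounds.
        b11 = sum(b * wi for b, wi in zip(grid, w)) / sum(w)
    # beta21 follows from the accounting identity.
    b21 = t / (1.0 - x) - x / (1.0 - x) * b11
    return b11, b21

# Hypothetical density: a normal kernel centred at 0.5 with sd 0.2.
kernel = lambda b: math.exp(-0.5 * ((b - 0.5) / 0.2) ** 2)
b11_hat, b21_hat = e_stage_table(0.5, 0.4, kernel)
```

Whatever density is plugged in, the pair returned always satisfies the accounting identity for the table.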

Part of the approximation to (6.4) that (6.9) represents comes from the fact that (6.10) is an approximation to (6.5). The term inside the product operator on the right−hand side of (6.10) is the untruncated bivariate normal density raised to the power θ, and the term outside is the volume of this density over the unit square, $R(\breve{\psi})$, raised to the power −αP. We shall rewrite the approximation (6.9) as:

$$\hat{Q}^*(\breve{\psi}, \breve{\psi}^k) = \log P(\breve{\psi}) - \alpha P \log R(\breve{\psi}) + \theta \sum_{i=1}^{P} \log N_2(\hat{\beta}_{ik}^{11}, \hat{\beta}_{ik}^{21} \mid \breve{\psi}) \tag{6.11}$$

Parameters α and θ should be positive and play the role of regulating the degree of approximation. They weight the elements of the truncated density: α weights the log of the truncation factor and θ the log of the untruncated density. At the moment, they are set manually (research work is in progress on optimizing them together with the other model parameters). In the exercise discussed next, good results were obtained by setting³ α = 1/P and θ = 2/P.
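The approximation (6.11) can be evaluated along the following lines. This Python sketch makes several assumptions: the parameter vector is passed in the untransformed scale as (mean, mean, variance, variance, correlation), the volume over the unit square is approximated with a plain midpoint rule, and `log_prior` is a hypothetical user-supplied function. The function names are illustrative and are not those of the EM−EI software.

```python
import math

def bvn_logpdf(b1, b2, psi):
    """Log of the untruncated bivariate normal density at (b1, b2)."""
    m1, m2, v1, v2, rho = psi
    s1, s2 = math.sqrt(v1), math.sqrt(v2)
    z1, z2 = (b1 - m1) / s1, (b2 - m2) / s2
    quad = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)
    log_norm = math.log(2.0 * math.pi * s1 * s2 * math.sqrt(1.0 - rho * rho))
    return -log_norm - 0.5 * quad

def unit_square_volume(psi, n=200):
    """Midpoint-rule approximation of the density's volume over [0,1]x[0,1]."""
    h = 1.0 / n
    total = 0.0
    for a in range(n):
        for b in range(n):
            total += math.exp(bvn_logpdf((a + 0.5) * h, (b + 0.5) * h, psi))
    return total * h * h

def q_hat(psi, b11_hat, b21_hat, log_prior, alpha, theta):
    """Weighted approximation to the Q-Function, following (6.11)."""
    p = len(b11_hat)
    log_n2 = sum(bvn_logpdf(u, v, psi) for u, v in zip(b11_hat, b21_hat))
    return (log_prior(psi)
            - alpha * p * math.log(unit_square_volume(psi))
            + theta * log_n2)

# Hypothetical inputs: two tables, a flat prior, and the weights 1/P and 2/P.
psi = (0.5, 0.5, 0.25, 0.25, 0.0)
value = q_hat(psi, [0.4, 0.6], [0.5, 0.5], lambda s: 0.0, 1.0 / 2, 2.0 / 2)
```

In an M−Stage, a function like `q_hat` would be handed to a numerical optimizer over the parameter vector, with the disaggregate estimates held fixed at their E−Stage values.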

To illustrate the methodology, we ran a simulated exercise. A total of 100 i.i.d. values for the disaggregate data pairs $\{(\beta_i^{11}, \beta_i^{21}): i = 1, \ldots, 100\}$ were drawn from a bivariate normal distribution truncated⁴ on $[0,1] \times [0,1]$, assuming for the true parameters that:

3 A reason for using these weights is that the alternative of setting α = θ = 1 in (6.10) may downweight the prior and the untruncated normal term relative to the log of the truncation factor in (6.11), which is just an approximation and not the exact Q−Function. We noted this problem in preliminary runs, where the priors had little effect in improving the estimates. In an attempt to correct this problem, we used the weights. In addition, until now, we are certain only of the concavity of the priors and of the untruncated normal term; we can say nothing in this regard about the log of the truncation factor $R(\breve{\psi})$.

4 To be strict, the disaggregate data pairs were drawn as follows: $\beta_i^b$ values were drawn from the untruncated version of the conditional normal distribution (5.17), parameterized as in (6.12), with values


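$$\breve{\psi}^T = [\breve{\mathfrak{R}}^{11}, \breve{\mathfrak{R}}^{21}, \breve{\sigma}_{11}^2, \breve{\sigma}_{21}^2, \breve{\rho}] = [0.5,\ 0.5,\ 0.25,\ 0.25,\ 0] \tag{6.12}$$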

For the aggregate data X, 100 i.i.d. values were drawn from a uniform distribution defined on the interval [0,1]; then, 100 values for $T_i$ were computed via the accounting identity (4.1). The EM-EI Software (Alpha Version), developed by the author using the MatLab Language (Mathworks, Inc.), was used to compute estimates of the parameter vector $\breve{\psi}$ and of the disaggregate data vectors $\beta^{11}$ and $\beta^{21}$ from the true aggregate data vectors T and X.
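A dataset of this kind can be generated with a few lines of Python. The sketch below is a simplification and not the author's MatLab code: it draws each pair directly from the untruncated bivariate normal with the parameters in (6.12) (zero correlation, so the two coordinates are independent with standard deviation 0.5) and rejects pairs falling outside the unit square, instead of the conditional scheme described in footnote 4. The function name, the seed and the row layout are arbitrary choices.

```python
import random

def simulate(p=100, mean=0.5, sd=0.5, seed=1):
    """Simulate (X, T, beta11, beta21) rows for p tables."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < p:
        # Zero correlation: the two coordinates are independent normals.
        b11 = rng.gauss(mean, sd)
        b21 = rng.gauss(mean, sd)
        # Rejection step: keep only pairs inside the unit square.
        if 0.0 <= b11 <= 1.0 and 0.0 <= b21 <= 1.0:
            x = rng.random()                  # aggregate X, uniform on [0,1)
            t = x * b11 + (1.0 - x) * b21     # accounting identity
            rows.append((x, t, b11, b21))
    return rows

data = simulate()
```

Only the X and T columns would be passed to an estimation routine; the two disaggregate columns are kept aside to assess the estimates afterwards.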

The basic settings used to run the EM−EI methodology over this simulated dataset were the following:

a) Priors: lognormal priors (with means = 0.5 and standard deviations = 0.35) were adopted for⁵ the variances $\breve{\sigma}_b^2$ and $\breve{\sigma}_w^2$; as the reparameterization (5.7)−(5.11) was used, this implied normal priors (with means = −0.89 and standard deviations = 0.63) for $\phi_3$ and $\phi_4$; in addition, a normal prior (with mean = 0 and standard deviation = 0.25) was adopted for $\phi_5$, the reparameterized correlation coefficient $\breve{\rho}$;

b) EM initialization: the EM scheme is initialized at $\breve{\psi}^0 = [0, 0, 1, 1, 0]$;

c) Parameter space: the optimization algorithm used in the M−Stage is restricted to search for a solution in a subset of the parameter space, described by the following limits imposed on the elements of the parameter vector: $\breve{\psi}_{low} = [-24, -24, 10^{-6}, 10^{-6}, (10^{-5} - 1)]$ and $\breve{\psi}_{up} = [100, 100, 5, 5, (1 - 10^{-5})]$;

d) Termination criteria: the EM sequence stops if $\|\hat{\phi}^k - \hat{\phi}^{k-1}\|$ is less than $10^{-5}$ or if the number of iterations exceeds 100; each M−Stage stops if the norm of the direction vector is less than $10^{-4}$ or if the number of iterations exceeds 500;

e) Disaggregate data estimation: a grid with 101 equispaced points was generated for each quantity of interest between its corresponding deterministic bounds.
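The normal prior parameters quoted in item a) follow from the lognormal moments by standard moment matching, which can be checked with the short Python sketch below. It assumes that the reparameterized quantity is the log of the variance, as stated in footnote 5; the function name is illustrative.

```python
import math

def lognormal_to_normal(mean, sd):
    """Mean and sd of log(V) when V is lognormal with the given mean and sd."""
    # Variance of log(V) from the squared coefficient of variation of V.
    s2 = math.log(1.0 + (sd / mean) ** 2)
    # Mean of log(V): log of the lognormal mean minus half that variance.
    mu = math.log(mean) - 0.5 * s2
    return mu, math.sqrt(s2)

mu, s = lognormal_to_normal(0.5, 0.35)
```

Rounded to two decimals, `mu` and `s` give −0.89 and 0.63, the values reported for the priors on $\phi_3$ and $\phi_4$.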

Table 3 displays the results of parameter estimation: 57 iterations were necessary for convergence, taking 4 min 02 s (reported as 4.02 in minutes.seconds format) on an Intel Celeron 366 MHz processor with 56 MB of RAM. Because of the reparameterization, the $\hat{\phi}^{57}$ vector ("FI Parameters") is displayed first, followed by the $\hat{\breve{\psi}}^{57}$ vector ("PSI Parameters") and the true $\breve{\psi}$ vector ("True Parameters"). The methodology provided good estimates for the means and the correlation coefficient; but, for each variance, a large negative deviation was obtained. Figure 3 displays the evolution of the LID along the 57 iterations: a clear convergence pattern is displayed, despite the absence of monotonicity⁶. In turn, Figure 4 shows the evolution of the parameter estimates: the five graphs also display a clear convergence towards a fixed value, though only three of them converged to a point quite close to the true value.

4 (continued) falling outside [0,1] being rejected until 100 valid values ∈ [0,1] were obtained; the 100 values for $\beta_i^w$ were computed via the relation (B1) from Box 1.

5 The EM−EI software uses the transformations $\phi_3 = \log \breve{\sigma}_b^2$ and $\phi_4 = \log \breve{\sigma}_w^2$ instead of $\phi_3 = \log \breve{\sigma}_b$ and $\phi_4 = \log \breve{\sigma}_w$, which are used by King (1997, p. 136) and were presented here as (5.9) and (5.10).

6 The absence of monotonicity was to be expected because, again, we are using an approximation instead of the exact Q−Function. For instance, monotonicity of the LID along EM iterations is well known to be lost when the E−Stage is performed using Monte Carlo simulations (McLachlan and Krishnan, 1997). What was unexpected were the two peaks of the LID above its limiting value, which occurred at the beginning of the EM sequence in Figure 3.


Table 4 displays the performance of the methodology for disaggregate data estimation: 52% of the points for $\beta^{11}$ and 45% for $\beta^{21}$ displayed less than 10 percentage points of absolute deviation from the true value; considering 15 percentage points, the coverage rises to 61% for both. The latter results are visualized in Figure 5. Scatter plots (a) and (b) of true versus estimated values show 61 points for $\beta^{11}$ and 61 for $\beta^{21}$ falling inside the error region of ± 15 percentage points around the 45° line (between the dotted lines). Graphs (c) and (d) display the same scatter plots but with vertical lines passing through each point; these lines represent the intervals delimited by the deterministic bounds. These two graphs show a concentration of larger intervals around estimates with value 0.5, indicating a high influence of interval size on the performance of the methodology⁷. The best predictions are for points with shorter intervals; the bulk of the points lying outside the dotted lines (graphs (a) and (b)) have larger intervals, which stay around the value 0.5.

Table 3. Parameters estimation results

EMEI: EM ALGORITHM FOR ECOLOGICAL INFERENCE
RESULTS OF ESTIMATING KING'S BASIC NORMAL MODEL

* Data File : datsmpl.mat
* Nobs : 100
* E-Stage Option : 4
* M-Stage Option : 1
* Number of iterations : 57 (Max. = 100; Tol. = 1e-005)
* CPU Time in Minutes.Seconds: 4.02
* Reparameterization Used with Priors.
* FI PARAMETERS VALUES:
  FI1 = -0.0258   FI2 = -0.0319   FI3 = -1.1224   FI4 = -0.9482   FI5 = 0.0205
* PSI PARAMETERS VALUES:
  Rb = 0.4908   Rw = 0.4872   Vb = 0.1060   Vw = 0.1501   Rho = 0.0205
* TRUE PSI PARAMETERS VALUES:
  Rb = 0.5000   Rw = 0.5000   Vb = 0.2500   Vw = 0.2500   Rho = 0.0000
* Likelihood of Incomplete Data:
  Raw = 8.970e+011   Log = 27.5223

Source: Output from EM-EI Software (Alpha Version)

7 A desirable result would be for the performance of the methodology to be less dependent on the bounded intervals and more dependent on the model form in cases where those intervals are large, say, when they are near or equal to the [0,1] interval. Graphs (c) and (d) of Figure 5 show that, in the present exercise, there is a visible tendency for the quantities of interest to be predicted close to the value 0.5 when the bounded intervals are near or equal to [0,1].


Figure 3

Source: Output from EM-EI Software (Alpha Version)

Figure 4. EM sequences of parameters guesses for the EI normal model

Source: Output from EM-EI Software (Alpha Version)


Table 4. Disaggregate (complete) data estimation results

EM-EI: EM ALGORITHM FOR ECOLOGICAL INFERENCE
FITTING OF ESTIMATED AGAINST SIMULATED COMPLETE DATA

P.ABS. <= .10(b): 52.00 %     P.ABS. <= .15(b): 61.00 %
P.ABS. <= .10(w): 45.00 %     P.ABS. <= .15(w): 61.00 %
MAE(b) : 0.3727               MAE(w) : 0.3805
MSE(b) : 0.1877               MSE(w) : 0.1897
P.ABS. <= MAE(b) : 93.00 %    P.ABS. <= MAE(w) : 93.00 %
P.ABS. <= MSE(b) : 73.00 %    P.ABS. <= MSE(w) : 72.00 %

Source: Output from EM-EI Software (Alpha Version)
Note: MAE = Mean Absolute Error; MSE = Mean Squared Error.
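Fit summaries of the kind reported above can be computed along the following lines. The Python sketch uses the textbook definitions of the mean absolute error and the mean squared error and a hypothetical function name; the exact conventions of the EM−EI software are not documented here, so its output should not be expected to match this sketch figure by figure. The input values are invented for the example.

```python
def fit_summary(true_vals, est_vals):
    """Coverage shares and error measures of estimates against true values."""
    errs = [abs(e - t) for t, e in zip(true_vals, est_vals)]
    n = len(errs)
    return {
        # Share of points within 10 and 15 percentage points of the truth.
        "within_10": 100.0 * sum(e <= 0.10 for e in errs) / n,
        "within_15": 100.0 * sum(e <= 0.15 for e in errs) / n,
        # Textbook mean absolute error and mean squared error.
        "mae": sum(errs) / n,
        "mse": sum(e * e for e in errs) / n,
    }

# Hypothetical true and estimated proportions for four tables.
summary = fit_summary([0.2, 0.5, 0.8, 0.4], [0.25, 0.5, 0.5, 0.52])
```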

Figure 5. Graphics of disaggregate (complete) data estimation results

(a) (b) (d) (e)

Source: Output from EM−EI Software (Alpha Version)


Though in need of improvement, these results are reasonable. Benoit and King (1998) developed the EzI Software to implement King’s EI method. If we run their software on the same simulated dataset, we find that the results we obtained with the EM−EI methodology at least parallel those of King’s maximum likelihood/posterior method, as we can see from Tables 5 and 6. The first of these tables summarizes parameter estimation performance. EzI runs substantially faster and produces quite similar estimates for the means; however, while its estimate for $\breve{\mathfrak{R}}^{b}$ is quite close to the true value 0.5, its estimate for $\breve{\mathfrak{R}}^{w}$ is worse than that produced by EM−EI. EzI’s estimates of the variances are poor, even poorer than those of EM−EI. The same happens with the estimate of the correlation coefficient $\breve{\rho}$, which was quite satisfactorily predicted by EM−EI⁸.

Table 5. Parameters estimation results using King’s EI method

RESULTS OF ESTIMATING KING'S BASIC NORMAL MODEL

* Data File : datsmpl.dat
* Nobs : 100
* CPU Time in Minutes: 0.4413 ( ≅ 26 seconds)
* Reparameterization Used with Priors.
* FI PARAMETERS VALUES:
  FI1 = -0.0194   FI2 = -0.1212   FI3 = -2.7306   FI4 = -1.0087   FI5 = 0.1055
* PSI PARAMETERS VALUES:
  Rb = 0.4939   Rw = 0.4255   Vb = 0.0652   Vw = 0.3647   Rho = 0.1051
* TRUE PSI PARAMETERS VALUES:
  Rb = 0.5000   Rw = 0.5000   Vb = 0.2500   Vw = 0.2500   Rho = 0.0000
* Likelihood of Incomplete Data:
  Raw = 2.654e+048   Log = 111.500

Source: Elaborated by the author using results obtained with the EzI Software, from Benoit and King (1998).
Note: Variance parameters Vb and Vw correspond to the square of the standard deviations (SB and SW) produced by the EzI Software.
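The relation between the FI and PSI rows of Tables 3 and 5 can be checked numerically. The Python sketch below uses the mean relation that appears in (5.20) and (5.21) without explanatory variables and, as an assumption that is not restated in this section, takes the fifth parameter to be Fisher's z transform of the correlation, so that the correlation is the hyperbolic tangent of that parameter. The function names are illustrative.

```python
import math

def psi_mean(phi, variance):
    """Untruncated mean from the reparameterized mean and the variance."""
    return phi * (variance + 0.25) + 0.5

def psi_rho(phi5):
    """Correlation from phi5, ASSUMING phi5 is Fisher's z transform of rho."""
    return math.tanh(phi5)

# Values copied from the FI and PSI rows of Tables 3 and 5.
table3 = {"FI1": -0.0258, "FI2": -0.0319, "FI5": 0.0205, "Vb": 0.1060, "Vw": 0.1501}
table5 = {"FI1": -0.0194, "FI2": -0.1212, "FI5": 0.1055, "Vb": 0.0652, "Vw": 0.3647}

checks = {}
for name, t in (("Table 3", table3), ("Table 5", table5)):
    checks[name] = (
        round(psi_mean(t["FI1"], t["Vb"]), 4),  # compare with Rb
        round(psi_mean(t["FI2"], t["Vw"]), 4),  # compare with Rw
        round(psi_rho(t["FI5"]), 4),            # compare with Rho
    )
```

Under these assumptions, the recomputed values agree with the tabulated Rb, Rw and Rho to four decimals in both tables.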

Table 6 displays EzI’s results concerning disaggregate data estimation. The comparison with EM−EI’s performance presented in Table 4 shows that the results produced by the two approaches are practically the same. This is shown more clearly in the graphs of Figure 6. Scatters (a) and (b) in this figure are quite similar to scatters (a) and (b) in Figure 5. The same is true regarding graphs (d) and (e) of both figures, though a slightly wider spread of the larger vertical lines around the estimated value 0.5 occurs in Figure 6. This slight difference, however, is not enough to be reflected in the numbers of Table 6. On the other hand, it points, also in the case of King’s EI method, to the same problem of a high dependency of the disaggregate data estimates on the sizes of the bounded intervals, which we noted before.

8 We must note here that the priors used in EzI and those used in EM−EI, as explained before, are different, and this may be a relevant factor in explaining the differences between the two methods. According to King (1997, pp. 138−139), EzI uses half normal distributions with variance 0.5 as priors for the two variances $\breve{\sigma}_b^2$ and $\breve{\sigma}_w^2$, and a normal distribution with mean 0 and standard deviation 0.5 as a prior for $\phi_5$.


Table 6. Disaggregate (complete) data estimation results using King’s EI method

FITTING OF ESTIMATED AGAINST SIMULATED COMPLETE DATA
(No. of simulation points used for estimation: 100)

P.ABS. <= .10(b): 48.00 %     P.ABS. <= .15(b): 62.00 %
P.ABS. <= .10(w): 49.00 %     P.ABS. <= .15(w): 60.00 %
MAE(b) : 0.3751               MAE(w) : 0.3750
MSE(b) : 0.1936               MSE(w) : 0.1849
P.ABS. <= MAE(b) : 91.00 %    P.ABS. <= MAE(w) : 95.00 %
P.ABS. <= MSE(b) : 73.00 %    P.ABS. <= MSE(w) : 69.00 %

Source: Elaborated by the author using results obtained with the EzI Software, from Benoit and King (1998).

Figure 6. Graphics of disaggregate (complete) data estimation results using King’s EI method

(a) (b) (d) (e)

Source: Output from EM−EI Software (Alpha Version) using Beta11 and Beta21 estimates produced with the EzI Software, from Benoit and King (1998).


7. Final Remarks

In this paper, we presented new developments in EI research, arguing that it is a promising field for advances in DDE, since a number of new methods have been proposed recently. The applicability of EI techniques to particular types of disaggregation problems arising in business planning was pointed out, and their usefulness as additional tools for OR analysts was stressed. We also made a brief review of the recent EI literature, focusing on the innovative work by King (1997), namely his EI normal model. It was presented in detail, noting its sharp methodological improvements over previous EI approaches, including its adoption of modern techniques of intensive computing to provide reliable estimates of the disaggregate data.

Basic limitations of King’s model were pointed out, like the absence of formal methods to diagnose both goodness of fit and the underlying distribution of the quantities of interest. In our view, however, such limitations seem to be transient, in the sense that additional research efforts will probably fill, in the future, the room for improvement. A major limitation of EI research is the absence of a consistent implementation of an EI model for the general R×C tables problem. King (1997) did provide a generalization of his EI normal model, but not an implementation; and KRT’s model is still for the 2×2 tables case.

In the second part of the paper, we presented preliminary results of research towards the implementation of an alternative method, based on the EM Algorithm, to estimate King’s EI normal model. Though preliminary, these results are encouraging, as they show that the EM−EI approach works and that its estimation performance is similar to that of King’s EI method in the comparison between the two approaches that we presented using a simulated dataset. Of course, more is to be done to improve EM−EI’s estimates of the disaggregate data, to make it run faster and to implement it for EI problems with R×C tables. While facing these challenges, we hope to achieve some new developments soon.

References

Achen, C. H. and W. P. Shively (1995). Cross−level inference. Chicago: University of Chicago Press.

Benoit, K. and G. King (1998). EzI: An easy program for ecological inference. Draft. July, 28.

Cho, W. K. T. (1997). Structural shifts and deterministic regime switching in aggregate data analysis. Draft Paper. University of Illinois.

Cho, W. K. T. (1998). Iff the assumption fits... A comment on the King ecological inference solution. Political Analysis, 7 (in press).

Duncan, O. D. and B. Davis (1953). An alternative to ecological correlation. American Sociological Review, 18, 665−666.

Ferree, K. E. (1999). Iterative approaches to RxC ecological inference problems: where they can go wrong. Draft Paper. Department of Government, Harvard University.

Freedman, D. A., S. P. Klein, M. Ostland and M. Roberts (1998). On solutions to the ecological inference problem. Technical Report no. 515. Berkeley: Statistics Department, University of California, Berkeley.

Goodman, L. (1953). Ecological regression and the behavior of individuals. American Sociological Review, 18, 663−664.

King, G. (1997). A solution to the ecological inference problem: reconstructing individual behavior from aggregate data. Princeton: Princeton University Press.

King, G. (1999). The future of Ecological Inference research: a reply to Freedman et al. Journal of the American Statistical Association, March (in press).


King, G., O. Rosen and M. A. Tanner (1998). Binomial−beta hierarchical models for ecological inference. Sociological Methods and Research (in press).

Lewis, J. B. (1998). A consideration and test of King’s ecological inference estimator. Draft Paper. Princeton: Princeton University.

Mattos, R. S. and A. Veiga (1999). Estimação de dados desagregados através de Regressão Ecológica. Biannual Schools of Regression Models from the Brazilian Statistical Association.

McLachlan, G. J. and T. Krishnan (1997). The EM Algorithm and extensions. Wiley Series in Probability and Statistics. New York: Wiley.

Novaes, A. G. (1982). Modelos em planejamento urbano, regional e de transportes. São Paulo: Edgard Blücher.

Penubarti, M. and A. A. Schuessler (1998). Inferring micro− from macrolevel change: ecological panel inference in surveys. Annual meetings of The American Political Science Association.

Rivers, D. and W. K. T. Cho (1997). Yet another solution to the ecological regression problem. Paper in progress. University of Illinois.

Rivers, D. (1998). Nonparametric estimation of ecological regression models. Annual meetings of The American Political Science Association.

Schoenberg, R. (1996). Constrained maximum likelihood. Draft Paper. University of Washington.

Whitehouse, G. E. and B. L. Wechsler (1976). Applied operations research: a survey. New York: Wiley.

Wilson, A. G. (1970). Entropy in urban and regional modelling. Monographs in spatial and environmental systems analysis. London: Pion.