27
African Census Analysis Project (ACAP) UNIVERSITY OF PENNSYLVANIA Population Studies Center 3718 Locust Walk Philadelphia, Pennsylvania 19104-6298 (USA) Tele: 215-573-5219 or 215-573-5169 or 215-573-5165 Fax: 215-898-2124 http://www.acap.upenn.edu Email: [email protected] Multivariate Analysis Using Grouped Census Data: An Illustration on Estimating the Covariates of Childhood Mortality Amadou Noumbissi, Tukufu Zuberi and Ayaga A. Bawah ACAP Working Paper No 12, October 1999 This research was done as part of the African Census Analysis Project (ACAP), and was supported by grants from the Rockefeller Foundation (RF 97013 #21; RF 98014 #22), from Andrew W. Mellon Foundation, and from the Fogarty International Center and the National Institute of Child Health and Human Development (TW00655-04). We would like to thank Timothy Cheney for computer programming assistance.

African Census Analysis Project (ACAP) · nts. We use the grouped file to demonstrate how to estimate the covariates of childhood mortality at the individual level using selected

  • Upload
    dongoc

  • View
    217

  • Download
    0

Embed Size (px)

Citation preview

African Census Analysis Project (ACAP) UNIVERSITY OF PENNSYLVANIA

Population Studies Center 3718 Locust Walk Philadelphia, Pennsylvania 19104-6298 (USA)

Tele: 215-573-5219 or 215-573-5169 or 215-573-5165 Fax: 215-898-2124 http://www.acap.upenn.edu Email: [email protected]

Multivariate Analysis Using Grouped Census Data: An Illustration on Estimating

the Covariates of Childhood Mortality

Amadou Noumbissi, Tukufu Zuberi and Ayaga A. Bawah

ACAP Working Paper No 12, October 1999 This research was done as part of the African Census Analysis Project (ACAP), and was supported by grants from the Rockefeller Foundation (RF 97013 #21; RF 98014 #22), from Andrew W. Mellon Foundation, and from the Fogarty International Center and the National Institute of Child Health and Human Development (TW00655-04). We would like to thank Timothy Cheney for computer programming assistance.

Recommended citation: ACAP W.P. # 12: Amadou Noumbissi, Tukufu Zuberi and Ayaga A. Bawah. 2000. Multivariate Analysis Using Grouped Census Data: An Illustration on Estimating the Covariates of Childhood Mortality. ACAP Working Paper No 12. October 1999. The African Census Analysis Project (ACAP), Population Studies Center, University of Pennsylvania, Philadelphia, Pennsylvania.

Multivariate Analysis Using Grouped Data

us micro-data

can be made

y of African

governments. The data were grouped as part of the African Census Analysis Project

dissemination strategy. Employing both the grouped and ungrouped census micro-data

rate through different regression procedures that the grouped

and the ungrouped census micro-data yield identical results. The advantages of using

grouped data are then discussed.

Abstract

This paper presents a procedure for using grouped data from African cens

for multivariate regression analysis. Grouped African census micro-data

available to researchers without violating the concerns of confidentialit

from Zambia, we demonst

1

Multivariate Analysis Using Grouped Data

Multiv : An Illustration on Estimating the Covariates of Childhood Mortality

stration of the

e did this by

using a grouped file created from the 1990 Zambia census micro-data archived by the

African Census Analysis Project (ACAP). Grouping the data in this form reduces the

importantly,

s micro-data

of African

nts. We use the grouped file to demonstrate how to estimate the covariates of

childhood mortality at the individual level using selected variables from the Zambia 1990

census data.

ormation for

us micro-data

rican nations.

rchers, administrators, and scholars should have easy access to African

census data. To this end, in the future ACAP will allow access to the grouped data over

the Internet. Most African nations are receptive to this idea and this paper is a first step in

realizing this objective.

While the World Fertility Surveys (WFS) and its successor, the Demographic and

Health Surveys (DHS), have been useful in helping researchers to study various

ariate Analysis Using Grouped Census Data

Introduction

In this article, we estimated the covariates of childhood mortality as an illu

potential for using grouped census data to do individual-level analysis. W

number of records thereby substantially reducing computation time. More

using grouped data in this way allows ACAP to make African censu

available to researchers without violating confidentiality concerns

governme

African census micro-data are an invaluable source of inf

understanding demographic processes; ACAP has created a unique cens

collection. The current collection consists of over 47 censuses from 25 Af

Government resea

2

Multivariate Analysis Using Grouped Data

demographic phenomena in most of the developing world, including sub-Saharan Africa,

these data do not adequately lend themselves to the study of spatial distributions of

social factors

ata

henomena at

od mortality

differentials we employ different regression procedures to show that results from the

group data are invariant to those from the individual level data.

a tabular or

is (Noumbissi

1996; Allison 1999). The grouped file is simply a tabular data set that can be analyzed

using event/trials or equivalent syntax. The data are grouped using the same technique as

allows data to

ix-like format

1999). The process summarizes the individual raw data and the data are pre-

sum AS 1999) by

or dependent

variables.

This paper demonstrates that grouped census data can be used to estimate the

covariates of socio-demographic phenomena such as childhood mortality. Although there

is no practical limitation on the number of sub-cubes and the number of analysis or

demographic phenomena and their relationships to various ecological and

because of small sample size constraints at sub-national levels. In contrast, census d

provide the large numbers of cases required for adequate analysis of these p

sub-regional and lower levels. In our illustrative analysis of childho

Grouping Procedure

From individual records in the census file we summarize information in

group format that can be used for both aggregate and non-aggregate analys

the multidimensional database or MDDB, a specialized storage facility that

be pulled from a data warehouse or other data sources for storage in a matr

(SAS

marized and stored as “NWAY cubes and zero and more sub-cubes” (S

distinguishing the classification or independent variables from analysis

3

Multivariate Analysis Using Grouped Data

dependent variables, we limit the analysis in this paper to a few variables in order to

facilitate the demonstration. The variables employed are: place of residence (urban or

and mother’s

les for the

interest,, the

five selected

independent variables, we created a file containing 2026 records and five variables (see

an extract from the data table in Table 1).

le 1 mp e bt fro nsu cro-data

rural), province or region of residence, level of education, marital status,

age. From the census micro-data we create a data table of independent variab

total number of women who possess the characteristics or attributes of

number of children ever born, and the number who have died. With the

Tab : Exa le of the type of r cords o ained m ce s mi

15-19 Urba a 3 68 187 n No Education Married Lusak 180 11

20-24 Urba a 5 32 1161 n No Education Married Lusak 424 70

25-29 Urba a 1 031 2209 n No Education Married Lusak 442 13

30-34 Urba a 3 798 3232 n No Education Married Lusak 431 18

Age

group

Residence Education Marital

status

Province Number

of women

Children

ever born

Children

dead

35-39 Urban No Education Married Lusaka 3687 19925 3361

40 04 -44 Urban No Education Married Lusaka 3864 23354 47

45-49 Urban No Education Married Lusaka 3064 18748 4313

We group mother’s age, the first variable, into seven five-year age categories. The

second variable, place of residence has two values, urban and rural; while the third

variable, level of education completed, has four values corresponding to no education,

primary, secondary, and college or higher level. The fourth variable, marital status, has

five values (never married, currently married, divorced, separated, and widowed); and the

4

Multivariate Analysis Using Grouped Data

last variable, province of residence, has nine values corresponding to eight provinces of

the country and Lusaka, the main city itself constituting one region that is 16.7 percent

4*5*9=2520;

variables do not have values in some cells, the actual

gro

lier, grouping

the data in this form reduces substantially the number of records in the file, which

reduces computing time for statistical analysis and allows data analysis over the Internet.

Thi e populations

s of African

he analysis in

this paper demonstrates that even without access to the actual individual-level micro-data,

it is possible to estimate demographic events accurately with the grouped or tabular data

tional census.

inear models,

een variables.

hroughout the paper. The variables were chosen from

those typically found in African census data. They are not an exhaustive list of covariates

that might affect mortality. Our choice of variables is meant to be illustrative of the type

of analysis associated with child mortality.

rural and 83.3 percent urban. The expected number of records is 7*2*

however, because some of the

uping yielded 2026 records.

What is the advantage of grouping data in this form? As noted ear

s is especially advantageous when using census micro-data, where whol

are enumerated and often involve millions of records.

More importantly, grouped data allow us to respect the concern

governments and to maintain confidentiality of their national data. Indeed, t

that can be generated using the individual-level micro-data from the na

Also, by presenting data in this tabular form, it becomes easy to use log-l

factor analysis, or correspondence analysis to examine the relationship betw

We used a simple set of variables t

5

Multivariate Analysis Using Grouped Data

An Illustration with Multivariate Mortality Analysis

Over the past 25 years, Brass-type estimates of childhood mortality in developing

972; Trussell

aus routinely

retrospective

births and the

survival status of the resulting children. While census bureaus tend to use such estimates

to examine mortality trends, they are also important for understanding differentials in

regarding the

ild mortality.

e of maternal

education for child mortality in Ibadan (Caldwell 1979). Other researchers have also

used the Brass-type questions collected in African censuses to investigate childhood

and Preston

howing large

y, Hamid and

ifferentials of

infant and child mortality among the regions of Egypt by sex and socioeconomic status.

They discovered large variations in the levels and differentials of infant and child

mortality in the country, with those in Upper Egypt having the highest probabilities of

dying. A 15-country study by the United Nations (1985) also estimated socioeconomic

differentials in childhood mortality noting differences within different social and

countries have become standard procedures (Brass et al. 1968; Sullivan 1

1975; Preston and Palloni 1977; Trussell and Preston 1982). Census bure

make and report such estimates. These census estimates are based upon

reports given by women regarding their past cumulative number of live

child mortality.

Brass-type analyses have been the basis of several studies

relationship of social and economic factors associated with levels of ch

Caldwell used such a strategy in his multivariate analysis of the importanc

mortality differentials in Africa. Using the 1976 Sudan census data, Farah

(1982) for instance, estimated child mortality differentials in the Sudan, s

differences between the northern and southern parts of the country. Similarl

Ahmed (1988) estimated the levels, trends (from 1976 to 1986), and d

6

Multivariate Analysis Using Grouped Data

economic groups, both within and between countries. Using the 1900 and 1910 United

States census data, Preston and Haines (1991) estimated historical mortality differences

betw .

iate statistical

ass-type data

community).

These authors compared various statistical models for estimating the association between

childhood mortality and socioeconomic and cultural characteristics of the family,

ical evidence

assumed that

g the same

indexes that require the selection

of a mortality standard for their computation. Their indexes are incorporated into a

regression model to estimate the covariates of childhood mortality.

deaths (Oi) to

men based on

; Preston and

(WLS) or the

Tobit regression model. For the WLS regression, Preston and Haines (1991) suggest the

as a weight. The Tobit model is recommended

when the dependent variable is truncated or bounded, as is the case with the Trussell and

Preston Mortality Index (Trussell and Preston 1982).

een different social, income, and racial groups at the turn of the century

Trussell and Preston (1982) provide the first review of multivar

procedures for estimating the association between child mortality from Br

and various social and economic processes (i.e., family, household, and

household, and community in which the child is born. Considering empir

from the analysis of sets of model life tables, Trussell and Preston (1982)

the patterns of mortality should be unique for individuals possessin

characteristics. They then proposed a series of mortality

The Mortality Index is computed based on the observed number of

women of a certain age and the expected number of deaths (Ei) to those wo

the pattern of mortality from a selected standard (Trussell and Preston 1982

Haines 1991).1 They suggested either the use of weighted least squares

use of the number of children ever born

1-- For an evaluation of these indexes, see Noumbissi (1996).

7

Multivariate Analysis Using Grouped Data

In our analysis, we considered different models using both the grouped and the

ungrouped data, including those recommended by Trussell and Preston (1982). We

review below the different models used in this paper.

The Tobit model is defined using the following equation as a latent variable:

ik are a set of K dummy

independent variables measured on the ith woman. Using this latent variable, a censored

random variable ki is defined as followed:

hed by Brass

bit models to

estimate the covariates of mortality from both grouped and ungrouped data. Brass (1975)

used standard mortality and fertility models and established that the proportion of

children dead for women aged x at the time of census or survey (Dx) is approximately

equal to the probability of dying from birth to age ax (ax ranging from 0 to x–α):

0k ifk ** >= iiik

The Tobit Model

Where k* is the estimated morta

iikkiii ezzzk +++++= ββββ ...22110*

lity index and Zi1, Zi2… Z

The Logit, Complementary Log-Log, and Probit Models

We also propose a simplified procedure based on the relationship establis

(1975) that allows the use of either the logit, complementary log-log, or pro

0k if0 * == iik

8

Multivariate Analysis Using Grouped Data

where f(y) is the age-specific annual fertility rate for women age y, an

probability of a child’s death in the interval from birth to age x–y. Age a is u

),()(

)()(xx

x

aqdyyf

dyyxqyfDx ≅

−=

∫∫

α

α

d q(x–y) the

x nknown and

the principle of the Brass method for estimating the level of mortality involves

establishing correspondences between the age of the mother at the time of the census and

the age of children reported by the woman (Preston, Heuveline, and Guillot 2001).

ather than the

dead as an

ix n of children

dead for woman i at age x, which is the underlying probability of dying from birth to age

x–m for the children of this woman. Dix/(1–Dix) can be translated as the odds and if Zi1,

Zi2… Zik are a set of K dummy independent variables measured on the ith woman, the

logit regression can be used as follows (see Long 1997; Menard 1995; Halli and Rao

1992; Allison 1999):

This then allows us to us variates of child

mortality. The estimates for this model can be obtained by using the child or the mother

as the unit of analysis. The results are invariant to the unit of analysis. Note that unlike

the Mortality Index, which is based on the observed and expected child deaths for each

Since we are interested in estimating the covariates of mortality r

levels, this relationship allows us to use the proportion of children

approximation of the force of mortality. Therefore, let D be the proportio

Logit D zix j ij( ) ln=−

= + ∑0β βD

Dix

ix j

k

=1 1

e simple logistic regression to estimate the co

9

Multivariate Analysis Using Grouped Data

woman, our estimates are based on the survival status of each child. We create a

dichotomized variable (children dead and children surviving) as the dependent variable,

and

. The logistic

of assuming

ay assume the Gompertz distribution and

use the complementary log-log models defined as follow:

use the logistic model to estimate the covariates of childhood mortality.

Each of the estimated models assumes a pattern of underlying errors

regression assumes that this pattern follows a logistic distribution. Instead

that the errors have logistic distribution, one m

∑=

umes that the errors have a normal

distribution with a conditional mean of zero and a variance of one (Long 1997).

Following this assumption the probit model is defined as follow:

φ–1 probit) of a

standard normal variable (Allison 1999).

The Negative Binomial Model

ildren born or

dead—it is possible to use the negative binomial regression model, which is based on

count data. The negative binomial model is a generalization of the Poisson regression

model and is better suited to handling observed and unobserved heterogeneity among

individuals (Long 1997). The number of children dead for each woman (yi) should have a

Poisson distribution with a conditional mean (λi) that depends on individual observed

(Zi1, …Z2k) and unobserved (gI) characteristics. Since the conditional mean λi depends

+=−−=k

jzDDC

1))1ln(ln()log(log ββ

We also employ the probit model, which ass

ijjixix 0

∑=

− +=j

ijjix zD1

01 )( ββφ

k

where is the inverse of the cumulative distribution function (called

Considering the questions asked in the census—a count of the number of ch

10

Multivariate Analysis Using Grouped Data

first on the level of fertility of the woman and the length of time her children are exposed

to death, an offset variable (v) is incorporated into the model as follow:

or

The probability of having yi children dead given the woman’s characteristics is then

given by:

We control for the level of fertility of the woman by creating an offset variable

constituted by the number of children ever born to the women i (Bi), thus vi=Bi. We used

age of mother (ai) to control for the duration of the children’s exposure to the risk of

mortality. The coefficients βj are then estimated by a maximum likelihood function

assu s a gamma distribution with parameters 1/α that is equal to the

var

The Log-Linear Model

The grouped data also allow us to do exploratory data analysis using log-linear

models, which test hypotheses about the distribution and/or statistical independence

between variables without necessarily making a distinction between dependent and

independent variables (Sobel 1995). All the variables investigated by log-linear models

)exp(),|(1

0 i

k

jijjiiijii zvzyE εββελ ++== ∑

=

iijjii zv εββλ +++= ∑0)log(log k

j=1

),|(Pri

iiiiii

zy ε =!

)(exp)( y

iiji y

vv i ελελ −

ming that exp(gI) ha

iance of exp(gI).

11

Multivariate Analysis Using Grouped Data

are treated as “response variables” and should be categorical and built in a multi-way

contingency table, as will be for grouped data. Log-linear models are useful in exploring

ity rates give

as mortality (Clogg and

Eliason 1987). The saturated log-linear model for rates is defined as follows:

data presented in a contingency table. The log-linear models for mortal

identical results to the logit models when events are rare such

klmijklm

ijklmO ABCDE

ijDEkm

ABij

Em

Dl

Ck

Bj

Aiceb

λλλλλλλλµ ++++++++++= ......log

ijklm ijklm

number of children ever born in the cell ijklmn. The high number of parame

by this model is the main weakness of log-linear analysis (Clogg and E

Once there are more than three variables in the models, interpreting the re

saturated model becomes difficult. The use of correspondence analysis, deve

Where O and ceb are, respectively, the expected number of children dead and the

ters produced

liason 1987).

sults from the

loped for the

analysis of contingency tables (Benzécri 1969; Goodman 1991), may also be useful in

such cases, especially where there is possible interaction among the independent

the

grouped data and those from the individual-level data (Tables 2 and 3). As noted earlier,

the grouped data estimates are based on events/trial syntax for the Logit, Probit and

complementary log-log models. The death of a child is the event and the trials represent

the number of children ever born, thus the outcome of this ratio is an approximation of

the force of mortality as established by Brass (1975). Consequently, each child is

variables. But the topic is beyond the scope of this paper.

Results

In all the models, we have compared the covariates of mortality estimated from

12

Multivariate Analysis Using Grouped Data

considered as the unit of analysis for these models. Table 2 gives the results of the

regression on the vector of selected variables using both individual and grouped data.

Table 2: Logistic, Probit, and Clog-log resu tim and individual-level data on the odds of ch rt

Logit ficients Probit Coeffic C Coefficients

lts illustrating es ates from groupedildhood mo ality

Coef ients log-logGr ed In vidual rouped Individua Group

Intercept 58* .4 -1 -3.446*** -3.4 ** -3 50*** .90 * 1** -1.898*** -3.454***Age

9 * 0.17 -0 -0.158*** 24 (Ref) ---- -- ----

9 47* 0.04 0. ** 0.044*** 4 38* 0.03 0. ** 0.037***

040** 0.039 0.0 0 * *** 0.039*** .159** 0.158 0.0 0 * *** 0.146***

48* 0.24 0. ** 0.224*** ince

tral -0.012* -0.012* -0.0 -0 * 10 -0.010 .00 -0. - 2 -0.003 00* .39 0. ** 0.357***

ula 90* 0.38 0. ** 0.348*** ---- ---- ---- ---- ---- ----

rthern 10* 0.20 0. ** 0.191*** estern -0.116*** -0.118* -0.06 -0. 3*** -0.105***

n -0.055*** -0.054*** -0.032 ** -0.031*** - *** -0.047*** n 92* 0.39 0. ** 0.352*** f Residen

0.205*** 0.204* 0.11 0. ** ** 0.187*** ---- ---- --- - ---- ----

l 60* 1.55 0. ** 1.481***

y 53* 1.35 0. ** 1.295*** dary 94* 0.89 0. ** 0.866***

---- ---- ---- atus

arried -- -- --- ---- ied (Ref) 7 0 0.115***

1* 0. 0.200*** 98*** 0.296*** 0.167*** 0.166*** 0.267*** 0.266***

Wid wed 0.309*** 0.308*** 0.174*** 0.174*** 0.276*** 0.276*** Number of cases 2026 1053033 2026 1053033 2026 1053033

15-1 -0.173 ** - 3*** .095*** -0.095*** -0.158***20- -- ---- ---- ----25-2 0.0 ** 7*** 025*** 0.025*** 0.044*30-3 0.0 ** 8*** 018*** 0.018*** 0.037*35-39 0. * *** 18*** .018** 0.04040-44 0 * *** 86*** .085** 0.14645-49 0.2 ** 7*** 138*** 0.138*** 0.224*Prov Cen 08* .008 -0.0Copper Belt -0 3 004 0.003 -0.003 -0.00Eastern 0.4 ** 0 8*** 227*** 0.226*** 0.359*Luap 0.3 ** 7*** 220*** 0.219*** 0.351*Lusaka No 0.2 ** 9*** 116*** 0.116*** 0.192*North-W ** 6*** 067*** -0.10Souther * 0.048Wester 0.3 ** 1*** 221*** 0.221*** 0.353*Place o ce Rural ** 3*** 112* 0.188*Urban (Ref) - --- Education No Schoo 1.5 ** 9*** 783*** 0.783*** 1.482*Primar 1.3 ** 2*** 665*** 0.665*** 1.297*Secon 0.8 ** 4*** 423*** 0.424*** 0.866*College (Ref) ---- ---- ----Marital StNever M -- -- - ---- ---- Marr 0.12 *** 0.125*** .069*** 0.068*** 0.116***Separated 0.22 ** 0.222*** 123*** 0.1 ** 23* 0.200***

oup

data di data

Gdata

l data

ed data

Individual data

Divorced 0.2o

Deviance (Value/DF) 4.654*** 4.648*** 4.768*** 4.761*** 4.608*** 4.601***-2Log L 4597599.7*** 4592418.8***4597827.4*** 4592644.8***4597505.8*** 4592324.9***Likelihood ratio 109673.8*** 109155.5*** 109446.0*** 108929.5*** 109767.7*** 109249.4***DF 22 22 22 22 22 22* P<0.05; **P<0.01; ***P<0.001

13

Multivariate Analysis Using Grouped Data

Results from the grouped data reproduce exactly the same estimates as those

obtained from the individual-level data. Slight differences occur however, in the

(B=0) is the

ce is slightly

ata estimates.

, are invariant

to grouping insofar as the groupings are identical with respect to the variables in the

model (Allison 1999). Allison (1999) argues that the deviance based on grouped data is

ul as a goodness-of-fit measure than the deviance based on individual-level

data a chi-square

ncy tables or

grouped data have a log-linear model that is exactly equivalent (see Figure 1). Because

log-linear models have more potential for analysis, grouping census micro-data in the

odels to test

variables with

ining causal theories. This may also be extended to the use of the

correspondence analysis developed for the analysis of contingency tables, and may

complement regression analysis in cases of suspected interaction and association among

independent variables.

goodness-of-fit statistics. While the –2log L used to test the null hypothesis

same for both the grouped and individual-level micro-data, the devian

smaller in the case of the individual-level estimates compared to the group d

Maximum likelihood estimates, their standard errors and the log-likelihood

more “usef

because the deviance from the individual-level data does not have

distribution.”

As noted earlier, the logit, probit and clog-log models for continge

form proposed here allows us to explore the data using log-linear m

hypotheses about the distribution and/or statistical independence between

the aim of exam

14

Multivariate Analysis Using Grouped Data

Figure 1: Comparison of results

Logistic regression

y = 0.9997x - 0.0007

R2 = 1

-0.50

0.5

11.5

2

-0.5 0 0.5 1 1.5 2

Gr oupe d da t a

Probity = 1.0005x - 0.0004

R2 = 1

-0.20000.00000.20000.40000.60000.80001.0000

-0.2000 0.0000 0.2000 0.4000 0.6000 0.8000 1.0000

Gr oupe d da t a

CLogLog

y = 0.9995x - 0.0006R2 = 1

-0.50000.00000.50001.00001.50002.0000

-0.5000 0.0000 0.5000 1.0000 1.5000 2.0000Grouped data level

Indi

vidu

al d

ata

leve

l

Negative binomialy = 0.8952x - 0.0287

R2 = 0.9595

-0.5

0

0.5

1

1.5

-0.50000 0.00000 0.50000 1.00000 1.50000 2.00000

Gr oupe d da t a

Tobit

y = 3.3553x + 0.1889

R2 = 0.7944

0

12

3

4

Gr oupe d da t a

Weight Least Squaresy = 0.9042x - 0.0212

R2 = 0.9069

-0.4-0.2

00.20.40.60.8

0.80000

Gr oupe d da t a

-1-0.20000 0.00000 0.20000 0.40000 0.60000 0.80000 1.00000

-0.40000 -0.20000 0.00000 0.20000 0.40000 0.60000

For the weighted least square, Tobit, and negative binomial regressions, slight differences

are observed between results given by the grouped data and the estimates obtained from

the micro- or individual-level data (Table 3 and Figure 1). The effect on women of

grouping data without controlling for the number of children ever born and the number of

deaths for each woman explains these differences. We correct for this effect by

incorporating the number of children ever born and the number of deaths as independent

variables when grouping the data for the results in Table 4 and Figure 2.

15

Multivariate Analysis Using Grouped Data

Table 3: Weighted least square, Tobit, and negative binomial results illustrating estimates from grouped and in n the o morta

ht u gre ial Regression dividual-level data o dds of childhood lity

Weig ed Least Sq are To t Rebi ssion N om. BinGr

dped a

In ividual ata

rouped data

dividual data

Groupeddata

Intercept 48 .1 -0 - -3.480*** -0.1 *** 0 31*** .210*** 5.226*** -3.922***Age

9 * -0.0 -0 -0 -0.144*** 24 (Ref) ---- - ---- 29 0 0. 0. 0. 0.038*** 34 2 0. 0. 0. 0.027***

0.058*** -0.024*** 0.049 0.873 ** 0.080*** 0.027*** 4 36 .1 -0.0 0 0 * 0.124***

12* -0.1 -0 0.9 0 * 0.195*** ince al 0.0 -0 -0.007

-0 0 0 -0.004 62 .3 0 0 0.316***

la 70 .3 0 0 0.314*** a --- - ----

0.177*** 0.164*** 0.140 0.379 ** 0.168*** 0.171*** estern 0.1 -0 -0 -0.092***

n -0.083*** -0.053*** -0.073 ** -0.105 -0.088*** -0.045*** n 09 .3 0 0 0.305***

of Residenc 89 .1 0 0 0.164***

f) ---- ---- -- - ----

l 49 .7 0 3 1.416*** y 32 .5 0 3 1.263***

dary 79 .2 0 2 0.864*** e (Ref) --- - ---- l Status

ed ---- --- ---- 0.2 *** 0 00*** .177*** 0.7 0 0.119*** 0.2 *** 0.181*** 0.1 0.193***

ivorced 0.3 *** 0 48*** .302*** 0.916 ** 0.486*** 0.256*** Widowed 2*** 57*** .313*** 0.970*** 0.497*** 0.258***

0.074 0.295

15-1 -0.158 ** 25*** .054*** .523*** -0.212***20- --- ---- ---- ---- 25- 0.05 *** 008 049*** 481*** 0.077***30- 0.06 *** 006 063*** 756*** 0.072***35-39 *** *40-4 0.0 *** -0 10*** 10*** .958*** .209**45-49 0.0 ** 55*** .043*** 86*** .295**Prov Centr -0.033*** - 17*** .023*** -0.012 -0.017***Copper Belt -0.024*** .001 .009*** .040*** 0.002 Eastern 0.3 *** 0 46*** .303*** .673*** 0.319*** Luapu 0.3 *** 0 32*** .306*** .706*** 0.335*** Lusak - --- ---- ---- ---- Northern *** *North-W -0.105** -* 04*** .09 * 8** .21 * 2** -0.1 ***11Souther * ***Wester 0.3 *** 0 27*** .256*** .532*** 0.284***Place e Rural 0.0 *** 0 55*** .113*** .322*** 0.129***Urban (Re -- --- ---- Education No Schoo 0.7 *** 0 26*** .833*** .229*** 1.496***Primar 0.6 *** 0 53*** .734*** .059*** 1.390***Secon 0.3 *** 0 65*** .470*** .134*** 0.989***Colleg - --- ---- ---- ---- Marita Never Marri - ---- ---- ---- Married (Ref) 17 .1 0 68 *** .352***Separated 44 97*** 0.698*** 0.386*** D 29 .2 0 *

0.36 0.2 0

ouat

dd

G In Individual data

Dispersion Number of cases 2026 1053033 2026 1053033 2026 1053033 Deviance (Value/DF) 1.053 0.912 -2Log L Likelihood ratio -1408782.5 28844288834 -655262.947 DF 0.634299 0.056775 * P<0.05; **P<0.01; ***P<0.001

16

Multivariate Analysis Using Grouped Data

Table 4: Weighted least square, Tobit, and negative binomial results illustrating estimates from grouped and i n h rrection

ial Coefficients ndividual-level data o the odds of child ood mortality after co

Weight easted L Sq euar T Coobit ef ntsficie N. BinomGr

duped ta

In ividual data

Grouped data

Individu atad

Groupeddata

Intercept 13 . - * * -3.480*** 0. 7*** 0 131*** 5.284*** -5.226** -3.454**Age

9 -0.0 -0 - ** * -0.144*** 24 (Ref) --- ---- 29 0. 0 ** * 0.038*** 34 0. 0 ** * 0.027***

-0.025** -0.024 0.884*** 0. ** 0.0 ** 0.027*** 4 .1 0. ** ** 0.124***

.156 -0. 1 ** ** 0.195*** ince al 1 . -0.007

. - 0 ** -0.004 4 . 0.316***

la 3 . 0.314*** ----

0.165** .164* 0.382*** 0. ** 0.1 ** 0.171*** estern 0 . - ** * -0.092***

n -0.053*** -0.053 -0.10 *** -0 *** -0.046*** -0.045*** n 2 . 0 ** * 0.305***

of Reside 5 . 0 ** * 0.164***

f) ---- -- ----

l 3 . 3 ** * 1.416*** y 5 . 3 ** * 1.263***

dary 6 . 2 ** * 0.864*** e (Ref) -- ----

ed -- - ----

0.0 ** 0 00*** .767*** * ** 0.119*** 0.1 ** 0.1 0 ** 0.193***

ced * 48*** .921*** 0. * 0.264*** 0.256*** ** 0.257*** 0.978*** 0.970*** 0.270*** 0.258***

Overdispersion/Scale NA NA 2.732 2.685 0.630 0.295

15-1 23 .025** 0.533*** -0.523* -0.137**20- - ---- ---- ---- ---- 25- 006 0.008 .486*** 0.481* 0.039**30- 004 0.006 .766*** 0.756* 0.029**35-39 * *** 873* 30*40-4 -0.112*** -0 10*** 970*** 0.958* 0.132*45-49 -0 *** 155*** .000*** 0.986* 0.207*Prov Centr -0.0 8*** -0 017*** -0.012 -0.012 -0.003 Copper Belt -0 002 0.001 .042*** 0.040* -0.005 Eastern 0.3 6*** 0 346*** 0.677*** 0.673*** 0.310***Luapu 0.3 2*** 0 332*** 0.712*** 0.706*** 0.315***Lusaka ---- ---- ---- ---- ---- Northern * 0 ** 379* 70*North-W -0.1 4* -0** 104 *** 0.2 **14* -0 2*.21 -0 7**.08Souther *** 3 .105Wester 0.3 9*** 0 327*** .538*** 0.532* 0.298**Place nce Rural 0.1 8*** 0 155*** .329*** 0.322* 0.157**Urban (Re -- ---- ---- ---- Education No Schoo 0.7 2*** 0 726*** .263*** 3.229* 1.388**Primar 0.5 8*** 0 553*** .092*** 3.059* 1.245**Secon 0.2 7*** 0 265*** .152*** 2.134* 0.851**Colleg -- ---- ---- ---- ---- Marital Status Never Marri -- --- ---- ---- ---- Married (Ref) 97* .1 0 0.7 8*6 * 0.115*Separated 81* 81*** .703*** 0.698* 0.202***Divor 0.248* * 0.2 0 916**

oa

d al

Individual data

Widowed 0.257*

Number of cases 56332 1053033 56332 1053033 56332 1053033 Deviance (Value/DF) 0.779 0.912 -2Log L Likelihood ratio -1416337 -1408782.5 -664393 -655263 R-Square 0.0564 0.0568 * P<0.05; **P<0.01; ***P<0.001

17

Multivariate Analysis Using Grouped Data

Figure 2: Graphic comparison of individual and grouped results (after correction)

Tobit y = 0.9897x + 0.0004R2 = 1

-1-0.5

00.5

11.5

22.5

33.5

-1.00000 0.00000 1.00000 2.00000 3.00000 4.00000

Neg. Bin.y = 1.0196x - 0.0044R2 = 0.9998

-0.4-0.2

00.20.40.60.8

11.21.41.6

-0.50000 0.00000 0.50000 1.00000 1.50000

y = 0.9928x + 0.0008

R2 = 1

-0.2

-0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.40000 0.60000 0.80000

WLSOLS

y = 0.9954x + 0.0012R2 = 0.9999

-0.2-0.1

00.10.20.30.40.50.60.70.8

-0.20000 0.00000 0.20000 0.40000 0.60000 0.80000

-0.40000 -0.20000 0.00000 0.20000

concern to African governments and researchers alike. While governments

about confidentiality when national data are made publicly available, res

scholars view such data as minefields from which to learn much about the d

Africa. On the other hand, African governments are not averse to the idea of

Discussion and Conclusion

The issue of making Africa census micro-data available publicly continues to be of

are concerned

earchers and

emography of

making their

data public in an aggregate form. In this paper, we demonstrate how researchers can

make use of such grouped data to study different demographic phenomena without losing

the advantages that are otherwise offered by the micro-data. Results of different

regression procedures on childhood mortality comparing estimates from both the micro-

level data and the grouped data yield identical results. The different regression

procedures applied to both the grouped and ungrouped data are WLS, Tobit, Logit,

18

Multivariate Analysis Using Grouped Data

Probit, Complementary log log, and Negative Binomial regression models. Each of these

procedures relies on different assumptions.

The Trussell and Preston Mortality Index uses the observed numb

ever born and the number of children surviving over the expected number of

dependent variable. The negative binomial regression uses a count of t

deaths as the dependent variable. Both the Trussell and Preston Mortality

included in the model. On the other hand, the grouped data are expecte

estimate with less unobserved variation because the same explanatory var

reflect the averaging influence of the larger group. The results for the i

grouped data are invariant in that the estimates are identical with respect to

results. Initial differences observed between the grouped and

er of children

deaths as the

he number of

Index and the

negative binomial regression use a dependent variable in the individual-level analysis that

can be expected to reflect the idiosyncratic variation unaccounted for by the variables

d to yield an

iables tend to

ndividual and

the nature of

the variables. In the case where the idiosyncratic nature of the individual-level results

has a significant impact on the estimation, we should expect to observe differences in the

individual results for the

WL valence in the

y controlling

The grouped file reduces the concern of some African governments regarding the

confidentiality of census micro-data and produces results identical to those obtained

using the actual individual-level micro-data. For the purposes of our demonstration we

use the national census of Zambia collected in 1990 and selected only five variables often

associated with childhood mortality. The selection of the variables is not meant to

S, Tobit, and negative binomial regressions result from the lack of equi

offset variable for the two forms of data. We adjusted for these differences b

for the number of children ever born and the number of surviving children.

19

Multivariate Analysis Using Grouped Data

provide an exhaustive list of factors affecting mortality, but is illustrative of variables

considered important and available in African census data.

As shown in Tables 2 and 3 the results from the different models n

but also confirm our expectations and are consistent with the literature.

among the five variables selected for demonstration, maternal education app

most important factor associated with childhood mortality. This is consiste

Caldwell 1979; United Nations 1985; Cleland, Bicego and Fegan 1992). Be

women often have better access to resources, both material and human cap

them to ha

ot only agree,

For example,

ears to be the

nt with earlier

research (Farah and Preston 1982; Hobcraft, McDonald, and Rutstein 1984; Cleland and

van Ginneken 1988; Cochrane, O’Hara, and Leslie 1980; Tabutin and Akoto 1992;

tter educated

ital, allowing

ve a comparative advantage in providing better health for their children

(Cl 1984; United

Nations 1985).

With regard to marital status, the odds of child mortality are lower for children of

other factors.

ed) children

d women have better survival chances, which is consistent with the literature.

We of the never

married, but we believe that other factors might be important in explaining these

differences.

The regions with the highest childhood mortality are the Eastern, Luapula,

Western, and the Northern regions, compared to the Copper Belt, Central, and Lusaka

regions, which reportedly have an intermediate level of childhood mortality. North-

eland and van Ginnekan 1988; Hobcraft, McDonald, and Rutstein

nonmarried women than all other marital groups, even when we control for

However, compared to the remaining groups (separated, divorced, and widow

of marrie

do not fully investigate the reason behind the advantage for children

20

Multivariate Analysis Using Grouped Data

Western and Southern provinces appear to have the lowest mortality (Central Statistical

Office 1990). As expected, childhood mortality is higher in rural areas than in urban

areas. Finally, the age effect for the Logit, Probit, Complementary log-log

binomial models also follows the expected pattern, except among the 15

, and negative

-19-year-old age

gro lot 2001).

estimation of

the covariates of childhood mortality, the procedures can easily be applied to other- social

phenomena. Considering that researchers are rarely given the census micro-data, this

dem do regression

data.

h will allow

rtake detailed

multivariate regression analysis. Theoretically there will be no limitation on the number

of dependent variables since the procedure uses the multidimensional database (MDDB)

This storage system allows data to be pulled from a data warehouse or other

source for storage in a matrix-like format (SAS 1999). Thus ACAP can make the

grouped data available to researchers online without violating the concerns of African

governments.

up, which typically produces erratic results (Preston, Heuveline and Guil

Although we demonstrated the application of the procedures with an

onstration shows that it is possible to use grouped census micro-data to

analysis and obtain results identical to those obtained from the actual micro-

ACAP is in the process of grouping African census data whic

researchers to have access to such grouped data and still be able to unde

technique.

21

Multivariate Analysis Using Grouped Data

References

llison, D. Paul. 1999. Logistic RegressA ion Using the SAS System, Theory and

Brass, William et al. 1968. The Demography of Tropical Africa. Princeton, NJ: Princeton

B stimating Fertility and Mortality from Limited and

Defective Data he University

Benzécri J. P. 1969 “Statistical analysis as a tool to make patterns emerge from data,” in

w York: Academic

f

Nigerian data,” Population Studies 33: 395.

C n, Housing and Agriculture Analytical Report, Vol. 10. Lusaka, Zambia.

equalities in childhood mortality: The 1970s to the 1980s,” Health Transition Review Vol. 2.

C . “Maternal education and child survival in developing countries: The search for pathways of influence,” Social Science and

ar analysis,”

rch 16(1): 8-44.

n on health,” nk.

in Sudan,”

view 8: 365. Goodman, L. A. 1991. “Measures, models, and graphical displays in the analysis of

cross-classified data (with discussion),” Journal of the American Statistical Association 86: 1085-1138.

Halli, Shiva S. and Vaninadha K Rao. 1992. Advanced Techniques of Population

Analysis. New York: Plenum Press.

Application. Cary, NC: SAS Institute Inc., 304 p.

University Press.

rass, William. 1975. Methods of E. Laboratories for Population Statistics, Chapel Hill: T

of North Carolina at Chapel Hill, 159 p.

Methodologies of Pattern Recognition, ed. S. Watanabe. NePress, pp. 35-73.

Caldwell, J. C. 1979. “Education as factor in mortality decline: An examination o

entral Statistical Office (Zambia). 1990. Census of Populatio

Cleland, J., George Bicego, and Greg Fegan. 1992. “Socioeconomic in

leland, J. G and J. K van Ginneken.. 1988

Medicine 27(12): 1357-1368.

Clogg, C. C. and S. Eliason. 1987. “Some common problems in Log-LineSociological Methods and Resea

Cochrane, S. H., D. J. O’Hara, and J Leslie. 1980. “The effects of educatioWorld Bank Staff Working Paper No. 405. Washington, DC: World Ba

Farah, A. A. and S. H. Preston. 1982. “Child mortality differentialsPopulation Development Re

22

Multivariate Analysis Using Grouped Data

Hamid, Abd El N. M. and F. A. Ahmed. 1988. “Infant and childhood motrends and differentials, 1986 census,” in Demographic Analysis of Data. Enlarged Sample. Volume II: Nuptiality, Fertility, Mortality aSegments of Population, compiled by Egypt, Central Agency

rtality levels, 1986 Census nd Selected

for Public Mobilisation and Statistics [CAPMAS], Population Studies and Research Centre.

Hobcraft J. N, J. W. McDonald, and S. O. Rutstein. 1984. “Socio-economic factors in

infant and child mortality: A cross national comparison,” Population Studies 38:

ited Dependent

Variables. Thousand Oaks, CA: Sage.

M ression Analysis. Thousand Oaks, CA: Sage.

. Applications

Pr Mortality in Late Nineteenth-Century America. Princeton, NJ: Princeton University Press.

Preston, H. Samuel, P. Heuveline, and M. Guillot. 2001. Demography: Measuring and Modeling Population Process. Oxford: Blackwell Publishers Inc.

Preston, H. S. and A. Palloni. 1977. “Fine tuning Brass-type mortality estimates with data in ages of surviving children,” Population Bulletin of the United Nations, No. 10

SAS Institute Inc. 1999. SAS/MDDB Server Administrator's Guide, Cary, NC: SAS

Sobel, Michael E. 1995. “The analysis of the contingency tables,” in G. Arminger, C. C.

he Social and

Sullivan, J. M. 1972. “Models for the estimation of the probability of dying between birth

and exact age of early childhood,” Population Studies 26 (1): 79-97. Tabutin, D. and E. Akoto. 1992. “Socio-economic and cultural differentials in the

mortality of sub-Saharan Africa,” in Mortality and Society in sub-Saharan Africa, E. van de Walle, G. Pison, and M. Salad-Diakanda (eds.). Oxford: Clarendon Press, pp.32-64.

Cairo: CAPMAS, 165-190.

193.

Long, J. Scott. 1997. Regression Models for Categorical and Lim

enard, Scott. 1995. Applied Logistic Reg

Noumbissi, A. 1996. Méthodologies d’analyse de la mortalité des enfantsau Cameroun. Louvain-la-Neuve: Académia.

eston, H. S. and M. R. Haines. 1991. Fatal Years: Child

pp. 72-91.

Institute Inc.

Clogg, and M. E. Sobel (eds.), Handbook of Statistical Modeling for tBehavioral Sciences. New York: Plenum Press, pp. 251-310.

23

Multivariate Analysis Using Grouped Data

24

T Brass technique for determining childhood survivorship rates,” Population Studies 29(1): 97-107.

T of childhood mortality from retrospective reports of mothers,” Health Policy and Education 3: 1-

United Nations. 1985. Socioeconomic Differentials in Child Mortality in Developing

Countries. ST/ESA/SER. A/97. New York.

russell, J. 1975. “A re-estimation of the multiplying factors for the

russell, James and Samuel Preston. 1982. “Estimating the covariates

36.

Last Working Papers published

W. P. 1: Tukufu Zuberi and Ansom Sibanda, Fertility Differentials in sub-Saharan

Africa: Applying Own-Children Methods to African Censuses, January 1999. W. P 2: Herbert B. S. Kandeh, Using indigenous knowledge in the demarcation of the

enumeration areas: A case study of Banta Chiefdom, Moyamba District, Sierra Leone, January 1999.

W. P. 3: Etienne Van de Walle, Households in Botswana: An exploration, February 1999. W. P. 4: Amadou Noumbissi, Mortality analysis using Cameroon 1987 census micro

data, March 1999. W. P. 5: Monde Makiwane Fertility in rural South Africa: The case of Transkei, March

1999.

W. P. 6: Tukufu Zuberi and Akil K. Khalfani, Racial Classification and Colonial Population Enumeration in South Africa, March 1999.

W. P. 7: Tukufu Zuberi and Akil K. Khalfani, Racial Classification and The Census in

South Africa, 1911-1996, March 1999.

W. P. 8: Amson Sibanda and Tukufu Zuberi, Contempory Fertility Levels and Trends in South Africa: Evidence from Reconstructed Birth Histories, April 1999.

ACAP W.P. # 9: Etienne van de Walle. Where are the Children of Botswana? June 1999

ACAP W.P. # 10: Gideon Rutaremwa. Regional Differences in Infant and Child Mortality: A comparative Study of Kenya and Uganda, July 1999.

ACAP W.P. # 11: Ayaga A. Bawah and Tukufu Zuberi. Estimating Childhood Mortality from Census Data in Africa: The case of Zambia, August 1999.