33
Modelling Census Under-Enumeration A logistic regression perspective Bernard Baffour The views expressed in this paper are those of the author and do not necessarily represent those of the General Register Office for Scotland. Abstract Inevitably, not everyone in Scotland was enumerated in the 2001 Census. In Scotland, as in the rest of the UK, the raw Census figures were adjusted to compensate for the differential levels of under-enumeration that existed between different age-sex groups and areas. This was achieved by the One Number Census project which produced an individual level Census database, with synthetic individuals used to fully adjust for the differential levels of under-enumeration. Census under-enumeration stems from a wide variety of factors – for example demographic, socio-economic and household composition. An accurate identification and measurement of under-enumerated people allows the implementation of initiatives to maximise coverage. For example new designs of the census form can be trialled in households most susceptible to under-enumeration with the aim of making it easier to complete. Therefore this paper seeks to model under-enumeration, by ascertaining the underlying characteristics of under-enumerated areas, determined using the non-response rates. These characteristics are found via logistic regression modelling. Logistic regression analysis controls for all variables simultaneously during modelling and can therefore reveal which factors are significant predictors of the outcome variable in question. The results of the series of logistic regression models show that there are a variety of factors that affect under-enumeration. However, from a strategic viewpoint, an area will be susceptible to under-enumeration if it has a disproportionate number of people who are transient and/or deprived . The deprivation characteristics are income and health deprivation, non-car ownership, overcrowding and multi-occupancy. The transient characteristics are single (never married), young people (aged 25-29), living in privately rented accommodation and working in intermediate occupations. Some people – some single mothers, for example – could be construed as being both transient and deprived. 1

Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Embed Size (px)

Citation preview

Page 1: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Modelling Census Under-Enumeration A logistic regression perspective

Bernard Baffour The views expressed in this paper are those of the author and do not necessarily represent those of the General Register Office for Scotland.

Abstract

Inevitably, not everyone in Scotland was enumerated in the 2001 Census. In Scotland, as

in the rest of the UK, the raw Census figures were adjusted to compensate for the

differential levels of under-enumeration that existed between different age-sex groups

and areas. This was achieved by the One Number Census project which produced an

individual level Census database, with synthetic individuals used to fully adjust for the

differential levels of under-enumeration. Census under-enumeration stems from a wide

variety of factors – for example demographic, socio-economic and household

composition.

An accurate identification and measurement of under-enumerated people allows the

implementation of initiatives to maximise coverage. For example new designs of the

census form can be trialled in households most susceptible to under-enumeration with

the aim of making it easier to complete.

Therefore this paper seeks to model under-enumeration, by ascertaining the underlying

characteristics of under-enumerated areas, determined using the non-response rates.

These characteristics are found via logistic regression modelling. Logistic regression

analysis controls for all variables simultaneously during modelling and can therefore

reveal which factors are significant predictors of the outcome variable in question.

The results of the series of logistic regression models show that there are a variety of

factors that affect under-enumeration. However, from a strategic viewpoint, an area will

be susceptible to under-enumeration if it has a disproportionate number of people who

are transient and/or deprived. The deprivation characteristics are income and health

deprivation, non-car ownership, overcrowding and multi-occupancy. The transient

characteristics are single (never married), young people (aged 25-29), living in privately

rented accommodation and working in intermediate occupations. Some people – some

single mothers, for example – could be construed as being both transient and deprived.

1

Page 2: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

1. Introduction A Census provides a ‘snapshot’ of the entire national population and has the distinct advantage of providing an opportunity to gather a wealth of information about demographic, economic, social and housing characteristics of the population. Furthermore, unlike surveys, Censuses allow the analysis of these population characteristics right down to very small geographical areas. Population figures are also used by international bodies like the UN and EU as a basis for formulating international policy and world population estimates. Therefore, in most countries the requirement to hold a Census, or a least have a methodology in place to obtain up-to-date population figures, is enshrined in the constitutional legislation. In the UK, a decennial Census aims to provide an accurate up-to-date count of all the residents. The last Census, held in 2001, estimated that a response rate of 96.1% was achieved in Scotland; which compares favourably with other countries (Boyle and Dorling 2004). However what was particularly noticeable from the Census was the differential response rates that existed between different age-sex groups and areas (see Figure 1.1.1). This is known as “biased under-enumeration”, and the 2001 Census was the first Census that sought to measure this non-uniformity in response. The 2001 Census sought to accurately account for the differential levels of under-enumeration that existed between different age-sex groups and areas. This was achieved by the One Number Census project which produced an individual level Census database, with synthetic individuals used to fully adjust for the differential levels of under-enumeration. An important aspect of the project was an independent follow-up survey (the Census Coverage Survey, CCS) carried out after the official Census had taken place. The basic statistical model - called dual system estimation - assumes that a first enumeration, the Census, is followed by a second enumeration (the Census Coverage Survey) which subsequently takes place a few weeks later. Then matching techniques can be used to match records from the Census Coverage Survey to those from the Census. From the results of the matching (and dual system estimation), an estimate of the people missed by both the Census Coverage Survey and the Census can be made. A key assumption here is that the first and second processes are statistically independent. That is, the probability of enumeration in the second process given enumeration in the first is identical to the probability of enumeration in the second process given the person was missed in the first process. This identity assumption provides a basis for estimating the number that were not enumerated in either the first or second processes – referred to as the undercount or under-enumeration. Consequently, this population undercount is combined with the Census population to give an estimate of the true population. Therefore, the One Number Census methodology can be thought of as comprising three processes that work together. Firstly, the Census provides a population count. Secondly, the Census Coverage Survey provides an estimate of those missed by the Census. Finally dual system estimation provides an estimate of the people missed by both the Census and Census Coverage Survey. This produces a new estimate of the population that has been adjusted to account for the missing, by imputation.

2

Page 3: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Figure 1.1.1: 2001 Scottish Census response rates by age-sex group

From Figure 1.1.1, it can be seen that females had better response rates overall than males – with males aged 20-24 having the lowest coverage. Furthermore, it was found that, accounting for geographical region, this demographic spread of non-response is much wider, and there was an identification of a broader cross-section of the missed population. As an example, some parts of West Dunbartonshire with a high percentage of social housing stock had a large proportion of young males identified as missing. Census under-enumeration is a function of a wide range of factors including

i. demographic ii. social and economic iii. housing structure and composition iv. public attitudes and perceptions v. Census enumeration strategy vi. effectiveness of the Census under-enumeration measuring instrument

Assuming that (v) and (vi) are correctly accounted for by the Census methodology, the other factors can be investigated by looking at the associations and relationships that exist between these factors and under-enumeration. Under-enumeration is likely to be higher in certain areas with particular social, economic and demographic attributes. This paper aims to use logistic regression to model the Census under-enumeration by identifying the attributes that are significant predictors of under-count. The object of the analyses in this paper is two-fold:

1. To assess the association between under-enumeration and key deprivation and Census variables,

2. To develop one or more models that could be used to predict areas of under-enumeration based on the area attributes and their relationships with under-enumeration.

It is important to note that the analysis does not establish any causal relationships with under-enumeration. (Causality can only be determined by carefully controlled experimental study.) Nor does it look at the effect of public perceptions and attitudes to the Census.

3

Page 4: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

The analysis was carried out at a ward level because this level of geography is low enough to avoid any issues of heterogeneity in the population size of council areas, but large enough so that the proportion imputed was greater than zero. Scotland is divided into 1,222 wards. The ward boundaries are of similar size across Scotland with between 650 and 9,200 people in each ward. Figure 1.1.2 shows the distribution of residents in as a total and broken down by sex. Although the spread is large the distribution is roughly normal – most wards have a population size of between 3,000 and 5,000 people. Figure 1.1.2.: Distribution of resident population in wards

0

50

100

150

200

250

300

0500

10001500

20002500

30003500

40004500

Number of residents

Num

ber o

f war

ds

Males

Females

0

50

100

150

200

250

01000

20 0030 00

40 0050 00

60 0070 00

80 0090 00

Number of residents

Num

ber o

f war

ds

1.1 The Scottish Index of Multiple Deprivation Deprivation data was obtained from the Scottish Index of Multiple Deprivation (SIMD) developed by the Social Disadvantage Research Centre, at Oxford University, and commissioned by the Scottish Executive (for the full report see Noble et al, 2003). The SIMD measures the different aspects of deprivation with the view of combining them into a local index. It presents individual or household levels of deprivation in the form of an aggregate score for each ward in Scotland. The fundamental assumption is that individuals and families experiencing deprivation would cluster together in the same geographical location. The Scottish Index of Multiple Deprivation seeks to answer two questions:

a. How far do deprived individuals geographically cluster together? b. How are other non-deprived individuals affected by the deprivation?

The advantage of the SIMD is that it recognises the different dimensions of deprivation and provides a measure of the concentration of deprivation (independent of population size). 1.1.1 Social exclusion and deprivation Deprivation is defined as the lack of money or material possessions (Townsend, 1979, p.31). Social exclusion on the other hand is difficult to define: the Prime Minister described social exclusion as “a shorthand label for what happens when individuals or areas suffer from a combination of linked problems such as unemployment, poor skills, low incomes, poor housing, high crime environments, bad health and family breakdown” (Scottish Office, 1999).

4

Page 5: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

What is apparent is that social exclusion and deprivation are inextricably linked and manifest themselves in a variety of ways. However, the vagueness of their definitions means that statisticians and social scientists concentrate on material deprivation. It has been relatively easy to measure material deprivation in relation to diet, housing, environment, household facilities and employment. Because of the fact that deprivation and social exclusion do not seem to separately exist - i.e. a person who is deprived can be said to be (socially) excluded and vice versa – the two are used interchangeably in literature. Hence, intuitively one would expect social exclusion (henceforth, deprivation) to be connected with under-enumeration. The SIMD is therefore based on a concept that articulates multiple deprivation as an accumulation of single deprivations. Multiple deprivation is not a separate form of deprivation but a combination of more specific forms of deprivation which, in essence, are deemed more measurable. 1.1.2 The Domains of the Scottish Index of Multiple Deprivation The SIMD uses a process to produce a score for each of five “domains”, which are subsequently ranked across Scotland to give a relative picture of each dimension of deprivation. The SIMD domains are Income Deprivation, Employment Deprivation, Health Deprivation and Disability, Education, Skills and Training Deprivation and Geographical Access to Services. The domains consist of indicators that are determined to be directly relevant to the purpose of each domain. Care was taken to prevent duplication by exploring inter-correlations of variables within (and outwith) the domains, eliminating those which largely reproduce the more effective measures of the relevant deprivation. Therefore, the indicators combine to give a direct measure of the particular aspect of deprivation defined by the domain. There are other domains that could have been measured. The previous Scottish Area Deprivation Index, commissioned by the Scottish Office in 1998, included Housing Deprivation and Crime and Social Order domains. There is great need to measure housing deprivation (especially in the present economic climate of rising house prices) in order to help inform policy and target particular groups of people. However the varying approaches in the measurement of this domain – accessibility to housing, poor housing quality, overcrowding etc. – makes it difficult to quantify directly. Secondly there is currently no up-to-date data to address these issues at ward level. Crime and Social Order are important elements in measuring deprivation at the small area level, and can be used for local initiatives. Unfortunately small area data on crime or social order for the whole of Scotland is currently not available (although it is expected to be available in future as a result of the new Crime and Victimisation Survey). Secondly there are some aspects of crime (e.g. city centre vandalism) that may be difficult to attribute to a specific geographical area. In most cases these crimes can legitimately be attributed to areas of those who live elsewhere. It is important that attention is drawn to the domain of Access deprivation. It is a well known fact that poor geographical access to services is deemed to be an important aspect of multiple deprivation (Noble et al, 2000). However, this domain did not consider non-geographical issues - in particular language and cultural barriers. Secondly, car ownership (or more accurately, the lack of) was also not considered. A recent ONS report found that the proportion of people in households without a car who claimed to experience some difficulty in accessing services was nearly twice as great as those with

5

Page 6: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

a car (Ruston, 2002). Nevertheless, car ownership was omitted from the index because of its ambiguity – is the non-ownership of a vehicle (especially in cities) indicative of deprivation or a reflection of personal choice, a more environmentally friendly approach or good public transport? 1.1.3 Standardisation and weighting of the SIMD Combining the individual domain deprivation indices into a single multiple deprivation index needed a standardisation procedure. The procedure had to ensure that the most deprived wards were easily identified. Also it was required that each domain had a common distribution and did not conflate size with level of deprivation. The procedure that best meets the criteria is the exponential transformation. Weighting is particularly important when creating an index of multiple deprivation as a sum of individual indices of deprivation. The obvious way is to sum each of the individual indices. However research shows that some aspects of deprivation are more important than others (Gordon et al, 2000 and Noble et al, 2000). Thus the overall Scottish multiple index of deprivation is weighted as: SIMD = (0.3 x Income) + (0.3 x Employment) + (†) (0.15 x Education) + (0.15 x Health) + (0.1 x Access) 1.2 The Scottish 2001 Census The analysis of the factors that affect under-enumeration is given in Section 2 onwards. The remaining part of this section gives a description of the Census variables used presenting results from the 2001 Census. As previously stated, under-enumeration is a resultant of a wide-ranging variety of factors. The SIMD data provided some deprivation, social and economic attributes which could be analysed to ascertain how associated they were with under-enumeration. The rest were obtained from the Census database, available over the internet through the Scottish Census Results OnLine (SCROL) website (www.scrol.gov.uk). The Census results provided the demographic and household variables used in analyses. It is imperative to realise that the variables chosen from the Census are by no means exhaustive. It is very feasible that some variables not considered will be found to be predictive of whether or not an area is under-enumerated.

6

Page 7: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

A. Age Structure Figure 1.2.1

Bar chart of the age distribution

0%

20%

40%

60%

80%

100%

Scotland

Aberdeen City

Aberdeenshire

Angus

Argyll & Bute

Clackm

annanshire

Dum

fries & Gallow

ay

Dundee C

ity

East Ayrshire

East Dunbartonshire

East Lothian

East Renfrew

shire

Edinburgh, City of

Eilean Siar

Falkirk

Fife

Glasgow

City

Highland

Inverclyde

Midlothian

Moray

North Ayrshire

North Lanarkshire

Orkney Islands

Perth & Kinross

Renfrew

shire

Scottish Borders

Shetland Islands

South Ayrshire

South Lanarkshire

Stirling

West D

unbartonshire

West Lothian

75 & over 60 to 7445 to 5930 to 4425 to 29 20 to 2415 to 1910 to 14 5 to 90 to 4

Population: All people Source: Census Table KS02

The distribution of the age structure in each local authority area shows that in general there is an ageing population. The major cities – Glasgow, Edinburgh, Dundee and Aberdeen – have a high percentage of 20-24 year olds; this is perhaps because of people in university or on graduate schemes. Eilean Siar, Angus and Highland - areas with few or no higher educational institutions - have a lower proportion of 20-24 year olds. In Scotland as a whole there are migration peaks for males and females in their late teens and early twenties. Furthermore, the 2003 Registrar General’s Annual Report showed a marked net migration gain at ages 19 and 20, and net migration loss at ages 23 and 24. These patterns are consistent with an influx of students from the rest of the UK (and overseas) starting higher education, followed by a further move after graduation.

7

Page 8: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

B. Car Ownership Figure 1.2.2

Car Ownership

0

10

20

30

40

50

60

Scotland

Aberdeen C

ity

Aberdeenshire

Angus

Argyll & Bute

Clackm

annanshire

Dum

fries & Gallow

ay

Dundee C

ity

East A

yrshire

East D

unbartonshire

East Lothian

East R

enfrewshire

Edinburgh, C

ity of

Eilean S

iar

Falkirk

Fife

Glasgow

City

Highland

Inverclyde

Midlothian

Moray

North A

yrshire

North Lanarkshire

Orkney Islands

Perth &

Kinross

Renfrew

shire

Scottish Borders

Shetland Islands

South Ayrshire

South Lanarkshire

Stirling

West D

unbartonshire

West Lothian

Perc

enta

ge o

f hou

seho

lds

with

no

cars

Population: All households Source: Census Table KS17 Glasgow and Dundee have a high proportion of households with no cars. This could be due to the cost of parking in inner city areas, a much more environmentally friendly approach or good public transport - although, as remarked earlier, non ownership of a car is considered to be an indicator for access deprivation. C. Marital status and living arrangements Figure 1.2.3

Living Arrangements

0%

20%

40%

60%

80%

100%

ScotlandAberdeen C

ityAberdeenshireAngusArgyll & B

uteC

lackmannanshire

Dum

fries & Gallow

ayD

undee City

East AyrshireEast D

unbartonshireEast LothianEast R

enfrewshire

Edinburgh, City of

Eilean SiarFalkirkFifeG

lasgow C

ityH

ighlandInverclydeM

idlothianM

orayN

orth AyrshireN

orth LanarkshireO

rkney IslandsPerth & K

inrossR

enfrewshire

Scottish BordersShetland IslandsSouth A

yrshireSouth LanarkshireStirlingW

est Dunbartonshire

West Lothian

Widowed Divorced Separated Not living together - MarriedSingleCohabitingLiving together - Married

Population: All adults aged 16 and above Source: Census Table KS03

8

Page 9: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Figure 1.2.4 Marital Status

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Scotland

Aberdeen City

Aberdeenshire

Angus

Argyll & Bute

Clackm

annanshire

Dum

fries & Gallow

ay

Dundee C

ity

East Ayrshire

East Dunbartonshire

East Lothian

East Renfrew

shire

Edinburgh, City of

Eilean Siar

Falkirk

Fife

Glasgow

City

Highland

Inverclyde

Midlothian

Moray

North Ayrshire

North Lanarkshire

Orkney Islands

Perth & Kinross

Renfrew

shire

Scottish Borders

Shetland Islands

South Ayrshire

South Lanarkshire

Stirling

West D

unbartonshire

West Lothian

WidowedDivorcedSeparatedRe-marriedMarriedSingle

Population: All adults aged 16 and above Source: Census Table KS03 In general Scottish household structures have been changing. These changes are in line with the trend observed in most Western countries. One of the most significant changes in Scottish household composition over the last 20 years has been the rapid expansion of one-person households. In Scotland, the rate of one person households has increased from 21.9% in 1981 to 32.9% in 2001 (Miller, 2003). Furthermore, there has been an increase in the number of couples in cohabiting unions. This can be shown by comparing the proportion of single people in the two figures above - Figure 1.2.3 looks at the breakdown of the adult population by living arrangement, while Figure 1.2.4 shows the respective breakdown by marital status. Figure 1.2.4 shows how the inclusion of cohabitating couples reduces the number of single people at each ward. By using the marital status definition of ‘single’ we fail to account for co-habitation. This was one of the strengths of the One Number Census, in that part of its remit was to seek to accurately identify the more complex (less traditional, per se) household types. D. Single parenthood The number of single parent households in a ward has previously been used as an indicator of economic deprivation. In Scotland, 91.6% of single parent households (with dependent children) have a female head of household. This is similar to the UK standard. However if this is broken down by age, there are lot more young female single parents: 94.6% of lone parents aged 24 or under with a child under four years old are female. This could be attributed to the fact that Scotland has one of the highest rates of teenage motherhood in Europe – approximately four times the Western European average despite a decline in fertility rates across Europe (Miller, 2003). However the changing household composition in Scotland, and the UK as a whole, shows that this group can be used to identify the more complex households, mentioned earlier. As an example, because a single parent is eligible for higher social security benefits than a cohabiting couple (with dependents), the number of single parent households might have been over-stated by the Census. Such households differ from the conventional definition of cohabitation, although they have similar household

9

Page 10: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

compositions. It is best illustrated by an example. Suppose a single mother is in receipt of housing benefits. During the Census she has a boyfriend who lives with her, but might use his parents’ residence as a mailing address. When his parents were filling in the Census form, he was omitted from the usual household residents, since they consider him to be living with his girlfriend. At the same time he is omitted from the form at his girlfriend’s address. Evidence from previous censuses and the pilot studies prior to the actual Census taking place in 2001 showed that a much more rigorous approach to under-enumeration was needed. Therefore, the Census Coverage Survey was carried out on a much larger scale, with enumerators adept at clarifying any ambiguities during the interview process*. They were able to identify these young males and the results were fed into the imputation process. Placing these males into the appropriate households alters them, and in most cases changed single parent households to a new household structure with a female and a live-in boyfriend. E. Household Tenure Figure 1.2.5

Tenure

0%

20%

40%

60%

80%

100%

ScotlandAberdeen C

ityAberdeenshireAngusArgyll &

ButeC

lackmannanshire

Dum

fries & Gallow

ayD

undee City

East AyrshireEast D

unbartonshireEast LothianEast R

enfrewshire

Edinburgh, City of

Eilean SiarFalkirkFifeG

lasgow C

ityH

ighlandInverclydeM

idlothianM

orayN

orth AyrshireN

orth LanarkshireO

rkney IslandsPerth & K

inrossR

enfrewshire

Scottish BordersShetland IslandsSouth AyrshireSouth LanarkshireStirlingW

est Dunbartonshire

West Lothian

perc

enta

ge

OtherPrivate landlord/letting agencyHousing AssociationCouncil (Local Authority) HomesShared ownership Owns with a mortgage or loanOwns outright

Population: All households Source: Census Table KS18

Public sector (or social) housing has also been previously used as a measure of deprivation. However the SIMD did not have a housing deprivation domain, solely for the reason that there is a lot of ambiguity surrounding the accurate measurement of such a domain. Nonetheless, residents in social tenures are much more likely to suffer from one or more forms of deprivation. But although there could be a confounding effect†, it is still possible that tenure is an independent predictor of under-enumeration, in its own right.

* An intriguing discovery of the CCS process was that it identified a large number of babies that were missed during the Census. † A confounding effect makes it difficult to interpret the direction of an association. It results when a predictor variable is associated with both the outcome and other predictors.

10

Page 11: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Figure 1.2.6 Tenants in each Local Authority

0

5

10

15

20

25

30

35

Scotland

Aberdeen City

Aberdeenshire

Angus

Argyll & Bute

Clackm

annanshire

Dum

fries & Gallow

ay

Dundee C

ity

East Ayrshire

East Dunbartonshire

East Lothian

East Renfrew

shire

Edinburgh, City of

Eilean Siar

Falkirk

Fife

Glasgow

City

Highland

Inverclyde

Midlothian

Moray

North Ayrshire

North Lanarkshire

Orkney Islands

Perth & Kinross

Renfrew

shire

Scottish Borders

Shetland Islands

South Ayrshire

South Lanarkshire

Stirling

West D

unbartonshire

West Lothian

perc

enta

ge Council (Local Authority) HomesHousing AssociationPrivate landlord/letting agencyOther

Population: All people who are tenants - i.e. resident in Council homes or other rented accommodation. Source: Census Table KS18

The household tenure variable is also important in determining the mobility of residents in wards. People who rent from private landlords or letting agencies are much more likely to move house. This section of the population would include not only students but also young professional adults, particularly at a time when the housing market, especially in cities, is highly competitive, making it difficult for first-time buyers to gain a foothold on the property ladder. So a lot of young professionals, for example nurses, are resident in council housing (and other social tenures) because of their affordability. However since residents in council accommodation are much more likely to be deprived, the public sector housing variable can be used to model under-enumeration. F. Population density The population density (which measures the number of residents per hectare) is a further important variable. As the enumeration districts are grouped based on contiguous postcodes, it is important to know how many residents there are, in order to evenly allocate the workloads of each Census enumerator. During the preliminary analysis, carried out prior to embarking on modelling the under-enumeration, it was found that a proxy measure of population density was the lowest floor level of the household or dwelling. Figures 1.2.7 and 1.2.8 show how areas that are densely populated have a larger proportion of tenements. The floor level was also found to be easier to interpret during the analysis since it is a categorical variable – with three distinct categories: first floor, second floor and third floor or higher. On the other hand, the population density is a continuous variable and cannot be easily split into distinct non-overlapping groups. The local authority with the lowest density is Highland with a population density of 0.08, while Glasgow City has the highest with a population density of 32.93. This difference between the least and most densely populated areas is much more pronounced when comparing at a ward level – parts of

11

Page 12: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

inner Glasgow have densities in excess of 100 – and leads to some interpretational difficulties. Figure 1.2.7

Population Density

0

5

10

15

20

25

30

35

Scotland

Aberdeen C

ityA

berdeenshireA

ngusA

rgyll & ButeC

lackmannanshire

Dum

fries & Gallow

ayD

undee City

East A

yrshireE

ast Dunbartonshire

East Lothian

East R

enfrewshire

Edinburgh, C

ity ofE

ilean Siar

FalkirkFifeG

lasgow C

ityH

ighlandInverclydeM

idlothianM

orayN

orth Ayrshire

North Lanarkshire

Orkney Islands

Perth &

Kinross

Renfrew

shireS

cottish BordersS

hetland IslandsS

outh AyrshireS

outh LanarkshireS

tirlingW

est Dunbartonshire

West Lothian

Density (Number of persons per hectare)

Population: All people (area measured in hectares) Source: Census Table UV02 Figure 1.2.8

Lowest floor level

0

0.05

0.1

0.15

0.2

0.25

Scotland

Aberdeen City

Aberdeenshire

Angus

Argyll & Bute

Clackm

annanshire

Dum

fries & Gallow

ay

Dundee C

ity

East Ayrshire

East Dunbartonshire

East Lothian

East Renfrew

shire

Edinburgh, City of

Eilean Siar

Falkirk

Fife

Glasgow

City

Highland

Inverclyde

Midlothian

Moray

North Ayrshire

North Lanarkshire

Orkney Islands

Perth & Kinross

Renfrew

shire

Scottish Borders

Shetland Islands

South Ayrshire

South Lanarkshire

Stirling

West D

unbartonshire

West Lothian

prop

ortio

n

FirstSecondThird floor or higher

Population: All households Source: Census Table UV61

12

Page 13: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Figure 1.2.8(a)

Relationship between floor level and population density (at a ward level)

0

20

40

60

80

100

120

140

160

180

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

Proportion resident on third floor or higher

Popu

latio

n D

ensi

ty (n

umbe

r of p

erso

ns p

er h

ecta

re)

Population: All people (area measured in hectares) Source: Census Tables UV02 and UV61 Figure 1.2.8(a) depicts the linear relationship between floor level and population density much more clearer. This relationship is much more pronounced at a local authority level, where the correlation between the number of third floor or higher households and the population density is 0.863. G. Educational Qualifications The qualification variable details the highest educational attainment of each individual. This is then aggregated at both the ward level and local authority level. The higher levels of educational qualification are Groups 3 and 4 and cover residents who have attained an HNC or higher. The highest educational attainment is very similar to the Education domain of the Scottish Index of Multiple Deprivation. The purpose of the Education domain is to measure how the key educational characteristics of a local area contribute to the overall level of deprivation and disadvantage. Previous measures have not been able to differentiate between social and educational deprivation (Noble et al, 2000 and Carstairs and Morris, 1992). Furthermore, measurement of education deprivation does not focus only on the adult population but also on school age children. Noble et al (2000) showed that the pupil performance of an area has a contributory effect on the educational and skills deprivation. Since the educational attainment variable in the Census only considers adults aged 16 and over, it is a better proxy of educational deprivation when it comes to ascertaining the relationship between wards that were education deprived and under-enumerated. This is intuitive, as adults (in particular between ages 20-29) were more likely to be missed during the Census, than children.

13

Page 14: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Figure 1.2.9 Highest educational qualification attainment

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Scotland

Aberdeen City

Aberdeenshire

Angus

Argyll & Bute

Clackm

annanshire

Dum

fries & Gallow

ay

Dundee C

ity

East Ayrshire

East Dunbartonshire

East Lothian

East Renfrew

shire

Edinburgh, City of

Eilean Siar

Falkirk

Fife

Glasgow

City

Highland

Inverclyde

Midlothian

Moray

North Ayrshire

North Lanarkshire

Orkney Islands

Perth & Kinross

Renfrew

shire

Scottish Borders

Shetland Islands

South Ayrshire

South Lanarkshire

Stirling

West D

unbartonshire

West Lothian

Perc

enta

ge

Group 4Group 3 Group 2Group 1Group 0

Population: All people aged 16-74 Source: Census Table UV25

Figure 1.2.9 shows the breakdown of the highest educational attainment by local authority. Glasgow has the highest proportion of individuals with no qualifications, while there are more people in Edinburgh with a degree. H. National Statistics Socio-economic Classification The National Statistics socio-economic classification [NS-SeC] has been introduced by the government to replace social class based on occupation and socio-economic groups. It is an occupational based classification but has rules to provide coverage of the whole adult population. This classification is internationally accepted since it is based on the International Labour Organisation (ILO) employment definitions. It is also conceptually clear and has been validated both in criterion terms and in construction terms as a good predictor of health and educational outcomes (Office of National Statistics, 2002). The NS-SeC categories distinguish different positions (not persons) as defined by social relationships in the work place. People who are temporarily out of work can still be classified, so ‘unemployed’ in this case refers to people who have never worked or are long-term unemployed.

14

Page 15: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Figure 1.2.10 Occupation - based on the National Statistics socio-economic classification

0%

20%

40%

60%

80%

100%

Scotland

Aberdeen City

Aberdeenshire

Angus

Argyll & Bute

Clackm

annanshire

Dum

fries & Gallow

ay

Dundee C

ity

East Ayrshire

East Dunbartonshire

East Lothian

East Renfrew

shire

Edinburgh, City of

Eilean Siar

Falkirk

Fife

Glasgow

City

Highland

Inverclyde

Midlothian

Moray

North Ayrshire

North Lanarkshire

Orkney Islands

Perth & Kinross

Renfrew

shire

Scottish Borders

Shetland Islands

South Ayrshire

South Lanarkshire

Stirling

West D

unbartonshire

West Lothian

perc

enta

ge

UnclassifiableStudentsUnemployedRoutineIntermediateProfessional

Population: All Adults aged 16-74 Source: Census Table KS14a It is important to realise that the NS-SeC cannot solely explain what socio-economic differences mean. Like the Index of Deprivation, not everything can be explained by what a classification directly measures. However it can clearly show any relationships that exist. Fundamentally, it is of interest to determine if the different socio-economic categories have different relationships with the level of imputation. I. Occupancy Rating

Figure 1.2.11

Occupancy rating

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Scotland

Aberdeen City

Aberdeenshire

Angus

Argyll & Bute

Clackm

annanshire

Dum

fries & Gallow

ay

Dundee C

ity

East Ayrshire

East Dunbartonshire

East Lothian

East Renfrew

shire

Edinburgh, City of

Eilean Siar

Falkirk

Fife

Glasgow

City

Highland

Inverclyde

Midlothian

Moray

North Ayrshire

North Lanarkshire

Orkney Islands

Perth & Kinross

Renfrew

shire

Scottish Borders

Shetland Islands

South Ayrshire

South Lanarkshire

Stirling

West D

unbartonshire

West Lothian

perc

enta

ge

rating -2 or less rating -1 rating 0rating +1rating +2 or more

Population: All households Source: Census Table UV59

15

Page 16: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Occupancy rating is a measure of under-occupancy and overcrowding. A positive measure refers to the number of rooms in addition to the minimum requirements. A negative measure refers to the number of rooms short of the minimum, and gives some indication of over-crowding. It is expected that in wards with a large number of overcrowded households, there is a greater likelihood of people being missed. J. Multi-occupancy Figure 1.2.12

Proportion of shared dwellings in each council area

0.000%

0.020%

0.040%

0.060%

0.080%

0.100%

0.120%

0.140%

Scotland

Aberdeen City

Aberdeenshire

Angus

Argyll &

Bute

Clackm

annanshire

Dum

fries & Gallow

ay

Dundee C

ity

East A

yrshire

East Dunbartonshire

East Lothian

East Renfrew

shire

Edinburgh, C

ity of

Eilean Siar

Falkirk

Fife

Glasgow

City

Highland

Inverclyde

Midlothian

Moray

North Ayrshire

North Lanarkshire

Orkney Islands

Perth & Kinross

Renfrew

shire

Scottish B

orders

Shetland Islands

South Ayrshire

South Lanarkshire

Stirling

West D

unbartonshire

West Lothian

perc

enta

ge

Population: All dwellings Source: Census Table UV55 Residents who live in multi-occupied households are often hard to enumerate. The Census defines this group as being resident in a shared dwelling. More precisely, a household’s accommodation [or space] is defined as being a shared dwelling if:

1. it has accommodation type “part of a converted or shared house”, 2. not all rooms (including toilet/bathroom) are behind a door that only that

household can use, and 3. there is at least one other household space at the same address with which it

can be combined to form the shared dwelling.

If any of these conditions is met, the household space forms a shared dwelling. Therefore a dwelling can consist of one household space (an unshared dwelling) or two or more household spaces (a shared dwelling). A shared dwelling is distinct from university halls of residence, sheltered accommodation, hotels etc., which are known as communal establishments. Most shared dwellings are intended for temporary occupation (e.g. a group of bedsits) and residents are likely to be transient.

16

Page 17: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

2. Using logistic regression modelling to investigate Census under-enumeration

The main objective of the analysis in this paper is to find a multivariable logistic regression model that can investigate the independent effect of specific deprivation and Census variables on the proportion of residents imputed in a ward (i.e. synthetic individuals). The advantage of logistic regression over other forms of statistical regression is that it can be used to predict group membership of a binary outcome. The analysis was carried out in the statistical analysis package SAS, and the output produced first gives the response profile (see SAS Output). The response profile refers to the outcome variable, which is defined to be whether a person is imputed (an event) or not (a nonevent)‡. SAS Output Response Profile Ordered Binary Total Value Outcome Frequency 1 Event 198987 2 Nonevent 4863024 5062011

Logistic regression allows a discrete outcome, such as group membership, to be predicted from a set of variables that may be continuous, discrete, binary or a mixture of any of these. It results in a model that selects only the variables that are significant predictors of the outcome variable. The outcome variable in logistic regression is binary and takes the value 1 with a probability of success (an event) p or the value 0 with probability of failure (nonevent) 1-p. Another advantage of logistic regression analysis is that it makes no assumption about the distribution of the independent variables. They do not have to be normally distributed, linearly related or of equal variance. The difference between logistic and linear regression is reflected both in the choice of a parametric model and the underlying assumptions. Once this difference is accounted for, the methods employed in logistic regression analysis follow the same general principles as linear regression. After fitting a model, the emphasis shifts from the computation and assessment of significance of the estimated coefficients to the interpretation of their values. So the power of a logistic model is assessed by examination of the classificatory tables. The higher the percentage correct of positive cases (in this paper, accurately predicting the number imputed and not imputed) the better the predictive power of the model. The strength of this type of analysis lies in the fact that an entire set of variables can simultaneously be taken into account. The results then show which variables are independent predictors of the binary variable. For example, it might be shown in the preliminary analyses that the floor level is highly correlated to under-enumeration, and wards with a high proportion of residents on the 3rd floor or above have a high number of imputed persons. However, when account is taken of the fact that residents of the 3rd

‡ The population of Scotland after the 2001Census was 5,062,011 – this was made up of 4,863,024 people counted by the Census and an additional 198,987 people imputed after the Census Coverage Survey.

17

Page 18: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

floor or above are much more likely to be resident in flats, and much more likely to be renting, the impact of floor level on the likelihood of a person being imputed is removed. Any impact that floor level has on under-enumeration is better accounted for by other variables and it can be concluded that it is not floor level that is relevant but tenure and dwelling type. It might be true, however, that floor level is a good predictor of tenure and dwelling and through its effect on these variables it (indirectly) affects under-enumeration. Odds ratios The relationship between the predictor and outcome variables is not a linear function in logistic regression. Instead, the logistic regression function is used, which is:

kk xxxp

ppLogit ββββ ++++=⎥⎦

⎤⎢⎣

⎡−

= ...1

ln)( 22110

and (‡)

)...(

)...(

22110

22110

1 kk

kk

xxx

xxx

eep ββββ

ββββ

++++

++++

+=

Thus the interpretation of any fitted model relies on an understanding of odds ratios. These odds ratios are derived from the estimated coefficients in the model. Odds can be understood as the ratio of the probability of an event occurring over the probability of it not occurring. An odds ratio of 1 indicates equal odds in the two groups, therefore when the 95% confidence interval of the odds ratio estimates contains 1, no inference can be made as to the direction of the change in odds. Thus such a variable is not considered a useful predictor in the logistic model. Overdispersion Over-dispersion occurs when binary data exhibit variances larger than those permitted by the binomial model. It is usually caused by clustering and lack of independence. It is therefore sometimes referred to as ‘extra variation’. It seems to be the norm and not the exception (McCullagh and Nelder, 1989). For a correctly specified model the Pearson chi-squared statistic (and the deviance) divided by the degrees of freedom should be approximately equal to one. When their values are much larger, the assumption of binomial variability may not be valid and the data exhibits over-dispersion. In effect, when data is over-dispersed, terms in the model appear more significant than they actually are. The preliminary analysis carried out prior to the logistic modelling suggested that over-dispersion was a problem that existed in this particular data set. Similar individuals are likely to reside in the same area. However this similarity between individuals in a ward implies that the clustering is not properly explained by the model. There are other factors that the modelling process cannot take into account which affect the variability – e.g. geography, terrain, and enumerator effect. Therefore, in the logistic regression analyses, in order to correct for over-dispersion, the covariance matrix was multiplied by a heterogeneity factor.

18

Page 19: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

3. Preliminary Analysis 3.1 The outcome variable: proportion imputed in each ward Figure 3.1.1

Proportion imputed in each local authority

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

Aberdeen City

Aberdeenshire

Angus

Argyle & Bute

Clackm

annanshire

Dum

fries & Gallow

ay

Dundee C

ity

East Ayrshire

East Dunbartonshire

East Lothian

East Renfrew

shire

Edinburgh City

Eilean Siar

Falkirk

Fife

Glasgow

City

Highland

Inverclyde

Midlothian

Moray

North Ayrshire

North Lanarkshire

Orkney Isles

Perth & Kinross

Renfrew

shire

Scottish Borders

Shetland Isles

South Ayrshire

South Lanarkshire

Stirling

West D

unbartonshire

West Lothian

Prop

ortio

n im

pute

d

Figure 3.1.2

Distribution of imputed persons in each ward

0

20

40

60

80

100

120

140

160

180

0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05 0.055 0.06 0.065 0.07 0.08 0.09 0.1 0.12 0.14 0.16 0.18 0.2

Proportion of persons Imputed

Num

ber o

f war

ds

As shown above, the outcome variable is skewed, and the number of people imputed varied from area to area. Therefore any analysis that assumes normality could produce flawed results. Logistic regression analysis makes no distributional assumptions and is thus a suitable modelling tool in giving an insight into what attributes are more or less likely to predict event outcome – in this case whether or not a person is imputed.

19

Page 20: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

3.2 The independent variables The analysis in this paper examined data collected from two sources – the Scottish Indices of Multiple Deprivation and the 2001 Census. The Census edit and imputation process - which identified the “synthetic individuals” - was based on a Hard to Count (HtC) Index derived to account for the disproportionate distribution of under-enumerated households. Therefore, the variables that were used to construct the HtC Index could unduly influence the logistic modelling procedure. To allow for that influence, the Census variables were divided into:

● Hard to Count variables ● Other Census variables

The Hard to Count variables shared – shared dwellings rent – private rented accommodation over_crowd – over crowded households students – students as defined by the National Socio Economic classification _0to4_ - residents aged 0-4 _20to24_ - residents aged 20-24 _25to29_ - residents aged 25-29 over_85 – residents aged over 85 Other Census variables Group_0 – ‘no’ educational qualifications council – Council accommodation house_assoc – other social rented accommodation sin_par_f – female single parents NS_SeC_3 – intermediate occupations no_car – non owner-ship of a car/van floor – lowest floor level single – single (never married) density – population density A series of logistic regression analyses were carried out on the three sets of variables - Deprivation, Hard to Count and Other Census - to yield three logistic models. Initially univariate models - in which the effects of the individual variables are considered separately with the view of determining how significant they were as predictors of the outcome variable – are fitted. Subsequently multivariate models are fitted. Lastly, a final model is developed that looks at which variables – pre-determined from the above - are significant and independent predictors of the binary outcome variable: whether or not a person is imputed.

20

Page 21: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

4. Results The results of the logistic analyses are presented below. For each model I have presented the Maximum Likelihood parameter estimates and the odds ratios of these parameter estimates. Like other regression techniques, a p-value less than 0.05 represents a significant result. Additionally, since the relationship between the predictor and outcome variables is not a linear, I have presented each model as a logistic equation as follows.

kk xxxp

ppLogit ββββ ++++=⎥⎦

⎤⎢⎣

⎡−

= ...1

ln)( 22110

Forgetting about the logistic transformation on the left-hand side, the right-hand side has an intercept term β0 and slope terms β1 , β2 , … , βk – the same as in linear regression. For reasons of parsimony, early on in the model selection process it was decided not to consider interaction effects between the independent variables. Interaction effects are difficult to interpret, and can introduce additional complexity to the analysis. In some cases although a model with an interaction effect might improve the fit the data, in comparison with a model without, the benefit of the improved model fit is not significant when other factors are taken into account. The objective of the logistic analyses in this paper was to find the simplest model that uncovered which independent factors were significant predictors of number of people imputed in a ward. 4.1 Deprivation Models Table 4.1.1: Multiple Deprivation Index

Analysis of Maximum Likelihood Estimates Standard Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 -3.5951 0.00388 859150.595 <.0001 SIMD 1 0.0154 0.000109 19971.7635 <.0001 Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits SIMD 1.016 1.015 1.016

Model 1

SIMDp

p 015.0595.31

ln +−=⎥⎦

⎤⎢⎣

⎡−

The slope of 0.015 represents the change in log odds for one unit increase in the SIMD score. Similarly, its odds ratio of 1.016 is the ratio for one unit change in the SIMD score. It would be more significant to look at increments other than one unit. For example, an increase in SIMD by 10 units increases the imputation probability by 17%§.

§ Assuming that the logit is linear in the continuous covariate, then (1.016) 10 = 1.17

21

Page 22: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Looking at the component domains of the SIMD score, the model is:

Model 2

accesshealtheducemployincomep

p 294.0236.0094.0027.0034.0453.31

ln −+−−+−=⎥⎦

⎤⎢⎣

⎡−

All the deprivation variables entered into the model yield significant probabilities, implying that there is a significant relationship. As it is assumed that the relationship between log odds of an event and the deprivation score is linear, this model tells us that a 10-unit increment in the income deprivation score (holding everything else fixed) increases the probability of imputation by 40%. Table 4.1.2: Singular Deprivation Indices Analysis of Maximum Likelihood Estimates Standard Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 -3.4529 0.0873 1564.8121 <.0001 income 1 0.0336 0.00773 18.8580 <.0001 employ 1 -0.0265 0.00882 9.0213 0.0027 educ 1 -0.0941 0.0403 5.4588 0.0195 health 1 0.2355 0.0609 14.9544 0.0001 access 1 -0.2939 0.0285 106.6630 <.0001 Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits income 1.034 1.019 1.050 employ 0.974 0.957 0.991 educ 0.910 0.841 0.985 health 1.265 1.123 1.426 access 0.745 0.705 0.788

It is peculiar that the education and employment deprivation domain maximum likelihood estimates are negative, although they are positively correlated with the income domain. A likely explanation could be that the assumption of linearity might be flawed and this is investigated later on. 4.2 Hard to Count Index Model The Hard to Count Index was constructed from Census variables known to be associated with under-enumeration. Under-enumeration was also found to be unevenly spread across certain groups. Since this was the basis of the Census Coverage Survey design, these variables are likely to be significant predictors of the percentage imputed. The following variables were used in the logistic regression analysis:

• Shared dwelling • Private rented accommodation • Percentage in the following age groups: under 4’s, 20-24, 25-29, over 85’s • Over-crowded households • Students

Here ‘shared dwellings’ is the variable that looks at whether people in dwellings occupied by more than one household will have a high probability of being under-enumerated.

22

Page 23: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Of the variables entered into the initial model, only private rented accommodation, multi-occupancy (shared dwellings), over-crowded households and people aged 25-29 – indicate a significant relationship based on the maximum likelihood estimates. Table 4.2.1: Hard to Count Index variables Standard Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 -2.9510 0.0821 1291.5114 <.0001 shared 1 0.0145 0.00553 6.8765 0.0087 rent 1 0.0157 0.00393 15.9237 <.0001 over_crowd 1 -1.8450 0.1408 171.7077 <.0001 _25to29_ 1 0.0331 0.00900 13.5404 0.0002 Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits shared 1.015 1.004 1.026 rent 1.016 1.008 1.024 over_crowd 0.158 0.120 0.208 _25to29_ 1.034 1.016 1.052

Model 3

_2925_033.0_845.1016.0015.0951.21

ln tocrowdoverrentsharedp

p+−++−=⎥

⎤⎢⎣

⎡−

Note how over_crowd has a large negative coefficient – this would give cause for further investigation of the linearity assumption. Fitting univariate models rarely provides an adequate analysis of the data as the independent variables are usually associated with each other and may have different distributions within the levels of the outcome variable. A multivariable analysis gives more comprehensive modelling of the data. In multivariate logistic modelling, each estimated coefficient provides an estimate of the log odds adjusting for all other variables included in the model. It is therefore of interest to investigate the changes in the odds ratio estimates. The results shows that while the number of students and persons aged 20-24 are significant predictors of under-enumeration when using the single variable model, at the multivariable level they are not. Students are known to be transient and hence more difficult to enumerate. However, most students are likely to live in (private) rented accommodation. Thus the inclusion of the variable ‘rent’ renders ‘students’ redundant. The same applies to the variable ‘_20to24_’. So knowing the number of residents in rented accommodation is a significant predictor of students and persons aged 20-24. However ‘_25to29_’ is still predictive in the multivariable model perhaps because people in this age group are much more likely to be on the first rung of the property ladder, and live alone.

23

Page 24: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

4.3 The remaining Census variables Table 4.3.1: Other Census variables

Analysis of Maximum Likelihood Estimates Standard Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 -3.0960 0.3475 79.3967 <.0001 Group_0 1 -0.7843 0.2664 8.6646 0.0032 house_assoc 1 0.00707 0.00226 9.8328 0.0017 sin_par_f 1 -1.7026 0.3924 18.8264 <.0001 NS_SeC_3 1 0.0472 0.00719 43.1444 <.0001 no_car 1 0.0217 0.00239 82.8764 <.0001 single 1 0.0169 0.00356 22.5240 <.0001 Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits Group_0 0.456 0.271 0.769 house_assoc 1.007 1.003 1.012 sin_par_f 0.182 0.084 0.393 NS_SeC_3 1.048 1.034 1.063 no_car 1.022 1.017 1.027 single 1.017 1.010 1.024

Model 4

single02.0_02.03__05.0_sin_70.1_01.00_08.010.31

ln +++−+−−=⎥⎦

⎤⎢⎣

⎡−

carnoSeCNSfparassochouseGroupp

p

In the above model, the coefficients of sin_par_f and Group_0 go in the opposite direction to the other variables. The correlation plots give no clue why this should be the case, so an investigation into the assumption of linearity might give an explanation. The inclusion of NS_SeC_3 as a predictor of under-enumeration is surprising at first. However on close inspection there are a number of possible explanations. Throughout this analysis, it has become obvious that transience has a strong bearing on under-enumeration. NS_SeC_3 is the socio-economic classification that caters for the working population in intermediate occupations. This group includes clerical, sales and administrative staff – in particular, call-centre workers. With the increasing competitiveness of graduate schemes, many students on graduation take temporary work until they find suitable professional jobs. On closer inspection, the majority of these intermediate workers are in the university cities. There may however be two effects in existence because university cities also serve as financial centres and would therefore attract sales and administrative jobs (also classified as “intermediate”). A second explanation of this could be the Holiday-Maker Visa Scheme. This scheme allows foreign nationals to work in the UK for up to two years, although most stay for an average of six months. Most people on these schemes find jobs in the “intermediate” sector - this reinforces the link between migration and under-enumeration. The NS_SeC_3 “intermediate” variable in effect picks up areas where there is a large mobile young population. Mobility increases the probability of non-contact by Census enumerators – though sending out the Census form by post may reduce this possibility. NS_SeC_3, therefore, is an important variable in its own right because, even with ‘single’ in the model, it remains significant.

24

Page 25: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

4.4 Further investigation into the assumption of linearity The assumption of linearity simplifies the regression process and allows the calculation of the change in log-odds for any increase other than a unit increment. This provides a useful interpretation for independent variables measured on a continuous scale - a feature of this particular data set. By fitting all the independent variables, particularly the deprivation variables, as continuous it is being assumed that this relationship is linear. Based on the previous model results, there is reason to question the assumption of linearity for Group_0, the educational attainment variable, and income and access, the sub-indices of deprivation. The results from the preliminary analysis suggest that the relationship between these variables and the imputation probability is non-linear. If the variables are fitted in their present format as continuous, the multivariable model results would not yield useful conclusions. Therefore, based on a quartile design variable analysis, Group_0, income and access are fitted as categorical variables. Quartile design analysis creates categorical variables with 4 levels using three cutpoints based on the distributional quartiles. Thus the continuous variables are replaced by these (new) 4-level categorical variables. The lowest quartile serves as the reference group; this category is used as a comparison group for all the other groups, and hence results can be interpreted relative to this reference group. For the deprivation model, there is justification in using the sub-indices of multiple deprivation because it shows a more careful pattern of differences that have a better ability to predict non-response. Therefore the Access and Income singular deprivation indices are fitted as categorical variables (the remaining domains are left as continuous). In doing so, it is possible to ascertain which deprivation sub-indices have a comparatively better predictive ability. Table 4.4.1(a) Type III Analysis of Effects Wald Effect DF Chi-Square Pr > ChiSq income 3 22.4409 <.0001 access 3 109.0892 <.0001 health 1 47.9210 <.0001 employ 1 1.0548 0.3044 educ 1 0.1146 0.7350

Table 4.4.1(b) Type III Analysis of Effects Wald Effect DF Chi-Square Pr > ChiSq income 3 22.4496 <.0001 access 3 109.1964 <.0001 health 1 82.2696 <.0001

The Type III Analysis of Effects (shown above in Table 4.4.1(a) and (b)) tests all factors in the model as if each one were the last one included in the model (i.e. this controls for all other effects). This indicates that, with the other indices in the model, the employment and education indices are rendered redundant. The tables show that the employment and education sub-indices have the least significance, after controlling for other effects. While it is widely acknowledged that unemployment is a good predictor of under-enumeration, a person with a low income is

25

Page 26: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

much more likely to be unemployed (although the reverse is not necessarily true). In this respect, employment deprivation is said to be confounded as it is associated with both the outcome variable (imputation probability) and a primary independent variable (low income). The following table shows the how the Other Census model changes when Group_0 is fitted as a categorical variable.

Table 4.4.2: Type III Analysis with Group_0 fitted as a class covariate Type III Analysis of Effects Wald Effect DF Chi-Square Pr > ChiSq Group_0 3 7.4160 0.0598 sin_par_f 1 20.7668 <.0001 NS_SeC_3 1 55.2018 <.0001 no_car 1 83.2786 <.0001 single 1 49.8101 <.0001

When fitted as a class variable, Group_0 becomes non-significant. Also notice how house_assoc – the variable looking at social tenures not administered by the council – is absent from the model. The reason for this is that the odds ratio of the covariate in this model ends up being very close to 1 which is indicative of no effect**. Hence the final model is fitted with sin_par_f, NS_SeC_3, no_car and single. 4.5 Final Modelling Stage Having determined which amalgamated Deprivation, Hard to Count and Other Census variables are associated with under-enumeration, the final modelling stage seeks to identify which independent variables are significant predictors, controlling for all other variables. A series of logistic regression analyses were carried out and the results are shown below. Table 4.5.1(a) Census and Multiple Deprivation Variables: Type III Analysis Type III Analysis of Effects Wald Effect DF Chi-Square Pr > ChiSq income 3 17.5617 0.0005 access 3 5.0613 0.1674 health 1 37.7709 <.0001 sin_par_f 1 13.4854 0.0002 NS_SeC_3 1 111.6820 <.0001 no_car 1 34.6778 <.0001 single 1 12.7539 0.0004 shared 1 8.8587 0.0029 rent 1 21.7330 <.0001 over_crowd 1 31.2054 <.0001 _25to29_ 1 6.2902 0.0121

Access has a non-significant p-value; hence another model is fitted with the sub-index removed. The analysis yields the model below – the best model. Since income and access have been fitted as categorical variables they have

** The odds ratio estimate of 1.035, in the univariate model, drops down to 1.005 in the multivariate model - i.e. with the addition of other variables in the model, the differences between areas with a high number of housing association tenants and those without becomes non-significant.

26

Page 27: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Table 4.5.1(b) Census and Multiple Deprivation Variables: Type III Analysis Type III Analysis of Effects Wald Effect DF Chi-Square Pr > ChiSq income 3 16.7498 0.0008 health 1 40.3774 <.0001 sin_par_f 1 15.1413 <.0001 NS_SeC_3 1 107.5792 <.0001 no_car 1 44.0331 <.0001 single 1 10.1418 0.0014 shared 1 7.7653 0.0053 rent 1 31.9308 <.0001 over_crowd 1 34.4162 <.0001 _25to29_ 1 6.7151 0.0096

Table 4.5.2 Final Model††

Analysis of Maximum Likelihood Estimates Standard Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 -4.4200 0.3828 133.2996 <.0001 income 1 1 -0.0786 0.0537 2.1453 0.1430 income 2 1 -0.0436 0.0628 0.4817 0.4876 income 3 1 0.1020 0.0805 1.6074 0.2049 health 1 0.2427 0.0382 40.3774 <.0001 sin_par_f 1 -1.3972 0.3591 15.1413 <.0001 NS_SeC_3 1 0.0738 0.00711 107.5792 <.0001 no_car 1 0.0177 0.00267 44.0331 <.0001 single 1 0.0114 0.00359 10.1418 0.0014 shared 1 0.0121 0.00436 7.7653 0.0053 rent 1 0.0229 0.00405 31.9308 <.0001 over_crowd 1 1.4946 0.2548 34.4162 <.0001 _25to29_ 1 0.0224 0.00863 6.7151 0.0096

The final model reveals which Deprivation and Census variables are significant and independent predictors of under-enumeration. When all the variables found to be significant in Models 1- 4 are placed in the regression model simultaneously, any association that Group_0 and access have with under-enumeration is better accounted for by the remaining variables. Final Model

fparcrowdoverrenttocarnoSeCNS

sharedhealthincomeincomeincomep

p

_sin_40.1_50.102.0_2925_02.0_02.03__07.0

01.0single01.024.010.004.008.042.41

ln 321

−++++

+++++−−−=⎥⎦

⎤⎢⎣

⎡−

Notice how the over_crowd coefficient changes from a large negative (-1.84) from Model 3 to a large positive (+1.50) in this final model. The large positive is more plausible – it would be expected that the more overcrowded a household is, the greater the likelihood of being missed. However, there is evidence to suggest that this is true only to a certain extent. Changing household structures, working patterns etc. have meant that it is increasingly difficult to contact one person and adult couple households, which have significantly increased in the past 30 years (Miller, 2003).

†† Since income has been fitted as a categorical variable to account for the non-linearity of its relationship with the number imputed, there are three ‘income’ variables shown in Table 4.5.2. These variables can be thought of as representing the quartiles of income deprivation.

27

Page 28: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

4.6 Odds Ratios The table below gives the odds ratio estimates derived from the coefficients in the model. Odds ratios are more easily interpretable in comparison with the logistic regression coefficients. For example, it can be concluded that residents in income deprivation group 3 (upper quartile: least deprived) are 1.1 times more likely to be under-enumerated compared to the reference category (group 0: most deprived). However, for residents in group 1 the comparative value is 0.9 i.e. a reduction. Thus the greatest amount of imputation occurred in the lowest and highest income groups. Table 4.6.1: Odds ratio estimates of the variables in final model Odds Ratio Estimates Point 95% Wald Effect Estimate Confidence Limits income 1 vs 0 0.924 0.832 1.027 income 2 vs 0 0.957 0.846 1.083 income 3 vs 0 1.107 0.946 1.297 health 1.275 1.183 1.374 sin_par_f 0.247 0.122 0.500 NS_SeC_3 1.077 1.062 1.092 no_car 1.018 1.013 1.023 single 1.011 1.004 1.019 shared 1.012 1.004 1.021 rent 1.023 1.015 1.031 over_crowd 4.457 2.705 7.344 _25to29_ 1.023 1.005 1.040

Overcrowding and non-car ownership still remain significant even with the addition of the Deprivation Indices. As stated earlier, the Scottish Index of Multiple Deprivation was not exhaustive in the number of domains considered. It did not consider housing deprivation – of which overcrowding is an indicator. Car ownership was also not used in any of the indices because of the ambiguity over whether the non-ownership of a vehicle is indicative of deprivation or a reflection of personal choice. Therefore, this could be a possible explanation why these two variables are shown to be still significant in the final model. Secondly, single parentage has a comparatively large negative coefficient - meaning that the fewer single parents there are in an area, the higher the number of “synthetic individuals” imputed. This seems peculiar, until we look at the One Number Census methodology. The Census Coverage Survey identified certain household structures into which imputed individuals were placed – for instance a large number of female single parent households with live-in boyfriends (particularly in the Glasgow region). The imputation process placed young males into these households, hereby changing their structure.

28

Page 29: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

4.7 Assessment of how well the models fit the data The assessment of fit, and the predictive power, of the model is a multi-faceted investigation involving summary tests and measures as well as diagnostic statistics. The most straightforward way to summarise results of a fitted logistic regression model is by a classification table. This table is the result of cross-classifying the outcome variable with a dichotomous variable whose values are derived from the estimated logistic probabilities [Hosmer & Lemeshow (2000) p.156]. Classification tables use different estimated probabilities to predict group membership. Therefore a model is said to be of reasonable fit if, based on some criterion, it predicts group membership correctly. The accuracy of the classification is measured by the sensitivity and specificity. Sensitivity is defined as the ability to correctly predict an event, while specificity is the ability to predict a non-event correctly. A more complete description of the classification accuracy is given by the area under the Receiver Operating Characteristic (ROC) curve. ROC curves plot the probability of detecting a true signal (sensitivity) against a false signal (1 – specificity) for an entire range of possible cut-points.

Figure 4.7.1 Assessment of fit of the different models

ROC curves

0

10

20

30

40

50

60

70

80

90

100

0 10 20 30 40 50 60 70 80 90 100

1- specificity

sens

itivi

ty

Model A - Final ModelModel B - Other Census Variables ModelModel C - Hard-to-Count Variables ModelModel D - Singular Deprivation ModelModel E - Multiple Deprivation Model

As a general rule we require c, the area under the ROC curve, to be 0.5 < c ≤ 1, in order to ascertain how good the model is at discriminating between events (synthetic individuals) and non-events (actual individuals). A good prediction curve is one that is well above the diagonal line. The graph (Figure 4.7.1) shows the ROC curves for all the models. The diagonal line has been superimposed in order to determine which curve achieves the best ‘lift’ and is therefore the best model.

29

Page 30: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

Furthermore, Figure 4.7.1 shows that even the ‘best’ model does not achieve optimal lift (the area under the ‘best’ model, is 0.631, which is not greatly above the minimum value of 0.5). Classification is sensitive to the relative sizes of the two component groups and always favours classification into the larger group. This is particularly true for this dataset, where the model is trying to split the population of Scotland (5,062,011) into 198,987 (events i.e. imputed persons) and 4,863,024 (non-events). Although the final model lift is not optimal, it does have some ability to discriminate between the events and non-events. More importantly, the analysis results show which variables are significantly associated with under-enumeration.

4.8 Summary The results of the modelling exercise showed that there were a number of Deprivation and Census independent variables that had some power at predicting under-enumeration. The best model was achieved by firstly, formulating a basic plan for selecting the model variables and secondly, using variety of methods to assess the adequacy of the model both in terms of its individual variables and its overall fit. Therefore, the final model with the best fit included the following variables:

• Income Deprivation • Health Deprivation • Shared dwellings • Private rented accommodation • Over-crowding • Residents aged 25-29 • Single residents • Female Single Parents • Residents in intermediate occupations • Non-car ownership

Therefore, for brevity, under-enumeration can be said to be a function of deprivation factors – income and health deprivation, non-car ownership, over-crowding and multi-occupancy; and transience factors – privately rented accommodation, intermediate occupational class, single (never married) and young people (aged 25-29). The number of young single mothers in an area could be construed to fit into both transience and deprivation categories

30

Page 31: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

5. Conclusions and Recommendations 5.1 Conclusions

This paper indicates that deprivation and certain household, demographic, economic and social characteristics appear to have some association with the level of under-enumeration in an area. For some characteristics, however, the associations are very modest and it is clear that, while there may be a variety of factors that affect whether or not an area suffers from under-enumeration, much of the variability remains unaccounted for by the deprivation and socio-economic factors alone. However these factors can parsimoniously be split into transience or deprivation characteristics. By definition an area is said to have a transient population if it has a one (or more) of the following characteristics:

o high proportion of single people o high proportion of people aged 20-29 o high proportion of people in intermediate occupations o high proportion of people in private rented accommodation

These findings appear to be in line with previous research establishing that under-enumeration is higher in areas characterised by certain social, economic and demographic characteristics. However, this paper quantifies how the different characteristics affected under-enumeration, by the successful use of logistic regression models.

The paper did not however look at all possible variables. There are a number of other factors (e.g. enumeration strategy) known to have an effect on under-enumeration. Additionally the paper’s emphasis is on associations; only an experimental study can provide evidence for causality. It is however, possible that other confounding variables, not considered, could account for the associations. While a selection of demographic, social and household variables was found to predict under-enumeration, these variables were obtained from the post-imputation Census database and thus any associations could be due the underlying One Number Census methodology. More definitive results could be achieved by using the pre-imputation database.

31

Page 32: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

5.2 Future Analysis The following future analysis may be justified:

The One Number Census methodology was principally concerned with identifying and adjusting for under-enumeration and assumes that over-enumeration (people being counted twice in the Census e.g. students counted both at home and at place of study) is negligible. An investigation into this assumption would be worthwhile.

Further investigation into whether using the pre-imputation database yields

similar results would help verify the validity of the models. The UK is the only country (currently) with a fully adjusted Census individual-

level database. This makes comparison of the accuracy of the One Number Census project with other countries difficult. As other countries adopt similar approaches, the logistic models developed could be applied to their under-enumeration.

The Census variables chosen in the modelling process were by no means

exhaustive. Further investigation could examine other attributes not considered.

The Scottish Executive has recently developed a new level of geography –

the datazone. Datazones are fixed small areas, created from 2001 Census Output Areas, which avoid the homogeneity and confidentiality issues posed by other small areas such as postcodes or settlements. They have been designed to be compact in shape and contain households with similar social characteristics. Future work could ascertain if the logistic regression results are replicated at the datazone level.

Public perceptions of a Census would have been a good indicator to examine

in the modelling process. Has a decline in individual commitment to civic duty – or an increase in apathy – been a factor? The last general election registered the lowest voter turnout post-war. Future work might investigate if voter turnout rates are associated with the level of under-enumeration.

32

Page 33: Modelling Census Under-Enumeration · Modelling Census Under-Enumeration ... analysis controls for all variables simultaneously during ... currently no up-to-date data to address

References

Boyle, P. and Dorling, D. (2004) The 2001 UK Census: remarkable resource or bygone legacy of the ‘pencil and paper era’? Area, Vol.36 Issue 2, 101-111.

Brown, J., Diamond, I., Chambers, R., Buckner L.J. and Teague, A.D. (1999). A methodological strategy for a one number census in the UK. J. R. Statist. Soc. A, 162, 247-267 Carstairs, V. and Morris, R. (1992) Deprivation and Health in Scotland. Aberdeen University Press. Gordon, D., Adelman, L., Ashworth, K., Bradshaw, J., Levitas, R., Middleton, S., Pantazis, C., Patsios, D., Payne, S., Townsend P. and Williams, J. (2000) Poverty and Social Exclusion in Britain. York: Joseph Rowntree Foundation Hosmer, D.W. and Lemeshow, S. (2000) Applied Logistic Regression, 2nd Edition. Wiley Series in Probability and Statistics. Macniven, D. (2004) Scotland’s Population 2003 – the Registrar General’s Annual Review of Demographic Trends. McCullagh, P.and Nelder, J.A. (1989) Generalised Linear Models, 2nd Edition. Chapman and Hall / CRC Miller, G. (2003) The Relationship Matrix - measuring the changing face of Scotland’s households. MSc. Dissertation. Napier University. Edinburgh. Noble, M., Penhale, G., Smith, G. and Wright, G. (2000) Measuring Multiple Deprivation at the Local Level, Index of Deprivation 1999 Review. Noble, M., Wright, G., Lloyd, M., Smith, G., Ratcliffe, A., McLennan, D., Sigala, M. and Antilla, C. (2003) Scottish Indices of Deprivation (2003) Report. Office of National Statistics (2002) User Manual for the National Statistics Socio-economic classification [NS-SeC]. Ruston, D. (2002) Difficulty in accessing key services, Office of National Statistics publication. Scottish Office (1999) Social Inclusion – opening the door to a better Scotland Townsend, P. (1979) Poverty in the United Kingdom: a survey of household resources and standards of living. Harmondsworth: Penguin

33