County level population estimation using knowledge-based image

APPROVED:

Pinliang Dong, Major Professor

Paul Hudak, Committee Member and Chair of the Department of Geography

Chetan Tiwari, Committee Member

James D. Meernik, Acting Dean of the Robert B. Toulouse School of Graduate Studies

COUNTY LEVEL POPULATION ESTIMATION USING KNOWLEDGE-BASED

IMAGE CLASSIFICATION AND REGRESSION MODELS

Anjeev Nepali, B.S.

Thesis Prepared for the Degree of

MASTER OF SCIENCE

UNIVERSITY OF NORTH TEXAS

August 2010

Nepali, Anjeev. County level population estimation using knowledge-

based image classification and regression models. Master of Science (Applied

Geography), August 2010, 65 pp., 11 tables, 24 illustrations, references, 38 titles.

This paper presents methods and results of county-level population

estimation using Landsat Thematic Mapper (TM) images of Denton County and

Collin County in Texas. Landsat TM images acquired in March 2000 were

classified into residential and non-residential classes using maximum likelihood

classification and knowledge-based classification methods. Accuracy

assessment results from the classified image produced using knowledge-based

classification and traditional supervised classification (maximum likelihood

classification) methods suggest that knowledge-based classification is more

effective than traditional supervised classification methods. Furthermore, using

randomly selected samples of census block groups, ordinary least squares (OLS)

and geographically weighted regression (GWR) models were created for total

population estimation. The overall accuracy of the models is over 96% at the

county level. The results also suggest that underestimation normally occurs in

block groups with high population density, whereas overestimation occurs

in block groups with low population density.

ii

Copyright 2010

by

Anjeev Nepali

iii

ACKNOWLEDGEMENTS

I would like to take this opportunity to express my appreciation to Dr.

Pinliang Dong for his full support and supervision throughout this project. I would

also like to acknowledge my committee members, Dr. Paul Hudak and Dr. Chetan

Tiwari, for their support and suggestions in preparing this thesis, and Dr. Bruce

Hunter for providing support for the software applications used in this thesis.

I would also like to thank my friends Naresh Kanaujiya, Sanjay Gurung,

Nick Enwright, and Aldo Avina for their comments and suggestions for improving

my thesis.

iv

TABLE OF CONTENTS

Page

ACKNOWLEDGEMENTS ..................................................................................... iii

LIST OF FIGURES ............................................................................................... vi

LIST OF TABLES ............................................................................................... viii

CHAPTER 1 INTRODUCTION ............................................................................. 1

Why Estimate Population? ................................................................................ 1

Why Use Remote Sensing for Population Estimation? ..................................... 2

Current Practice of Population Estimation Using Remote Sensing ................... 4

Research Objectives ......................................................................................... 9

CHAPTER 2 STUDY AREA AND DATA ........................................................... 10

Study Area ...................................................................................................... 10

Datasets .......................................................................................................... 12

CHAPTER 3 METHODOLOGY ......................................................................... 13

Image Calibration ............................................................................................ 13

Impervious Dataset ......................................................................................... 14

Calculation of Indices ...................................................................................... 15

Knowledge-Based Classification ..................................................................... 17

Development of the Knowledge-Based Classification Model for Denton County ........................................................................................................................ 20

Accuracy Assessment of Classified Images.................................................... 25

Input Data for Regression Models ................................................................... 26

Regression Modeling ...................................................................................... 26

Geographically Weighted Regression Model .................................................. 27

Accuracy Assessment of Population Estimation ............................................. 29

Relative Error (RE): ......................................................................................... 29

CHAPTER 4 RESULTS AND DISCUSSION ..................................................... 30

Results from Maximum Likelihood Classification (MLC) ................................. 30

Results from Knowledge-Based Classification of Landsat TM and Impervious Surface Data ................................................................................................... 31

Results from Knowledge-Based Classification Using Landsat TM Data Alone ........ 33

Regression Models ......................................................................................... 34

Linear Regression Models .............................................................................. 35

Geographically Weighted Regression ............................................................. 48

Discussion....................................................................................................... 51

CHAPTER 5 CONCLUSION.............................................................................. 57

REFERENCES ................................................................................................... 60

vi

LIST OF FIGURES

Page

1. Study area. .............................................................................................. 10

2. Flowchart of methodology. ....................................................................... 18

3. Flowchart of knowledge-based classification model. ............................... 19

4. Index images generated from TM image. ............................... 21

5. Classification indices value of various land use type vs. residential land use type. .................................................................................................. 22

6. Classification indices value of various land use type vs. residential land use type. .................................................................................................. 23

7. Spectral response of band 4 and band 7 in residential build-up area. ..... 24

8. Knowledge-based classification model developed for land use classification using Landsat 7 ETM+ image classification rules and conditions. ................................................................................................ 25

9. Classified image of Landsat TM produced from MLC classification. ........ 30

10. Classified image of Landsat TM after processing using impervious surface data. ......................................................................................................... 32

11. Results from knowledge-based classification using Landsat TM data alone. ....................................................................................................... 33

12. Linear regression models derived from sampling Denton County block-group level. .............................................................................................. 37

13. Linear regression models derived from sampling Denton County block-group level. .............................................................................................. 38

14. Scatter diagrams of relative population estimation error vs. population density at the census block-group level for Denton County. .................... 39

15. Scatter diagrams of relative population estimation error vs. population density at the census block-group level for Denton County. .................... 40

16. Scatter diagrams of relative population estimation error vs. population density at the census block-group level generated from general regression for Denton County (03/04) ....................................................................... 41

17. Linear regression models derived from sampling Collin County block-group level. .............................................................................................. 44

18. Scatter diagrams of relative population estimation error vs. population density at the census block-group level for Collin County. ....................... 45

19. Scatter diagrams of relative population estimation error vs. population density at the census block-group level generated by general linear regression Collin County (03/04/2000)..................................................... 46

20. Scatter diagram of relative population estimation error vs. population density for GWR model (Denton County)................................................. 50

21. Scatter diagram of relative population estimation error vs. population density for GWR model (Denton and Collin County). ............................... 50

22. Scatter diagram of relative population estimation error vs. population density for GWR model (Denton and Collin County). .............................. 51

23. Sparsely populated region on aerial, Landsat TM and classified image. . 53

24. Lake shore (sandy beach) and residential built-up on aerial, TM, and classified image. ...................................................................................... 54

viii

LIST OF TABLES

Page

1. Population estimates from July 1, 2006 to July 1, 2007 ........................... 11

2. Error matrix for maximum likelihood classification (MLC) ........................ 31

3. Error matrix for impervious surface data .................................................. 32

4. Error matrix for spectral response alone .................................................. 34

5. Summary of linear regression model results - Denton County .................. 36

6. Summary of population estimates produced by general regression model for Denton County .................................................................................... 41

7. Summary of linear regression model results - Collin County .................... 42

8. Summary of population estimates produced by general regression model for Collin County ...................................................................................... 46

9. Summary of linear regression model results - Denton and Collin County combined .................................................................................................. 47

10. Summary of population estimates produced by general regression model (Denton & Collin County combined) ......................................................... 48

11. Summary of geographically weighted regression model results .............. 49

1

CHAPTER 1

INTRODUCTION

Why Estimate Population?

Half of the world’s human population now lives in urban settlements, and that

share is growing rapidly (UNCHS, 2001). According to the United Nations Human

Settlements Programme (UN-Habitat), nearly 60% of the world population will be

urban dwellers by 2030. This rapid population growth has a direct impact on all

aspects of human development such as social behavior, health, education,

gender equality, economic development, job opportunities, and environment.

Population growth impacts the sustainability of natural resources through

processes of environmental deterioration, including deforestation and loss of

biodiversity. Therefore, there is an urgent need to develop methods that can

accurately estimate the spatial distribution of populations. This will allow decision

makers and planners to develop a better understanding

of the complex relationships among population growth, social and economic impacts,

environmental conditions, and decision-making processes (Lu, Weng, & Li, 2006).

Traditional population estimation is based on the census, which provides extensive

information on demographic parameters but is also very

labor intensive, time consuming, and costly. Furthermore, in the United States,

census data are collected only once every ten years, which is inadequate for

modeling population dynamics in rapidly changing urban environments

(Lee & Goldsmith, 1982). Decennial census datasets are therefore

becoming less representative of urban settlements that are

developing rapidly. For example, Denton County and Collin County were both listed

among the top 25 U.S. counties with the largest numeric increase in

population in one year (July 1, 2006 to July 1, 2007) in a 2008 U.S. Census Bureau

report (www.census.gov). Recent demographic data (population estimates and

projections) have become a more important source of information for

various applications, including decision making in marketing,

planning, government, and business. However, the resources available to

collect up-to-date demographic information for rapidly growing regions are still

inadequate.

Why Use Remote Sensing for Population Estimation?

Many researchers have used remotely sensed data such as aerial

photographs to estimate population in urban settlements since the 1950s. For

example, Green (1956) used aerial images to count the number of dwelling units

and identify dwelling types for his demographic analysis of Birmingham, Alabama.

Similarly, Collins and El-Beik (1971) used high spatial resolution images of

Leeds, England to investigate the correlation between dwelling type and resident

population. Lo (1986a) successfully applied the dwelling method (dwelling count and average

household size, using 1:20,000 scale aerial images) to estimate the

population of Athens, Georgia. The outcomes of Collins and El-Beik

(1971) and Lo (1986b) indicate that remote sensing techniques can be used to

estimate the population of small areas with high accuracy.

Early methods based on manual interpretation of remotely sensed data

were highly time consuming, tedious, labor intensive, and not feasible for

large areas (e.g., metropolitan or county level). They also required high spatial

resolution images and hence are not suitable for images with resolution coarser than

1 meter. In addition, consistency of the results might be an issue because the

results depend heavily on the judgment of the image analyst (Zha, Gao, & Ni, 2003). As a

result, image classification methods were employed to overcome the

shortcomings of manual methods. Image classification can provide researchers

with additional information (such as land use type, transportation network, and

impervious surface) related to urban settlements that play a crucial role in

population estimation (Hardin, Jackson, & Shumway, 2007; J. T. Harvey, 2002a;

S. Wu, Qiu, & Wang, 2005). Furthermore, readily available images from different

space-borne and airborne sensors with various spatial resolutions are making it

feasible to delineate ancillary information to assist population estimation. To sum

up, remote sensing methods have been developed for population estimation

because (1) remotely sensed images can provide spatial and spectral information

for residential areas; (2) remotely sensed images can cover large geographic

areas to support population estimation at different scales with less cost; and (3)

computer-based digital image analysis methods greatly facilitate information

extraction from remotely sensed images.

4

Current Practice of Population Estimation Using Remote Sensing

In the United States, “small area” generally indicates counties and their

subdivisions. However, some also apply the term “small area” to areas

composed of census tracts, block groups, and blocks. Remotely sensed

images with various spatial resolutions (high, medium, low) have been used for

estimating small area population. For example, high spatial resolution aerial

images were used by Lo and Welch (1977) and Lo (1986a) for their research.

Harvey (2002b; 2003), Lo (2003), and Li and Weng (2005) used medium spatial

resolution Landsat Thematic Mapper (TM) images, and Sutton et al. (1997; 2001) did

their research using low spatial resolution data from the Defense Meteorological

Satellite Program Operational Linescan System (DMSP OLS). However, these

different resolutions have their own complications. For example, images with very

high spatial resolution such as aerial photographs and IKONOS images can

create processing problems because of their massive data content and possible

spatial distortions while working on large areas; likewise, low spatial resolution

data, such as DMSP OLS, is unable to provide significant information for

population estimation at the regional and local levels. Because of their relatively

rich spectral information for land cover mapping and intermediate spatial

resolution to cover large areas, medium spatial resolution images, such as

Landsat TM/Enhanced TM (ETM+) images, have become the main image source

for population estimation (J. T. Harvey, 2002b; 2003; Li & Weng, 2005; Lo, 1995;

2003).

5

Different methods have been used for residential population estimation

based on remotely sensed data. Lo (1986b) summarized four distinct

approaches that are mainly used in remote sensing literature. They are based

on:

1. Counting individual dwelling units on high spatial resolution imagery

2. Extracting the size of the urban settlement from medium or high spatial

resolution images

3. Using land-use type classification for extracting urban settlement

4. Using automated digital image classification based on spectral features

of satellite imagery

The first three approaches have been previously used for visual

interpretation and analysis. However, the fourth technique has emerged as a

different methodology; it can be applied to any remotely sensed data using

particular spectral information and spatial resolution provided by the image (J. T.

Harvey, 2002b). Under Lo’s fourth approach, researchers have invested

considerable effort on modeling automated digital image analysis techniques

using various computer assisted methods. Among the various digital image

analysis techniques that are used for population estimation, supervised

maximum likelihood classification (MLC) is the most commonly used in the remote

sensing literature.

The basic MLC principle relies on decision rules that assign each image pixel

to the class for which it has the highest probability of membership. This classification method is faster

and less labor intensive than traditional census approaches. However,

automatic classification of remotely sensed data for extracting urban settlements

is difficult to achieve at high levels of accuracy because of the diverse

range of land cover types associated with the urban environment (Zha et al.,

2003). Most automated image classification methods (including MLC)

require, to some extent, training samples to run the algorithms. The size, location,

and representativeness of the training samples also play a pivotal role in the

reliability of the output of these classification methods. As a result, conventional

supervised classification is fairly time consuming and labor intensive

(Zha et al., 2003).
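
As an illustration only, the MLC decision rule described above can be sketched in Python for a two-band case. Each class is modeled as a bivariate Gaussian fitted to training pixels, and a pixel is assigned to the class with the highest likelihood. The training reflectances, the class names, the function names, and the assumption of equal prior probabilities are hypothetical and are not taken from this study.

```python
import math

# Hypothetical two-band training reflectances (illustrative values only,
# not taken from the thesis data).
TRAINING = {
    "residential": [(0.20, 0.25), (0.22, 0.27), (0.18, 0.24), (0.21, 0.28)],
    "non-residential": [(0.05, 0.45), (0.06, 0.50), (0.04, 0.42), (0.07, 0.48)],
}


def fit_class(samples):
    """Estimate the mean vector and 2x2 covariance terms from training pixels."""
    n = len(samples)
    mean_a = sum(s[0] for s in samples) / n
    mean_b = sum(s[1] for s in samples) / n
    var_a = sum((s[0] - mean_a) ** 2 for s in samples) / (n - 1)
    var_b = sum((s[1] - mean_b) ** 2 for s in samples) / (n - 1)
    cov_ab = sum((s[0] - mean_a) * (s[1] - mean_b) for s in samples) / (n - 1)
    return (mean_a, mean_b), (var_a, cov_ab, var_b)


def log_likelihood(pixel, stats):
    """Log of the bivariate Gaussian density of a pixel under one class."""
    (mean_a, mean_b), (var_a, cov_ab, var_b) = stats
    det = var_a * var_b - cov_ab ** 2
    da = pixel[0] - mean_a
    db = pixel[1] - mean_b
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (var_b * da * da - 2.0 * cov_ab * da * db + var_a * db * db) / det
    return -0.5 * (maha + math.log(det)) - math.log(2.0 * math.pi)


def classify(pixel, class_stats):
    """Assign the pixel to the class with the highest log-likelihood
    (equal prior probabilities are assumed)."""
    return max(class_stats, key=lambda name: log_likelihood(pixel, class_stats[name]))


CLASS_STATS = {name: fit_class(samples) for name, samples in TRAINING.items()}
print(classify((0.19, 0.25), CLASS_STATS))
```

The sketch also shows why training samples matter: the class means and covariances are estimated entirely from them, so unrepresentative samples shift the decision boundary.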

Langford et al. (1991) used land use classification to estimate the

population of northern Leicestershire based on supervised classification. They

used regression analysis that takes the number of pixels in each land use

category as explanatory variables. Likewise, Lo (1995) also used a regression

model to estimate population and number of dwelling units based on reflectance

and pixel counts as explanatory variables. Qiu et al. (2003) tested the regression

analysis approach using road networks derived from a geographic information system (GIS)

to estimate population.

Similarly, the dasymetric model is another well-known technique that uses

ancillary information from satellite imagery to estimate population.

Harvey (2000) adopted the dasymetric model in his study and argued that the

method significantly improved the efficiency of land use classification in

determining residential population estimates. Later, his method was

supported by Wu et al. (2005), who argued that the dasymetric method does

indeed produce more accurate estimates with remotely sensed ancillary

information than without it. In addition, more remote

sensing attributes, such as texture and temperature, were included as ancillary

information in remote sensing population research.

Wu and Murray (2005) and Lu et al. (2006) explored the possibility of

using impervious surface (any surface that water cannot infiltrate)

as a remote sensing attribute for population estimation.

Impervious surfaces are important ancillary information because they are associated

with roads, buildings, and other built-up areas that are relatively stable. Lu et al.

(2006) also used a regression model to estimate population. Their

approach produced an overall population estimation error of -0.97% for the study

area.

Scientists are finding ways to extract residential features more quickly and

precisely in order to develop a base for understanding the complexity of urban

ecosystems. To overcome the shortcomings of the available

approaches, Ridd (1995) proposed a pixel-based classification method that

uses the spectral properties of green vegetation, impervious surface material, and

surface soil to delineate urban pixels, because these are the major components of the

urban ecosystem. Ridd (1995) argued that the resulting vegetation-impervious

surface-soil (V-I-S) model

can discriminate urban built-up areas with high accuracy.

Qiao et al. (2009) also developed a pixel-based “unified conceptual model” for

discriminating urban areas more precisely. This model was based on Ridd’s V-I-S

model and uses spectral information, such as spectral indices and texture, from

remotely sensed data to perform image classification. Qiao et al. (2009) used a

hierarchical classification method that defines specific rules for classifying

land use classes based on the spectral properties of the features.

Regression modeling techniques on remotely sensed data have been

widely used for population estimation. Wu et al. (2005) argue that regression analysis

is widely used in the remote sensing literature on population estimation because

statistical significance tests provide an unbiased check of model accuracy.

To simplify the delineation of different land cover

classes in automated image analysis, researchers have developed

various indices derived from remotely sensed data.

The normalized difference vegetation index (NDVI) is one of the most commonly used

indices for delineating vegetation. In addition, the normalized difference water index

(NDWI) is used for mapping open water bodies from remotely sensed data.

Similarly, the normalized difference built-up index (NDBI) and the normalized difference

blue band built-up index (NDBBBI) are two other indices developed for mapping

urban settlements using satellite image data (Baraldi et al., 2006; Zha et al.,

2003).

9

Research Objectives

The objectives of the research are (1) to develop automated knowledge-based

classification models for extracting residential areas from Landsat

Thematic Mapper (TM) imagery; and (2) to develop linear regression and

geographically weighted regression (GWR) models using classified images and

census data to estimate population for Denton County and Collin County, Texas,

United States.

10

CHAPTER 2

STUDY AREA AND DATA

Study Area

Denton County and Collin County in north Texas were selected as the

study area for this research. According to the U.S. Census Bureau, both counties

were listed among the top 25 counties that had the largest numeric population

increase within a year (July 1, 2006 to July 1, 2007). According to Census 2000,

Denton County had 189 block groups, and Collin County had 282 block groups.

Figure 1. Study area.

11

Table 1

Population Estimates from July 1, 2006 to July 1, 2007

                                             Population Estimates        Change, 2006 to 2007
Rank  County                 State           July 1, 2007  July 1, 2006     Number  Percent
1     Maricopa County        Arizona            3,880,181     3,778,598    101,583      2.7
2     Riverside County       California         2,073,571     2,007,206     66,365      3.3
3     Harris County          Texas              3,935,855     3,876,306     59,549      1.5
4     Clark County           Nevada             1,836,333     1,777,168     59,165      3.3
5     Tarrant County         Texas              1,717,435     1,668,541     48,894      2.9
6     Bexar County           Texas              1,594,493     1,555,192     39,301      2.5
7     Wake County            North Carolina       832,970       794,129     38,841      4.9
8     Collin County          Texas                730,690       696,383     34,307      4.9
9     Travis County          Texas                974,365       941,577     32,788      3.5
10    Mecklenburg County     North Carolina       867,067       835,328     31,739      3.8
11    Pinal County           Arizona              299,246       268,316     30,930     11.5
12    Orleans Parish         Louisiana            239,124       210,198     28,926     13.8
13    Dallas County          Texas              2,366,511     2,337,956     28,555      1.2
14    Santa Clara County     California         1,748,976     1,720,839     28,137      1.6
15    Fulton County          Georgia              992,137       964,649     27,488      2.8
16    Gwinnett County        Georgia              776,380       749,836     26,544      3.5
17    San Diego County       California         2,974,859     2,948,362     26,497      0.9
18    Denton County          Texas                612,357       586,582     25,775      4.4
19    King County            Washington         1,859,284     1,834,194     25,090      1.4
20    Fort Bend County       Texas                509,822       485,482     24,340      5.0
21    Williamson County      Texas                373,363       350,879     22,484      6.4
22    Hidalgo County         Texas                710,514       689,494     21,020      3.0
23    Lee County             Florida              590,564       570,089     20,475      3.6
24    San Bernardino County  California         2,007,800     1,987,505     20,295      1.0
25    Montgomery County      Texas                412,638       393,233     19,405      4.9

Source: Population Division, U.S. Census Bureau

Release Date: March 20, 2008

12

Datasets

The datasets acquired for this project are:

1. Landsat TM images: Three TM images acquired in 2000 (2000/03/04 and

2000/03/12) were downloaded from the United States Geological Survey

(USGS). They were used to derive

ancillary information (residential area) for the study area.

2. Impervious surface image: The impervious surface image (2001) used in

this research was downloaded from the United States Geological Survey

(USGS).

3. Census shapefiles: Census shapefiles created by the U.S. Census Bureau for

2000 were used to build regression models to estimate population.

4. Census population data: U.S. Census Bureau population data (2000) were

used to build regression models and test population estimation accuracy.

5. Aerial image: An aerial image of Denton County acquired in 2000 was used

as reference for image classification accuracy assessment.

Software packages used in this research include ERDAS IMAGINE 9.3 for image

processing and ArcGIS 9.3 for GIS data analysis.

13

CHAPTER 3

METHODOLOGY

Image Calibration

Chander and Markham (2003) developed methods and parameters to

correct the radiometric calibration error caused by degradation of the sensor’s

internal calibrator over long-term use. According to Chander and Markham

(2003), the calibration process improves attributes derived from remotely

sensed data, such as spectral radiance, reflectance, and temperature estimates,

providing a better basis for comparing images acquired on different dates and/or by

different sensors. The methods and procedures suggested by Chander and

Markham (2003) for post-calibration of images are:

a. Conversion from digital number (DN) to radiance:

$$L_\lambda = \left(\frac{LMAX_\lambda - LMIN_\lambda}{Q_{calmax} - Q_{calmin}}\right)\left(Q_{cal} - Q_{calmin}\right) + LMIN_\lambda$$

Where,

Lλ = spectral radiance at sensor’s aperture in W/(m2*sr*μm);

Qcal = quantized calibrated pixel value in DNs;

Qcalmin = minimum quantized calibrated pixel value (DN = 0)

corresponding to LMINλ;

Qcalmax = maximum quantized calibrated pixel value (DN = 255)

corresponding to LMAXλ;

LMAXλ = spectral radiance that is scaled to Qcalmax in W/(m2*sr*μm);

LMINλ = spectral radiance that is scaled to Qcalmin in W/(m2*sr*μm).

b. Radiance to reflectance:

$$\rho_P = \frac{\pi \cdot L_\lambda \cdot d^2}{ESUN_\lambda \cdot \cos\theta_s}$$

Where,

ρP = unitless planetary reflectance;

Lλ = spectral radiance at sensor’s aperture;

d = earth-sun distance in astronomical units;

ESUNλ = mean solar exoatmospheric irradiance in W/(m2*μm);

θs = solar zenith angle in degrees.
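
The two calibration steps defined above (DN to radiance, then radiance to planetary reflectance) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, and the numeric values in the usage lines are placeholders chosen for illustration. In practice the band-specific LMIN, LMAX, and ESUN values would come from the calibration tables of Chander and Markham (2003), and the sun angle and earth-sun distance from the image metadata.

```python
import math


def dn_to_radiance(qcal, lmin, lmax, qcalmin=0.0, qcalmax=255.0):
    """Convert a quantized pixel value (DN) to at-sensor spectral radiance.

    lmin and lmax are the radiances scaled to qcalmin and qcalmax,
    in W/(m2*sr*um).
    """
    gain = (lmax - lmin) / (qcalmax - qcalmin)
    return gain * (qcal - qcalmin) + lmin


def radiance_to_reflectance(radiance, esun, sun_zenith_deg, earth_sun_dist_au=1.0):
    """Convert at-sensor radiance to unitless planetary reflectance.

    esun is the mean solar exoatmospheric irradiance in W/(m2*um);
    sun_zenith_deg is the solar zenith angle in degrees.
    """
    cos_zenith = math.cos(math.radians(sun_zenith_deg))
    return (math.pi * radiance * earth_sun_dist_au ** 2) / (esun * cos_zenith)


# Placeholder inputs for illustration only (not values reported in this thesis).
radiance = dn_to_radiance(qcal=120, lmin=-1.5, lmax=193.0)
reflectance = radiance_to_reflectance(radiance, esun=1957.0, sun_zenith_deg=40.0)
print(radiance, reflectance)
```

Keeping the two steps as separate functions mirrors the procedure in the text and makes it straightforward to apply them band by band.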

Impervious Dataset

Surfaces or features that prevent water from infiltrating into the soil are

defined as impervious surfaces. They are a major component of urban

infrastructure and are thus considered an important indicator of urban settlement in

remotely sensed datasets (C. Wu & Murray, 2005). Impervious images represent

the percentage of impervious surface in a pixel. They are produced using

remotely sensed data such as ETM+ and Terra ASTER, and tend to preserve

more spectral information that can be useful for urban land use classification (Ji

& Jensen, 1999; Li & Weng, 2005). The impervious image used in this research

was prepared by USGS using spectral information from the ETM+ dataset.

15

Calculation of Indices

Normalized difference vegetation index (NDVI):

NDVI is widely used for predicting vegetation characteristics from remote

sensing images. Vegetation has low reflectance in the red (R) band and high

reflectance in the near-infrared (NIR) band. NDVI uses this contrast

between the two bands to detect vegetation:

$$NDVI = \frac{NIR - R}{NIR + R}$$

Where, NIR = reflectance value of near-infrared band

R = reflectance value of red band

Modified normalized difference water index (MNDWI):

This index is generally used to identify water bodies in

satellite imagery. MNDWI uses the relation between the green (G) band and the mid-infrared

(MIR) band to separate water pixels from other pixels because water

has higher reflectance in the green band and lower reflectance in the MIR band:

$$MNDWI = \frac{G - MIR}{G + MIR}$$

Where, MIR = reflectance value of mid-infrared band

G = reflectance value of green band

16

Normalized difference built-up index (NDBI):

This index was proposed by Zha et al. (2003) for mapping urban built-up

areas instead of using the NDVI and MNDWI indices. They concluded that urban

settlement has higher reflectance in the mid-infrared band, and that using the mid-infrared and

near-infrared bands to define the index is more appropriate:

$$NDBI = \frac{MIR - NIR}{MIR + NIR}$$

Where, MIR = reflectance value of mid-infrared band

NIR = reflectance value of near-infrared band
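
The NDVI, MNDWI, and NDBI indices described in this section share the same normalized-difference form, which can be sketched in Python as below. This is an illustrative sketch only: the function names are hypothetical, the inputs are assumed to be per-pixel reflectance values, and returning 0.0 when both bands are zero is an assumption made here to avoid division by zero, not a rule stated in the sources cited.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b) for two band reflectances.

    Returning 0.0 when a + b == 0 is an assumption of this sketch.
    """
    total = a + b
    if total == 0:
        return 0.0
    return (a - b) / total


def ndvi(nir, red):
    """Vegetation index: high where NIR reflectance exceeds red reflectance."""
    return normalized_difference(nir, red)


def mndwi(green, mir):
    """Water index: high where green reflectance exceeds MIR reflectance."""
    return normalized_difference(green, mir)


def ndbi(mir, nir):
    """Built-up index: high where MIR reflectance exceeds NIR reflectance."""
    return normalized_difference(mir, nir)


# Hypothetical reflectances for a single vegetated pixel (illustration only).
print(ndvi(nir=0.50, red=0.10), mndwi(green=0.08, mir=0.20), ndbi(mir=0.20, nir=0.50))
```

Because every index is bounded between -1 and 1, thresholds on different indices can be compared and combined in a single set of classification rules.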

Normalized difference blue band built-up index (NDBBBI):

This index was discussed by Baraldi et al. (2006) as a suitable index for

detecting urban pixels. This expression exploits the relation between the mid-infrared and

blue (B) bands to delineate urban pixels in remotely sensed data.


B = reflectance value of blue band

Wetness index (WI):

Spectral properties of soil depend on various soil attributes such as soil

type, texture, moisture, and organic matter content. Consequently, satellite

imagery can show varying spectral

behavior depending on soil properties and classification, which makes determining

a soil wetness index from satellite imagery challenging. However, moist soil

generally shows similar spectral properties, which is useful for separating it from

other classes such as vegetation, residential, and commercial. In general, wet

soil exhibits low reflectance values in all TM bands, and the soil wetness index

(WI) obtained from the tasseled cap indices (explained by Crist et

al. (1986)) can assist in separating it from other land cover types such as

vegetation, industrial, and residential classes in remotely sensed data (Crist et

al., 1986).

The wetness index (WI) can be defined as (Todd, Hoffer, & Milchunas, 1998):

Knowledge-Based Classification

This classification method is also termed expert classification or rule-based classification. It integrates and processes information from multiple knowledge layers (e.g., spectral, temporal) derived from remotely sensed data to produce a single classified image. For land use classification, a knowledge-based classifier lets the user define the required classes and classification rules based on the user's knowledge, land use characteristics, and multi-spectral and multi-temporal remotely sensed data. The approach follows a hierarchical expert decision tree to classify the defined variables.

Different knowledge-based layers such as impervious surface, NDVI, MNDWI, NDBI, and NDBBBI were used within a hierarchical classification tree to produce the classified image required for this research. All processing was performed using the “knowledge engineer” tool available in ERDAS IMAGINE 9.3 (ERDAS, 2008).

Figure 2. Flowchart of methodology.

Figure 3. Flowchart of knowledge-based classification model.

Development of the Knowledge-Based Classification Model for Denton County

The index images (NDVI, MNDWI, NDBI, NDBBBI, and WI) derived from the TM image showed that specific land use types have notably different index values compared with other land use types. For example, the NDVI image showed notably high index values for vegetated areas; water is represented by the highest index values on the MNDWI image; and the NDBI image showed higher index values in residential built-up areas than in vegetated areas. These results helped define the threshold values used in the hierarchical knowledge-based classification model to extract residential pixels from the TM image.
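The idea of threshold rules arranged in a hierarchy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold constants, the rule order, and the function name `classify` are hypothetical and do not reproduce the rules of the model built in ERDAS IMAGINE.

```python
import numpy as np

# Hypothetical thresholds for illustration only; they are NOT the values
# defined in the knowledge-based model of this study.
NDVI_VEG = 0.40
MNDWI_WATER = 0.20
NDBI_BUILT = 0.00

def classify(ndvi, mndwi, ndbi):
    """Assign a land use label per pixel with ordered (hierarchical) rules.

    Rules are applied from lowest to highest priority, so later assignments
    overwrite earlier ones: water > vegetation > residential > other.
    """
    ndvi, mndwi, ndbi = (np.asarray(a, dtype=float) for a in (ndvi, mndwi, ndbi))
    labels = np.full(ndvi.shape, "other", dtype=object)
    labels[ndbi > NDBI_BUILT] = "residential"
    labels[ndvi > NDVI_VEG] = "vegetation"
    labels[mndwi > MNDWI_WATER] = "water"
    return labels

# Four made-up pixels: vegetated, water, built-up, and none of these.
labels = classify(ndvi=[0.70, 0.05, 0.15, 0.10],
                  mndwi=[-0.30, 0.60, -0.10, -0.20],
                  ndbi=[-0.20, -0.40, 0.10, -0.05])
```

Ordering the rules by priority is one simple way to mimic a decision tree: a pixel that satisfies several conditions takes the label of the highest-priority rule.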

Figure 4. Index images generated from the TM image. (a) Landsat image; (b) NDVI; (c) MNDWI; (d) NDBI; (e) wetness index; (f) NDBBBI.

Figure 5. Classification index values of various land use types vs. residential land use type (30 samples per class). (a) Vegetation vs. residential (NDVI); (b) Water vs. residential (MNDWI); (c) Commercial/Industrial/Transportation (CIT) vs. residential (NDVI).

Figure 6. Classification index values of various land use types vs. residential land use type (30 samples per class). (a) Soils vs. residential (NDBI); (b) Soils/construction sites vs. residential (NDBBBI); (c) Wetland vs. residential (wetness index, WI).

Figure 7. Spectral response of band 4 and band 7 in residential built-up area (reflectance of 30 samples).

The graphs in Figures 5, 6, and 7 show the usefulness of various spectral indices in discriminating residential land use from other land use types. For example, the normalized difference vegetation index (NDVI) graphs show that vegetation, commercial, and industrial pixels can be distinguished from urban pixels (Figures 5a, 5c). Similarly, the normalized difference built-up index (NDBI) and the normalized difference blue band built-up index (NDBBBI) can be used to delineate urban built-up areas from surrounding features such as bare soil and construction materials or sites (Figures 6a, 6b). Figure 7 shows that the spectral responses of band 4 (NIR) and band 7 (TM7) in the residential land use type can be used to define additional rules in the knowledge-based model for extracting residential built-up pixels.

Figure 8. Knowledge-based classification model developed for land use

classification using Landsat 7 ETM+ image classification rules and conditions.

Accuracy Assessment of Classified Images

Accuracy assessment for remote sensing image classification is the process of evaluating the quality of a classified image by identifying and measuring classification error. It is important because it defines the reliability of the information provided by the image, which in turn may be used in decision-making processes (Congalton & Green, 1999).

Since the study concentrates on classifying urban settlements, the classified image produced from the knowledge-based model is reclassified into two groups: “residential” and “non-residential”. The “Accuracy assessment” tool available in ERDAS IMAGINE 9.3 is then used to test the accuracy of the reclassified images. Random sample plots were allocated across the study area (Denton County), and a 3×3 pixel cluster was selected as the minimal area for defining the class of a sample. An aerial image of Denton County acquired in 2000 was used as the reference for manually determining the class of each sample.

Input Data for Regression Models

The census block group is the unit used to build the regression models for population estimation. In this step, the total number of residential pixels within each census block group is calculated using a zonal statistics tool in ArcGIS. The tool calculates statistics (e.g., sum, mean, median, standard deviation) on the values of a raster within the zones of another dataset (raster or vector) and reports the results in a table (ArcGIS, ESRI).
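The zonal summation described above can be illustrated with a small NumPy sketch. It assumes the block groups have already been rasterized to integer zone ids on the same grid as the classified image; the function name `residential_area_by_zone` and the toy arrays are hypothetical, and this is not the ArcGIS tool itself.

```python
import numpy as np

PIXEL_AREA_M2 = 30 * 30  # Landsat TM pixel, 30 m x 30 m

def residential_area_by_zone(residential, zones, n_zones):
    """Sum residential pixel area (sq. m) within each zone.

    residential: 2-D array, 1 for residential pixels and 0 otherwise.
    zones: 2-D integer array of block-group ids in the range 0..n_zones-1.
    """
    residential = np.asarray(residential).ravel()
    zones = np.asarray(zones).ravel()
    # bincount with weights adds up the residential flags per zone id.
    counts = np.bincount(zones, weights=residential, minlength=n_zones)
    return counts * PIXEL_AREA_M2

# Toy 2 x 3 raster with two block groups (ids 0 and 1).
residential = np.array([[1, 0, 1],
                        [1, 1, 1]])
zones = np.array([[0, 0, 1],
                  [0, 1, 1]])
area = residential_area_by_zone(residential, zones, n_zones=2)
```

Here zone 0 contains two residential pixels and zone 1 contains three, so the returned areas are multiples of the 900 sq. m pixel size.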

Regression Modeling

Regression is a statistical technique for investigating the relationship between a dependent variable and one or more independent variables. Previous work on population estimation has used regression methods to relate the spectral reflectance of pixels to population density. For example, Iisaka and Hegedus (1982) used the spectral reflectance of bands 4, 5, and 7 as input to a regression model for estimating population distribution in Tokyo, Japan. Similarly, Harvey (2002a) used spectral characteristics such as indices, band sums, and band differences as variables in regression analysis to estimate population.

Two regression methods, linear regression and geographically weighted regression (GWR), are used in this study.

Linear regression model:

The linear regression model can be defined as follows:

$$P_e = a + bx$$

Where,

Pe = population estimate

a = regression intercept

b = slope

x = area of residential pixels in a block group
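A minimal sketch of fitting this model by ordinary least squares is shown below, using NumPy rather than the software used in the study. The function name `fit_linear_model` and the four sample block groups are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

def fit_linear_model(area, population):
    """Ordinary least squares fit of population = a + b * area.

    Returns (a, b, r_squared).
    """
    area = np.asarray(area, dtype=float)
    population = np.asarray(population, dtype=float)
    b, a = np.polyfit(area, population, deg=1)  # slope first, then intercept
    predicted = a + b * area
    ss_res = np.sum((population - predicted) ** 2)
    ss_tot = np.sum((population - population.mean()) ** 2)
    return a, b, 1.0 - ss_res / ss_tot

# Hypothetical block groups: residential area (sq. m) and census population.
area = [200_000, 500_000, 900_000, 1_400_000]
population = [1_000, 1_900, 3_100, 4_600]
a, b, r2 = fit_linear_model(area, population)
```

Because the made-up data lie exactly on a line, the fit recovers an intercept of about 400 and a slope of about 0.003 persons per square meter; real block-group samples would of course scatter around the line.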

Geographically Weighted Regression Model

In spatial datasets, the relationships between the dependent and independent variables can differ across geographic space; i.e., the same attribute can have a different effect on the model in different parts of the study region (Fotheringham, Brunsdon, & Charlton, 2002). Global regression models such as linear regression, however, examine the relationship between the dependent and independent variables without explicitly considering variations that may arise from their spatial context. There is therefore a need for modeling techniques that define the relationship between variables locally, i.e., whose results depend on location (Charlton, Fotheringham, & Brunsdon, 2009).

Geographically weighted regression (GWR) is a local spatial statistical technique for analyzing relationships between attributes that vary across geographic space (Fotheringham et al., 2002). Unlike a traditional global model, the GWR model allows the regression parameters to vary with location, thus providing more detailed information for understanding and analyzing geographic data. A GWR model also incorporates a spatial weighting function, which defines the relationship among neighboring observations according to spatial variation across the study area (Fotheringham et al., 2002). Because of its ability to incorporate spatial attributes, GWR is widely used in fields such as geography, remote sensing, and environmental science (Mennis, 2006). A GWR model can be expressed as (Lo, 2008):

$$Y_i = a_{i0} + \sum_{k=1}^{n} a_{ik} x_{ik} + e_i \qquad (i = 1, 2, \ldots, n) \qquad (9)$$

where $Y_i$ and $x_{ik}$ are the dependent and independent variables at location i, k = 1, 2, …, n, $e_i$ are normally distributed error terms (with zero mean and constant variance at point i), and $a_{ik}$ is the value of the k-th parameter at location i.
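To make the idea of location-specific parameters concrete, the sketch below fits a separate weighted least squares regression at every observation point with a single explanatory variable. It is a simplified illustration under stated assumptions: it uses a fixed Gaussian kernel with a user-supplied bandwidth, whereas GWR software typically also offers adaptive kernels and bandwidth selection; the function name `gwr_fit` and the sample data are hypothetical.

```python
import numpy as np

def gwr_fit(coords, x, y, bandwidth):
    """Fit a local weighted regression y = a_i0 + a_i1 * x at every location.

    Uses a fixed Gaussian kernel; returns an (n, 2) array of local
    [intercept, slope] estimates, one row per location.
    """
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    params = np.empty((len(x), 2))
    for i, point in enumerate(coords):
        dist = np.linalg.norm(coords - point, axis=1)
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)
        sw = np.sqrt(w)
        # Weighted least squares via row scaling by sqrt(weight).
        params[i], *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return params

# Hypothetical data: a western cluster where y = 2x and an eastern one where y = 4x.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
x = [1, 2, 3, 1, 2, 3]
y = [2, 4, 6, 4, 8, 12]
local = gwr_fit(coords, x, y, bandwidth=2.0)
```

A global regression would return one compromise slope for all six points, while the local fits recover a slope near 2 in the west and near 4 in the east.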


Accuracy Assessment of Population Estimation

Accuracy assessment is an important procedure for testing the regression models developed in population estimation research. Following Lu et al. (2006), three error measures are used for accuracy assessment:

Relative Error (RE):

Relative error compares the estimate produced by the model with the census count to test the goodness of the model.

$$RE = \frac{P_e - P_r}{P_r} \times 100$$

Where,

Pe = estimated population calculated from the regression model

Pr = reference population (for this research: block group population)

Mean relative error (MRE):

In the same way, MRE can be used to test the overall performance of the model over the study area.

$$MRE = \frac{1}{n} \sum_{k=1}^{n} \left| RE_k \right|$$

Where,

RE = relative error; n = total number of census block groups used in the study area

Median relative error (MdRE):

This measure is used to reduce the effect of extreme values on the overall result.
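The three measures, together with a total error for the whole study area, can be sketched as follows. The sketch assumes that MRE and MdRE are taken over absolute relative errors and that total error compares the summed estimates with the summed reference population; the function names and the four sample block groups are hypothetical.

```python
import numpy as np

def relative_errors(estimated, reference):
    """Relative error (%) of each estimate against its reference value."""
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return (estimated - reference) / reference * 100.0

def summarize(estimated, reference):
    """Return (MRE, MdRE, total error), all in percent.

    MRE and MdRE use absolute relative errors (an assumption of this sketch);
    total error compares summed estimates with the summed reference.
    """
    re = relative_errors(estimated, reference)
    total = (np.sum(estimated) - np.sum(reference)) / np.sum(reference) * 100.0
    return np.mean(np.abs(re)), np.median(np.abs(re)), total

# Hypothetical block groups; the first one has a very small population.
reference = [5, 1000, 2000, 4000]
estimated = [25, 900, 2100, 3600]
mre, mdre, total = summarize(estimated, reference)
```

In this made-up case a single small block group with a 400% relative error pulls the mean far above the median, which illustrates why the median is reported alongside the mean.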


CHAPTER 4

RESULTS AND DISCUSSION

Results from Maximum Likelihood Classification (MLC)

The Landsat Thematic Mapper (TM) image of Denton County was classified into several land use classes (vegetation, water, commercial-industrial-transportation (CIT), soils, and residential) based on training samples defined using an aerial image of Denton County. The classified image was then reclassified into two major classes, residential and non-residential, to extract residential built-up areas.

Figure 9. Classified image of Landsat TM produced from MLC classification. (a) TM classified image (MLC); (b) reclassified image from (a). Black pixels are classified residential areas.

Table 2

Error Matrix for Maximum Likelihood Classification (MLC)

TM image               True data: Non-residential   True data: Residential   Total   User accuracy (%)
Non-residential        229                          5                        234     97.86
Residential            9                            19                       28      67.85
Total                  238                          24                       262
Producer accuracy (%)  96.22                        79.16                            Overall accuracy: 94.65

Accuracy assessment was performed by visual interpretation, using a high-resolution aerial image as reference data. Random samples were selected throughout Denton County. Although an overall accuracy of nearly 95% was achieved, the classified image achieved only 67.85% user's accuracy and 79.16% producer's accuracy for the residential class.
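The accuracy measures reported in Table 2 can be recomputed from the error matrix counts. The sketch below is illustrative only; the function name `accuracy_from_error_matrix` is hypothetical, and it assumes rows hold the classified (TM image) classes and columns hold the reference classes, as in Table 2.

```python
import numpy as np

def accuracy_from_error_matrix(matrix):
    """Compute accuracies (%) from a square error matrix.

    Rows are classified (map) classes and columns are reference classes.
    Returns (user_accuracy, producer_accuracy, overall_accuracy).
    """
    m = np.asarray(matrix, dtype=float)
    correct = np.diag(m)
    user = correct / m.sum(axis=1) * 100.0      # correct / row total
    producer = correct / m.sum(axis=0) * 100.0  # correct / column total
    overall = correct.sum() / m.sum() * 100.0
    return user, producer, overall

# Counts from Table 2 (rows and columns: non-residential, residential).
table2 = [[229, 5],
          [9, 19]]
user, producer, overall = accuracy_from_error_matrix(table2)
```

The computed values agree with Table 2 to within the last reported digit; small differences in the second decimal place may simply reflect truncation rather than rounding in the table.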

Results from Knowledge-Based Classification of

Landsat TM and Impervious Surface Data

A knowledge-based classification model was developed using rules based on spectral indices and impervious surface layer characteristics. The TM image was classified using this model and then reclassified into two major classes: residential and non-residential.

Figure 10. Classified image of Landsat TM after processing using impervious surface data. (a) Classified image using Landsat TM and impervious surface layer; (b) reclassified residential and non-residential areas based on results in (a).

Table 3

Error Matrix for Impervious Surface Data

TM image               True data: Non-residential   True data: Residential   Total   User accuracy (%)
Non-residential        354                          5                        359     98.60
Residential            6                            31                       37      83.78
Total                  360                          36                       396
Producer accuracy (%)  98.33                        86.11                            Overall accuracy: 97.22

Table 3 summarizes the accuracy assessment results for the classified image produced from the knowledge-based model. This classification produced an overall accuracy of over 97%, with improved user's and producer's accuracy for the residential built-up class compared with the MLC result.

Results from Knowledge-Based Classification Using Landsat TM Data Alone

In the same way, a knowledge-based model was developed using Landsat TM spectral properties alone. The image was classified into a number of land use classes (vegetation, water, CIT, soils, and residential) based on the spectral properties described by the spectral indices. The image was then reclassified into two major classes, residential and non-residential, and subjected to accuracy assessment.

Figure 11. Results from knowledge-based classification using Landsat TM data alone. (a) TM classified image using spectral attributes; (b) TM reclassified image using spectral attributes.

Table 4

Error Matrix for Spectral Response Alone

TM image               True data: Non-residential   True data: Residential   Total   User accuracy (%)
Non-residential        675                          14                       689     97.97
Residential            22                           133                      155     85.81
Total                  697                          147                      844
Producer accuracy (%)  96.84                        90.48                            Overall accuracy: 95.73

Table 4 summarizes the accuracy assessment results for the classified image. An overall accuracy of over 95% was achieved. The error matrix in Table 4 also shows improved user's and producer's accuracy in delineating residential pixels.

Compared with MLC and the knowledge-based model using the impervious surface layer, the knowledge-based classification model built from TM spectral attributes alone was the most effective method for delineating residential areas. In addition, this model does not need impervious surface data for a study area, which makes it easier to apply in other study areas.

Regression Models

Based on the image classification results described in the previous section, the classified images produced from the knowledge-based model using Landsat data alone were used as base images for regression modeling. Random samples from the census block-group dataset were selected to generate linear regression models. The models were then applied to the entire study area to estimate the population of the region. Relative errors at the block-group level were calculated to assess model accuracy in estimating population.

Linear Regression Models

Denton County (03/04/2000 Image)

Table 5 summarizes 12 linear regression models produced from different census block-group samples, using block-group population as the dependent variable and block-group residential pixel area as the explanatory variable. In addition, a general regression model is derived by incorporating the models generated from the different sample groups. A total of 42 random block groups were selected for generating each linear regression model. The high R2 values for the sampled block groups suggest that the size of the residential area is highly correlated with population at the block-group level.

Table 5

Summary of Linear Regression Model Results - Denton County

Sample   R2       Total error (%)   Mean error (%)   Median error (%)   Model
1        0.7088    6.0              41.40            27.50              y = 0.002804x + 595.18
2        0.7077   -3.68             37.61            26.32              y = 0.002385x + 652.90
3        0.7203    1.38             38.35            24.84              y = 0.002750x + 531.41
4        0.6312    3.7              43.92            31.50              y = 0.002347x + 847.87
5        0.6707   -1.88             35.91            22.92              y = 0.002806x + 419.96
6        0.8093   -1.54             33.38            22.46              y = 0.002896x + 368.90
7        0.6707    0.11             38.84            25.14              y = 0.002568x + 620.61
8        0.7671    2.60             37.97            24.85              y = 0.002952x + 427.86
9        0.6999    4.17             38.18            25.64              y = 0.00311x + 360.29
10       0.8075   -0.66             37.86            24.32              y = 0.002620x + 568.96
11       0.7918   -1.86             38.55            26.30              y = 0.002430x + 665.10
12       0.7451   -5.35             34.63            23.27              y = 0.002676x + 425.01

General linear equation model: total error -1.858%; Y = 0.002695X + 492.92

Selected regression models are shown in Figures 12 and 13. Scatter diagrams were produced to analyze the relationship between the relative population estimation error and population density for each census block group of Denton County (Figures 14 and 15).

Figure 12. Linear regression models derived from sampling Denton County at the block-group level (population vs. residential area in sq. m). (a) y = 0.002896x + 368.9, R² = 0.8094; (b) y = 0.002568x + 620.62, R² = 0.6631; (c) y = 0.002952x + 427.87, R² = 0.7672.

Figure 13. Linear regression models derived from sampling Denton County at the block-group level (population vs. residential area in sq. m). (a) y = 0.00311x + 360.3, R² = 0.6999; (b) y = 0.002620x + 568.96, R² = 0.8075; (c) y = 0.002430x + 665.1, R² = 0.7919.

Figure 14. Scatter diagrams of relative population estimation error (%) vs. population density (persons per sq. km) at the census block-group level for Denton County. Panels (a), (b), and (c).

Figure 15. Scatter diagrams of relative population estimation error (%) vs. population density at the census block-group level for Denton County. Panels (a), (b), and (c).

Denton County (03/12/2000 image)

A classified image was generated from the 03/12/2000 Landsat image using the knowledge-based classification model. This image was then used to estimate population with the general regression model produced for Denton County in Table 5. The total population estimates and total errors produced by the linear regression are summarized in Table 6. Figure 16 summarizes the relationship between relative population estimation error and block-group population density for the general regression model in Table 5.

Table 6

Summary of Population Estimates Produced by General Regression Model for Denton County

General linear equation (Denton County): Y = 0.002695X + 492.92

Image        Total population   Est. population   Total error (%)
03/04/2000   432976             424929            -1.86
03/12/2000   432976             458115             5.80

Figure 16. Scatter diagrams of relative population estimation error (%) vs. population density at the census block-group level generated from the general regression model for Denton County (03/04/2000).

Collin County (03/04/2000)

In the same way, Table 7 summarizes 12 linear regression models produced from random census block-group samples of Collin County. As before, block-group population was the dependent variable and block-group residential pixel area the explanatory variable. A general regression model is derived by incorporating the models generated from the different sample groups.

Table 7

Summary of Linear Regression Model Results - Collin County

Sample   R2       Total error (%)   Mean error (%)   Median error (%)   Model
1        0.8176    2.01             239.82           21.44              y = 0.002502x + 258.17
2        0.7986   -7.88             283.52           20.85              y = 0.001704x + 570.56
3        0.8137   -6.00             282.86           20.76              y = 0.001799x + 545.99
4        0.6893    4.54             243.94           20.76              y = 0.002581x + 254.61
5        0.7123    1.75             233.13           21.27              y = 0.002548x + 225.86
6        0.8180   -4.09             279.74           20.47              y = 0.001899x + 518.35
7        0.7918   -1.86             299.27           26.30              y = 0.002430x + 665.10
8        0.8504    3.12             314.91           22.18              y = 0.001934x + 622.57
9        0.6472    7.01             231.02           21.91              y = 0.002804x + 161.99
10       0.8034    3.95             309.42           21.50              y = 0.002016x + 587.64
11       0.7918    0.24             229.09           21.82              y = 0.002515x + 219.22
12       0.7595   -2.27             275.24           20.81              y = 0.002015x + 479.09

General linear equation model: total error 2.10%; Y = 0.00223X + 425.75

A total of 60 random block groups were selected as the sample for generating each linear regression model presented in Table 7. Figure 17 shows graphs of selected regression models from Table 7. Because of a few anomalies in the relative population estimation error, the scatter diagrams were compressed and became less representative. For better graphical representation, these extreme values, which account for only 1% of the data, were removed from the scatter plots (Figure 18).

Figure 17. Linear regression models derived from sampling Collin County at the block-group level (population vs. residential area in sq. m). (a) y = 0.002016x + 587.65, R² = 0.8034; (b) y = 0.002515x + 219.23, R² = 0.7451; (c) y = 0.002015x + 479.09, R² = 0.7596.

Figure 18. Scatter diagrams of relative population estimation error (%) vs. population density (persons per sq. km) at the census block-group level for Collin County. Panels (a), (b), and (c).

Collin County (03/12/2000 image)

Again, the knowledge-based classification model was used to generate a classified image of Collin County from the 03/12/2000 image. This image was used to produce population estimates with the general regression model from Table 7. A summary of the results is presented in Table 8.

Table 8

Summary of Population Estimates Produced by General Regression Model for Collin County

General linear equation (Collin County): Y = 0.002229X + 425.75

Image        Total population   Est. population   Total error (%)
03/04/2000   491675             502015             2.10
03/12/2000   491675             476297            -3.12

Figure 19. Scatter diagrams of relative population estimation error vs. population density (persons per sq. km) at the census block-group level generated by the general linear regression model for Collin County (03/04/2000).

Denton & Collin County (03/04/2000 and 03/12/2000 image)

Similarly, the classified images of Denton and Collin counties were combined to generate a general regression model for producing population estimates for both counties. Table 9 summarizes the regression models produced from different sample groups selected from both counties.

Table 9

Summary of Linear Regression Model Results - Denton and Collin County Combined

Sample   R2       Total error (%)   Mean error (%)   Median error (%)   Model
1        0.7520    8.90             206.87           20.81              y = 0.002653X + 479.54
2        0.6701   10.07             212.25           27.99              y = 0.002646X + 506.57
3        0.7744   -0.80             206.54           24.94              y = 0.002165X + 594.11
4        0.7457    4.26             224.74           27.29              y = 0.002195X + 674.27
5        0.7691    2.72             212.80           25.84              y = 0.002264X + 601.30
6        0.5611    1.99             191.58           23.54              y = 0.002491X + 444.36
7        0.5454    3.29             185.56           23.28              y = 0.002635X + 380.37
8        0.5688   -3.98             217.99           24.79              y = 0.001868X + 717.12
9        0.7251    3.21             239.13           28.54              y = 0.001970X + 794.74
10       0.7083    1.36             238.71           28.27              y = 0.001883X + 812.90

General linear equation model: total error 3.10%; Y = 0.002277X + 600.52

A total of 100 random block groups were selected from Denton and Collin counties for generating each linear regression model. The general regression model in Table 9 was then used to estimate the population of Denton County, Collin County, and both counties combined. Table 10 summarizes the results.

Table 10

Summary of Population Estimates Produced by General Regression Model (Denton & Collin County Combined)

General linear equation (Denton & Collin County combined): Y = 0.002277X + 600.52

Image        County     Total population   Est. population   Total error (%)
03/04/2000   Denton     432976             393807            -9.05
03/04/2000   Collin     491675             559539            13.80
03/04/2000   Combined   924651             953351             3.10
03/12/2000   Denton     432976             421846            -2.57
03/12/2000   Collin     491675             533253             8.45
03/12/2000   Combined   924651             965565             4.42

Geographically Weighted Regression

Table 11 shows the errors in the population estimates when the geographically weighted regression (GWR) model is applied to the entire study area. GWR models were developed separately for Denton County, Collin County, and the two counties combined, using the 03/04/2000 classified data. The GWR models based on the 03/04/2000 image were then used to estimate population from the 03/12/2000 classified image. In this analysis, adaptive kernels are used, and the bandwidth is determined using cross validation (CV). Figures 20 to 22 show scatter diagrams of the relative population estimation error obtained from the GWR models against population density. For better graphical representation, extreme values, which account for only one percent of the data, were removed from the scatter plots (Figures 21 and 22).

Table 11

Summary of Geographically Weighted Regression Model Results

Study area                R2             Total error (%)   Mean error   Median error
Denton (03/04)            Local models    0.04              38.23        25.00
Denton (03/12)            Local models    7.1               44.37        29.84
Collin (03/04)            Local models    0.61             176.11        18.10
Collin (03/12)            Local models   -5.53             133.16        20.17
Combine (03/04)           Local models   -0.46             133.10        21.18
Combine (03/12)           Local models   -0.77             111.47        23.45
Denton (Combine-03/04)    Local models   -2.0               37.01        26.24
Collin (Combine-03/04)    Local models    1.16             197.49        19.32
Denton (Combine-03/12)    Local models    3.61              42.05        26.60
Collin (Combine-03/12)    Local models   -4.62             157.99        20.91

Figure 20. Scatter diagrams of relative population estimation error (%) vs. population density (persons per sq. km) for the GWR model (Denton County). (a) GWR Denton County (03/04); (b) GWR Denton County (03/12).

Figure 21. Scatter diagrams of relative population estimation error (%) vs. population density (persons per sq. km) for the GWR model (Collin County). (a) GWR Collin County (03/04); (b) GWR Collin County (03/12).

Figure 22. Scatter diagrams of relative population estimation error vs. population density for the GWR model (Denton and Collin County). (a) GWR Denton and Collin County (03/04); (b) GWR Denton and Collin County (03/04).

Discussion

A few observations and limitations encountered while producing the above results are discussed below.

1. Knowledge-based vs. MLC:

Both the MLC and the knowledge-based models produced high overall accuracy for their respective images. However, closer examination of the error matrices (Tables 2, 3, 4) showed that the knowledge-based models discriminate residential pixels more accurately than MLC. In addition, MLC requires training samples for image classification, whereas the knowledge-based approach classifies the image without training samples; as a result, knowledge-based models simplify image classification compared with MLC.

Impervious surface knowledge-based model:

Impervious surface is an important attribute closely associated with urban ecosystems and can be very useful for developing a knowledge-based model. However, an impervious surface layer was not readily available for the period of study (the impervious layer available through USGS for this research is from 2001). On the other hand, the knowledge-based model developed using only TM spectral attributes produced similar accuracy. The error matrices (Tables 3, 4) showed that the knowledge-based model derived from spectral indices discriminates residential pixels more precisely than the knowledge-based model derived from the original image bands and impervious surface data.

2. Spectral knowledge-based model:

The model discriminated the residential land use type with high accuracy from the Landsat TM image. However, it struggled to extract residential pixels accurately in sparsely populated neighborhoods. Because the classification model is based exclusively on the spectral response of TM bands, in thinly populated regions the spectral response of surrounding features dominates that of residential pixels and causes classification errors (Figure 23). Another principal factor affecting classification accuracy in sparsely populated areas is the medium spatial resolution (30 m × 30 m) of Landsat TM, which makes it difficult to extract residential areas where population density is low.

(a) Aerial image (b) TM image (c) Classified TM image

Figure 23. Sparsely populated region on aerial, Landsat TM and classified image.

3. The spectral response of TM band 4 (NIR) is affected by the moisture content of the surroundings, which may affect the threshold values defined for indices used in the models, such as NDVI and NDBI. Likewise, Zha et al. (2003) argue that the consistency of the NDBI for extracting residential built-up areas might be indirectly affected by the presence of other land use types that exhibit seasonal spectral responses in TM bands, such as forests and soils. However, this limitation may be overcome by selecting remotely sensed data captured when the spectral discrimination between surrounding features and residential built-up areas is higher.

4. The developed classification model uses only TM spectral properties to separate land use types from each other. One limitation of this approach is that it is difficult to distinguish residential built-up areas from lake shores or sandy beaches (where sand and silt concentrations are high), because these features exhibit similar spectral responses in all TM bands (Figure 24).

(a) Aerial image (b) TM image (c) Classified TM image

Figure 24. Lake shore (sandy beach) and residential built-up on aerial, TM, and classified image.

5. Most of the older residential neighborhoods in the study area are
surrounded by tree canopies. Therefore, images captured during the
leaf-off season are more likely to produce good classification results,
because residential areas can be delineated more precisely when fewer
features are covered by tree canopies. In addition, the image acquired on
03/12/2000 was affected by jet trails in the sky, and the resulting
errors were removed manually.

6. The high R² values for both the linear and GWR models suggest that
population is strongly correlated with residential built-up.

7. Comparison of the mean relative error (MRE) and median relative error
(MdRE) in Tables 7, 9, and 11 indicates that the mean is highly affected
by extreme values, especially in Collin County.

8. Scatter diagrams produced to analyze the relationship between relative
error and population density (Figures 14, 15, 16, 18, 19, 20, 21, and 22)
indicate that population is often overestimated when population density
at the census block group level is less than approximately 300 persons
per square kilometer, whereas population is always underestimated when
population density is greater than approximately 3000 persons per square
kilometer, regardless of the independent variable. The magnitude of error
is very high where population is underestimated, and this mainly occurs
in block groups with high population density. Tables 6, 8, 10, and 11
summarize the performance of the linear and GWR models for Denton County,
Collin County, and both counties combined.

9. Similarly, for census block groups with a low population density (for
example, less than 100 persons per square kilometer), the relative error
of population estimation expressed as a percentage can be misleading. For
example, for a census block group with an actual population of 5 and an
estimated population of 25, the relative error is 400%, but the actual
error of 20 persons may be insignificant compared with the total
population of the county.

10. The knowledge-based model uses the spectral reflectance of the TM
bands and can potentially be applied to extract urban land use types in
developing countries where alternative sources such as census and other
demographic records (e.g., birth, death, and migration records) are not
readily available or not reliable. In addition, the model is flexible:
new explicit rules can be defined, and the threshold values can be
adjusted to suit the characteristics of the urban environment.

Because census or other sources of demographic information may not be
readily available or reliable in many developing countries, sampling
regions covering approximately 15 to 20 percent of the study area could
be defined, and population data could be collected within them using
random sampling techniques. Linear and GWR models can then be created
from these samples to estimate population across the study area.
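
The contrast drawn in items 7 to 9 between the mean and median relative error, and between block-group and county-level error, can be illustrated with a short Python sketch. The block-group values below are hypothetical (only the 5 versus 25 case comes from item 9), the function names are illustrative, and MRE and MdRE are assumed here to be the mean and median of the absolute relative errors.

```python
from statistics import mean, median

def relative_errors(actual, estimated):
    """Signed relative error (%) per block group: 100 * (est - act) / act."""
    return [100.0 * (e - a) / a for a, e in zip(actual, estimated)]

def summarize(actual, estimated):
    """Return MRE, MdRE and the county-level (total) relative error in %."""
    abs_re = [abs(r) for r in relative_errors(actual, estimated)]
    total_actual = sum(actual)
    total_estimated = sum(estimated)
    return {
        "MRE": mean(abs_re),     # assumed: mean of absolute relative errors
        "MdRE": median(abs_re),  # assumed: median of absolute relative errors
        "total_error_pct": 100.0 * (total_estimated - total_actual) / total_actual,
    }

# Hypothetical block groups; the first one is the 5 vs 25 case from item 9.
actual = [5, 1200, 1500, 2000, 4000]
estimated = [25, 1260, 1425, 2100, 3600]

summary = summarize(actual, estimated)
print(summary)
```

With these made-up values, the single 400% block group pulls the mean up to 85% while the median stays at 5%, and the signed errors largely cancel in the county total (about -3.4%).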

CHAPTER 5

CONCLUSION

The knowledge-based classification model successfully extracted
residential land-use areas from remotely sensed data by applying rules
and conditions defined by spectral response features over the study area.
For each rule, threshold values were defined and conditions were set to
extract classification information from the input images. In defining the
rules, several spectral indices were combined to overcome the limitations
of any single index and to achieve higher classification accuracy than a
single index could provide.
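
The idea of combining several index thresholds into one rule set can be sketched as follows. This is a minimal illustration, not the rule set used in this study: the threshold values, the rule order, and the function names are hypothetical placeholders, and the index values are assumed to have been computed for each pixel beforehand.

```python
# Hypothetical thresholds; real values would be tuned for the scene.
T_WATER = 0.0  # MNDWI above this is treated as water
T_VEG = 0.3    # NDVI above this is treated as vegetation
T_BUILT = 0.0  # NDBI above this suggests built-up land

def classify_pixel(ndvi, ndbi, mndwi):
    """Label one pixel by applying the rules in a fixed order."""
    if mndwi > T_WATER:
        return "water"
    if ndvi > T_VEG:
        return "vegetation"
    # A pixel is residential only when the built-up index is high AND it
    # has already failed the water and vegetation rules above.
    if ndbi > T_BUILT:
        return "residential"
    return "other"

def classify_image(ndvi, ndbi, mndwi):
    """Apply classify_pixel to 2-D lists of index values of equal shape."""
    return [
        [classify_pixel(v, b, w) for v, b, w in zip(row_v, row_b, row_w)]
        for row_v, row_b, row_w in zip(ndvi, ndbi, mndwi)
    ]

# Two example pixels: both have a positive NDBI, but the second is vegetated.
labels = classify_image([[0.1, 0.6]], [[0.2, 0.2]], [[-0.3, -0.3]])
print(labels)
```

Because the rules are explicit, no training samples are involved: changing a threshold or adding a rule changes the output deterministically.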

Compared with traditional land use classification approaches, the
knowledge-based classification approach is more efficient and can produce
more accurate results. It is completely automated once the threshold
values for the conditions are defined, and it classifies remotely sensed
images without training samples, which makes the results more consistent.
Therefore, this model can be a useful alternative for researchers and
planners who need to map residential built-up quickly.

Landsat TM images contain rich spectral information at medium spatial
resolution, which is used to define various spectral indices such as
NDVI, MNDWI, NDBI, NDBBI, and WI. These indices are useful for
delineating the major land-use types of the urban environment, such as
soil, water, vegetation, and residential areas. The total classification
accuracy of over 95% is better than the accuracy produced by general
classification methods such as maximum likelihood classification (MLC)
and other supervised and unsupervised classification methods.
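
As an illustration of how such indices are derived from the TM bands, the sketch below computes three of them as normalized differences. It assumes the commonly used definitions (NDVI from bands 4 and 3, NDBI from bands 5 and 4, MNDWI from bands 2 and 5) and per-pixel reflectance values; NDBBI and WI are omitted because their formulas are not restated in this chapter. The function names and sample reflectances are hypothetical.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0

def spectral_indices(tm2, tm3, tm4, tm5):
    """Compute NDVI, NDBI and MNDWI for one pixel from TM band values."""
    return {
        "NDVI": normalized_difference(tm4, tm3),   # near infrared vs red
        "NDBI": normalized_difference(tm5, tm4),   # mid infrared vs near infrared
        "MNDWI": normalized_difference(tm2, tm5),  # green vs mid infrared
    }

# Hypothetical reflectances for a vegetated pixel and a built-up pixel.
vegetated = spectral_indices(tm2=0.08, tm3=0.05, tm4=0.45, tm5=0.20)
built_up = spectral_indices(tm2=0.12, tm3=0.14, tm4=0.20, tm5=0.28)
print(vegetated)
print(built_up)
```

In this example the vegetated pixel has a high NDVI and a negative NDBI, while the built-up pixel has a positive NDBI, which is the kind of contrast a threshold-based rule relies on.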

A major limitation of this model is the difficulty of separating bare
earth (e.g., sandy beaches) from residential built-up, because the two
have similar spectral responses in all TM bands. In fact, this is a major
challenge for Landsat TM data in general. Ancillary data such as light
detection and ranging (LiDAR) could help overcome this challenge, as it
can provide building height information to separate sandy beaches from
residential land use. Soils have a wide range of spectral responses in
the TM bands; furthermore, the moisture content of the soil strongly
influences the spectral response in all TM bands and thus can affect the
threshold values defined for the classification model. The model's
ability to extract residential pixels from sparsely populated areas is
also limited by the spatial resolution of Landsat TM images and by the
similarity of the spectral response of background features.

The population estimation results show that the total accuracy of
population estimation in the study area is controlled by the sign and
magnitude of the relative errors at the census block-group level.
Furthermore, the median absolute relative errors calculated for the
models suggest that the GWR models outperformed the linear regression
models because GWR better incorporates spatial heterogeneity. However,
both the linear regression and GWR methods underestimated the population
count in census block groups with high population density and
overestimated it in census block groups with low population density.
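
To make the difference between the two regression approaches concrete, the sketch below fits a GWR model with a single predictor as a set of locally weighted least-squares fits, one per block-group location, using a Gaussian distance kernel. It is a simplified illustration under stated assumptions, not the software procedure used in this study: the fixed bandwidth, the coordinates, the sample values, and the function names are all hypothetical.

```python
import math

def weighted_line_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    denom = sw * swxx - swx * swx
    if denom == 0:
        raise ValueError("predictor has no weighted variance")
    b1 = (sw * swxy - swx * swy) / denom
    b0 = (swy - b1 * swx) / sw
    return b0, b1

def ols_fit(x, y):
    """Global linear regression: every observation gets weight 1."""
    return weighted_line_fit(x, y, [1.0] * len(x))

def gwr_fit(coords, x, y, bandwidth):
    """Fit one local regression per location with Gaussian distance weights."""
    results = []
    for cx, cy in coords:
        w = [
            math.exp(-0.5 * (math.hypot(px - cx, py - cy) / bandwidth) ** 2)
            for px, py in coords
        ]
        results.append(weighted_line_fit(x, y, w))
    return results

# Hypothetical data: a western cluster where population grows by 1 person
# per residential pixel and an eastern cluster where it grows by 3.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
pixels = [10, 20, 30, 10, 20, 30]
population = [10, 20, 30, 30, 60, 90]

global_b0, global_b1 = ols_fit(pixels, population)
local = gwr_fit(coords, pixels, population, bandwidth=2.0)
print(global_b1, local[0][1], local[-1][1])
```

In this made-up example the global fit returns a single slope of 2, whereas the local fits recover slopes close to 1 in the western cluster and close to 3 in the eastern cluster, which is the kind of spatial heterogeneity a single global model averages out.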

Future research should improve the knowledge-based model by incorporating
other available spatial datasets to increase the accuracy of the
classification process, and should explore the performance of linear
regression and GWR models more thoroughly. Additional measures may be
required to better examine spatial dependence and spatial heterogeneity
in population estimation using remotely sensed data.

REFERENCES

ArcGIS (Version 9.3) [Computer software]. Redlands, CA: ESRI.

Baraldi, A., Puzzolo, V., Blonda, P., Bruzzone, L., & Tarantino, C. (2006).
Automatic spectral rule-based preliminary mapping of calibrated Landsat TM
and ETM+ images. IEEE Transactions on Geoscience and Remote Sensing,
44(9), 2563-2586.

Chander, G., & Markham, B. (2003). Revised Landsat-5 TM radiometric
calibration procedures and postcalibration dynamic ranges. IEEE
Transactions on Geoscience and Remote Sensing, 41(11), 2674-2677.

Charlton, M., Fotheringham, S., & Brunsdon, C. (02/02/2009). NCRM methods
review papers, NCRM/006. Geographically weighted regression. Discussion
paper. Unpublished manuscript.

Collins, W. G., & El-Beik, A. H. A. (1971). Population census with the aid of
aerial photograph: An experiment in the city of Leeds. Photogrammetric
Record, 7(37), 16-26.

Congalton, R. G., & Green, K. (1999). Assessing the accuracy of remotely

sensed data: Principles and practices. Boca Raton, FL: Lewis Publishers.

Crist, E. P., Laurin, R., & Cicone, R. C. (1986). Vegetation and soil information

contained in transformed Thematic Mapper data. International Geoscience

and Remote Sensing Symposium, 2, 1465-1470.

ERDAS IMAGINE (Version 9.3) [Computer software]. Norcross, GA: ERDAS.

Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002).

Geographically weighted regression: The analysis of spatially varying

relationships. Chichester, UK: John Wiley & Sons.

Green, N. A. (1956). Aerial photographic analysis of residential neighborhoods:
An evaluation of data accuracy. Social Forces, 35(2), 142-147.

Hardin, P. J., Jackson, M. W., & Shumway, J. M. (2007). Intraurban population
estimation using remotely sensed imagery. Geo-Spatial Technologies in
Urban Environments: Policy, Practice and Pixels, 47-92.

Harvey, J. T. (2000). Small area population estimation using satellite imagery.

Statistics in Transition, 4(4), 611-633.

Harvey, J. T. (2002a). Estimating census district populations from satellite

imagery: Some approaches and limitations. International Journal of Remote

Sensing, 23(10), 2071-2095.

Harvey, J. T. (2002b). Population estimation models based on individual TM

pixels. Photogrammetric Engineering and Remote Sensing, 68(11), 1181-

1192.

Harvey, J. T. (2003). Population estimation at the pixel level: Developing the

expectation maximization technique. In V. Mesev (Ed.), Remotely sensed

cities (pp. 181-205). London, UK and New York, NY: Taylor and Francis.

Ji, M., & Jensen, J. R. (1999). Effectiveness of subpixel analysis in detecting and

quantifying urban imperviousness from Landsat Thematic Mapper imagery.

Geocarto International, 14(4), 31-39.

Langford, M., Maguire, D. J., & Unwin, D. J. (1991). The areal interpolation
problem: Estimating population using remote sensing in a GIS framework. In
I. Masser & M. Blakemore (Eds.), Handling geographical information:
Methodology and potential applications (pp. 55-77). New York: Wiley.

Lee, E. S., & Goldsmith, H. F. (1982). Population estimates methods for small

area analysis. Beverly Hills, CA: SAGE Publications, Inc.

Li, G., & Weng, Q. (2005). Using Landsat ETM+ imagery to measure population

density in Indianapolis, Indiana, USA. Photogrammetric Engineering and

Remote Sensing, 71(8), 947-958.

Lisaka, J., & Hegedus, E. (1982). Population estimation from Landsat imagery.
Remote Sensing of Environment, 12(4), 259-272.

Lo, C. P. (1986a). Accuracy of population estimation from medium-scale aerial
photography. Photogrammetric Engineering and Remote Sensing, 52(12),
1859-1869.

Lo, C. P. (1986b). Applied remote sensing. New York: Longman.

Lo, C. P. (1995). Automated population and dwelling unit estimation from high

resolution satellite images: A GIS approach. International Journal of Remote

Sensing, 16(1), 17-34.

Lo, C. P. (2003). Zone-based estimation of population and housing units from

satellite generated land use/land cover maps. In V. Mesev (Ed.), Remotely

sensed cities (pp. 157). London and New York: Taylor & Francis.

Lo, C. P. (2008). Population estimation using geographically weighted
regression. GIScience & Remote Sensing, 45(2), 131-148.

Lo, C. P., & Welch, R. (1977). Chinese urban population estimation. Annals of

the Association of American Geographers, 67(2), 246-253.

Lu, D., Weng, Q., & Li, G. (2006). Residential population estimation using a

remote sensing derived impervious surface approach. International Journal

of Remote Sensing, 27(16), 3553-3570.

Mennis, J. (2006). Mapping the results of geographically weighted regression.

Cartographic Journal, 43(2), 171-179.

Qiao, Y., Liu, H. P., Bai, M., Wang, X. D., & Zhou, X. Z. (2009). The decision tree
algorithm of urban extraction from multi-source image data. Virtual Changing
Globe for Visualisation and Analysis, Wuhan, China, 38.

Qiu, F., Woller, K. L., & Briggs, R. (2003). Modeling urban population growth using

remotely sensed imagery and TIGER GIS road data. Photogrammetric Engineering

and Remote Sensing, 69(9), 1031-1042.

Ridd, M. K. (1995). Exploring a V-I-S (vegetation-impervious surface-soil) model for

urban ecosystem analysis through remote sensing: Comparative anatomy for cities.

International Journal of Remote Sensing, 16(12), 2165-2185.

Sutton, P., Roberts, D., Elvidge, C. D., & Baugh, K. (2001). Census from heaven: An
estimate of the global human population using night-time satellite imagery.
International Journal of Remote Sensing, 22(16), 3061-3076.

Sutton, P., Roberts, D., Elvidge, C. D., & Meij, H. (1997). A comparison of nighttime
satellite imagery and population density for the continental United States.
Photogrammetric Engineering and Remote Sensing, 63(11), 1303-1313.

Todd, S. W., Hoffer, R. M., & Milchunas, D. G. (1998). Biomass estimation on

grazed and ungrazed rangelands using spectral indices. International

Journal of Remote Sensing, 19(3), 427-438.

UNCHS. (2001). The state of the world’s cities. Nairobi, Kenya: United Nations

Center for Human Settlements (Habitat).

Wu, C., & Murray, A. T. (2005). A cokriging method for estimating population

density in urban areas. Computers, Environment and Urban Systems, 29(5),

558-579.

Wu, S., Qiu, X., & Wang, L. (2005). Population estimation methods in GIS and

remote sensing: A review. GIScience and Remote Sensing, 42(1), 58-74.

Zha, Y., Gao, J., & Ni, S. (2003). Use of normalized difference built-up index in

automatically mapping urban areas from TM imagery. International Journal

of Remote Sensing, 24(3), 583-594.
