Uppsala University
Department of Statistics
Bachelor Thesis
Spring 2018
Modelling Migration
An evaluation of existing spatial interaction models and decay
functions on municipality level in Sweden
Klara Hvarfner and Teodor Sandell
Supervisors: John Östh and Marina Toger
Abstract
This study examines which distance decay parameter and spatial interaction model should be
used when studying migration on municipality level in Sweden. The data used consists of
100000 randomly sampled observations from the Uppsala University database PLACE from
the years 2013-2014. The decay functions that are examined are exponential decay, power
decay and exponential normal decay. The interaction models that are used are the
unconstrained, doubly constrained and half-life model. The different functions and models
were evaluated using RMSE and Pearson’s correlation. The results show that the power
function and doubly constrained model most accurately estimated the flows of migration.
However, all of the model types can be considered preferable depending on how the data are
constructed and what the objective of the study is.
Keywords: Migration, Spatial interaction model, Unconstrained, Half-life, Doubly
constrained, Distance decay.
Acknowledgements
We would like to express our gratitude to our supervisors John Östh and Marina Toger from
the department of Social and Economic Geography at Uppsala University. Thank you for
always having your doors open and guiding us through an interesting and new subject.
Table of Contents
1. Introduction
  1.1. Spatial Interaction Modelling
  1.2. Background
  1.3. Purpose and Research Question
2. Data
  2.1. Variables
    Distance
    The Flow Variables
3. Method
  3.1. The Distance Decay Functions
  3.2. The Unconstrained Model
    OLS Assumptions
  3.3. The Doubly Constrained Model
  3.4. The Half-Life Model
  3.5. Goodness of Fit Measures
  3.6. Limitations of Spatial Analysis
    MAUP
    Spatial Autocorrelation
4. Results
  4.1. The Unconstrained Model
    Regression Assumption Evaluation
  4.2. The Doubly Constrained Model
  4.3. The Half-Life Model
  4.4. Decay Parameter Evaluation
5. Discussion
6. Conclusions
7. References
8. Appendix
  8.1. The Unconstrained Exponential Decay Model
  8.2. The Unconstrained Power Decay Model
  8.3. Moran’s Index for the Unconstrained Models
1. Introduction
Within human geography, flows of goods, information and people have long been analyzed through
spatial interaction analysis. Migration, the movement of people, has been of particular
interest since it affects several societal functions. Understanding migration matters for
urban planning, for example when more housing is needed or when there is a risk of
depopulation. Such issues are easier to prevent or prepare
for if the flows of migration can be predicted. One of the first to study migration was Ernst
Georg Ravenstein, who in The Laws of Migration (1885) stated seven
rules to explain who migrates and why (Ravenstein, 1885). Since then,
researchers have tried to model migration and understand its “laws”.
Spatial interaction is, simply put, flows between locations in geographic space. In order to
understand migration or spatial interaction in general it is important to understand how spatial
separation, or distance, works as a deterrent to interaction (Fotheringham, 1980). To measure
how distance works as a deterrent power in spatial interaction analysis there are several
different spatial interaction models (SIM) that can be used and within them several different
distance decay functions. Distance decay can be defined as: “The rate at which the volume of
interaction decreases as the distance over which the interaction is taking place increases, ceteris
paribus” (Fotheringham, 1980, p.2). In practice, this means that flows tend to travel the shortest
possible distance i.e. the principle of least effort.
1.1. Spatial Interaction Modelling
The most widely used model for studying spatial interaction is the gravity model which has
been used for a long time in order to understand, analyze and predict flows in geographic space
such as migration. The gravity model stems from an analogy with Newton’s law of gravitation between planets
as a function of mass and distance and is now used in different areas of social science as well
as other fields (Haynes and Fotheringham, 1984). Part of its success is due to three facts: its
intuitive consistency with migration theories, the ease of estimating it and its goodness of fit
(Poot et al. 2016). The model has the following form:
$$T_{ij} = K\,O_i\,D_j\,f(d_{ij}) \qquad (1)$$
Where model (1) explains the number of flows between an origin and destination as a function
of repulsion, attraction and distance. From this general form, there are several more advanced
versions of the model that are explained thoroughly in the method section.
1.2. Background
In “A new way of determining distance decay parameters in spatial interaction models with
application to job accessibility analysis in Sweden” by Östh, Reggiani and Lyhagen three
different techniques for calculating distance decay parameters are examined and compared.
The more commonly used unconstrained and doubly constrained spatial interaction models are
compared with the half-life model, which is a relatively new model in spatial interaction
analysis. Five different decay functions are used: exponential decay, power decay, exponential
normal decay, exponential square root decay and the log normal decay. Two empirical
applications related to job accessibility in Sweden are made in order to compare the different
decay functions. The first example is a smaller dataset studying job accessibility flows on
municipality level and the second is a more disaggregated dataset with 5 km × 5 km squares. By
comparing RMSE and Pearson’s correlation for the parameters estimated with the different
techniques, conclusions were drawn about which estimated parameters worked best. For smaller
to midsized datasets the doubly constrained spatial interaction models worked better. Half-life
models and unconstrained models behaved in a similar way, but when using larger datasets,
the doubly constrained models became impossible to estimate while the half-life models
became more accurate (Östh et al. 2016).
The authors state that since half-life models can be used for several different decay functions
it can be assumed that they can be useful in both long- and short-span trips in studies of
accessibility (Östh et al. 2016). One important long-span type of flow is migration, which in
this study is explored through the same three types of spatial interaction models as in
Östh et al. (2016).
1.3. Purpose and Research Question
The aim of this study is to examine which decay function and spatial interaction model fits best
for examining migration within Sweden. This is done by applying three of the most
commonly used decay functions to unconstrained and doubly constrained spatial interaction
models, as well as to a half-life model, on individual data aggregated to municipality level in
Sweden. The research question the study attempts to answer is therefore the following:
Which spatial interaction model and what decay function should be used when studying
migration flows on municipality level in Sweden?
2. Data
The data used in the study are from the database PLACE 2013-2014 which is a subsample from
Statistics Sweden’s database LISA (Longitudinell integrationsdatabas för Sjukförsäkrings- och
Arbetsmarknadsstudier).¹ The sample used in this study contains 100000 observations which
were randomly sampled from PLACE containing roughly 10 million observations of Swedish
citizens.
Out of the total 100000 observations in the sample, only the 12612 individuals that actually
migrated between 2013 and 2014 are studied. Since the aim is to study migration
on municipality level, the data are aggregated to $290^2 = 84100$ observations, representing
flows between every origin-destination pair of municipalities. The variables needed are the number of migrational
inflows and outflows to each municipality, distances from each of the municipalities to each of
the municipalities as well as the total number of interactions between all of the municipalities.
Naturally, there are many observations with missing values since the number of observations
is far larger than the number of individuals in the data; these observations are set to zero.
2.1. Variables
The original dataset contains 17 different variables with information about where the
individuals lived and worked in 2013 and 2014, as well as if they moved and properties such
as gender, whether the individuals were rich or poor etc. The variables needed in the study to
create the models are: distance, outflows, inflows and total number of interactions. These
variables are further explained in the coming sections.
1 The data was generously provided by our supervisor John Östh from the Department of Social and Economic
Geography at Uppsala University.
Distance
The measure of distance used is the Cartesian (straight-line) distance, which is calculated by
measuring the distance between the coordinates of where an individual lived in 2013 and in 2014,
and it is given in meters. Using the Cartesian distance instead of the actual transport distance²
can be criticized for not being completely accurate. However, the available data contain no
information related to travel distance other than the coordinates, hence the
Cartesian distance is the only available option. More precisely, the coordinate system
used is a projected coordinate system called RT90 (Lantmäteriet, 2018). The distance between
two positions is then calculated by the distance formula stemming from Pythagoras’ theorem,
stated below:
$$\text{Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \qquad (2)$$
The distances actually used are between municipalities, aggregated from the individual
data, and it is therefore important to choose an appropriate position within each zone
from which the distance is measured. This point is determined by calculating the average position for
each municipality’s inflow and for each municipality’s outflow, i.e. the average point people in
each municipality migrate from as well as to. This might cause some irregularities, e.g. if the
average position of a zone ends up in a lake. Such irregularities are one of the downsides of
aggregating the data, but the method is nonetheless the one that captures the most information. Another way of
determining the positions of inflows and outflows would have been to use the median
coordinates. This would perhaps yield more accurate positions, but the difference this would
make for the results can be considered negligible.
The variable itself is calculated for each pair of zones, meaning that there are in total
$290^2 = 84100$ distances. It is calculated by using two vectors and creating a matrix.
2 A more complex way of calculating distance that takes into account existing transportation infrastructure.
(Rodrigue et al. 2017)
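$$
\left|\begin{array}{ccc}
d & O & D \\
1 & 1 & 1 \\
\vdots & \vdots & \vdots \\
n & 1 & n \\
\vdots & \vdots & \vdots \\
n^2 - n + 1 & n & 1 \\
\vdots & \vdots & \vdots \\
n^2 & n & n
\end{array}\right| \qquad (3)
$$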
Matrix (3) shows how the distances for all 𝑖-to-𝑗 pairs are calculated, where 𝑑 represents
distance, O represents origin 𝑖 and D represents destination 𝑗. The last 𝑛 elements represent
the distance from the 𝑛:th origin to the destinations 1 to 𝑛.
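The row layout of matrix (3) can be sketched in a few lines of code. This is a minimal illustration in Python and not the code used in the study (which relied on SAS, RStudio and Matlab); the function name `distance_table` and the toy coordinates are hypothetical, and separate origin and destination points are assumed for each zone, as described above.

```python
import math

def distance_table(origins, destinations):
    """Return rows (row_number, i, j, d_ij) in the order of matrix (3).

    origins[i-1] and destinations[j-1] are hypothetical (x, y) positions
    in a projected coordinate system, given in meters.
    """
    n = len(origins)
    rows = []
    for i, (x1, y1) in enumerate(origins, start=1):
        for j, (x2, y2) in enumerate(destinations, start=1):
            d = math.hypot(x2 - x1, y2 - y1)      # equation (2)
            rows.append(((i - 1) * n + j, i, j, d))
    return rows

# Toy example with n = 3 zones (made-up coordinates).
origins = [(0.0, 0.0), (3000.0, 0.0), (0.0, 4000.0)]
destinations = [(0.0, 100.0), (3000.0, 100.0), (0.0, 4100.0)]
table = distance_table(origins, destinations)
```

With n zones the table has n² rows, and row n² − n + 1 is the first row of the last origin, which matches the layout of matrix (3).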
Out of the 84100 distances only 2519 contain actual flows; the rest are
zero interactions and these distances are ignored. This was expected since there are, as
previously mentioned, only 12612 individuals in the data.
Differences in distance on individual and municipal level
In Figure 2.1.1. the distribution of distance of migration on municipality level is shown. In the
figure, the distance decay effect on migration is visualized. In Figure 2.1.2. the distribution of
distance on the individual level data is shown. The two figures differ in how rapid the decaying
effect is. The difference in distribution of distance in the two different datasets is also shown
in Table 2.1.1. The differences are a result of aggregating the data.
Table 2.1.1: Properties of distance on municipality level and individual level data, in meters

Distance            Minimum    Median     Mean        Maximum
Municipality data   60.4       73193.4    163378.3    1263211.0
Individual data     100.0      3700.0     44759.8     1373085.7
Figure 2.1.1: Distribution of distance on municipality level data in meters
Figure 2.1.2: Distribution of distance on individual level data in meters
The Flow Variables
The gravity models require three different types of flow variables. The first two are the Origin and
Destination variables, where $\text{Origin}_i$ is the number of outflows from each municipality $i$,
representing the emission of the municipality, and $\text{Destination}_j$ is the number of inflows to
each municipality $j$, representing the attraction of each municipality.
The third flow variable is $T_{ij}$, which is the total number of interactions between each $i$ and $j$.
$T_{ij}$ can be visualized as a matrix with $O_i$ vertically and $D_j$ horizontally. Since the total number
of municipalities in Sweden is 290, $O_i$ and $D_j$ each consist of 290 different values, meaning that $T_{ij}$
is a matrix of $290^2 = 84100$ values. However, as previously mentioned, only 2519 of these
cells have values that are not 0. This means that only 2519 of the $T_{ij}$ observations are used
when estimations are conducted.
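The aggregation from individual moves to the three flow variables can be illustrated as follows. This is a hedged sketch in Python rather than the procedure actually used in the study; the function name `aggregate_flows` and the zone labels are hypothetical.

```python
from collections import Counter

def aggregate_flows(moves):
    """Aggregate individual moves into T_ij, O_i and D_j.

    moves: iterable of (origin_id, destination_id), one pair per migrant.
    """
    T = Counter(moves)      # T_ij; cells without any flow are simply absent
    O = Counter()           # outflows from each origin i
    D = Counter()           # inflows to each destination j
    for (i, j), count in T.items():
        O[i] += count
        D[j] += count
    return T, O, D

# Five hypothetical migrants moving between zones A, B and C.
moves = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "A"), ("C", "C")]
T, O, D = aggregate_flows(moves)
```

Only the non-empty cells are stored, which mirrors the fact that only the cells with actual flows enter the estimations.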
3. Method
This section explains how the study is conducted. The software packages SAS, ArcGIS³
and RStudio are used to analyze and process the data. Some calculations have been carried out
in Matlab.
3.1. The Distance Decay Functions
The beta parameter, or the decay parameter, is calculated differently depending on which
spatial interaction model is being used. For the unconstrained model the parameter is estimated
using a log-log least squares regression. For the doubly constrained model, it is calibrated through
iterations. For the half-life model, it is derived mathematically from an integral function. The
calculations of the decay parameter for the different spatial interaction models are presented in
the following sections.
The types of distance decay functions that are used are listed below. Functions 1-2 are used for
the unconstrained model. The doubly constrained model also uses functions 1-2 but there are
only results from function 1 in this study. For the half-life model functions 1 and 3 are used. The
three decay functions are chosen because they are amongst the most commonly used
within spatial interaction analysis when studying migration. The exponential is considered the
best for short distances and the power function best for long distances (Fotheringham and
O’Kelly, 1989).
3 ArcGIS is a licensed product to which we did not have access; our supervisor, John
Östh, provided help in using it for Figures 4.1.1, 8.3.1 and 8.3.2.
However, it is important to remember that not all function types are examined in this study and
that plenty of others are used in practice. Consequently, this study can only claim to
examine which of these three decay functions fits best for migration in Sweden on
municipality level. The type of decay function used in a spatial interaction model is also
often chosen depending on whether short- or long-distance migration is being examined (Hipp
and Boessen, 2017). In this study both types of distances are used at the same time, which might
impact how well the model fits.
(1) Exponential decay
$$f(d_{ij}) = e^{-\beta d_{ij}}$$
(2) Power decay
$$f(d_{ij}) = d_{ij}^{-\beta}$$
(3) Exponential normal decay
$$f(d_{ij}) = e^{-\beta d_{ij}^2}$$
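The three decay functions can be written directly as code. The following is a minimal Python sketch for illustration only; the function names are our own labels and the study itself did not use Python.

```python
import math

def exponential_decay(d, beta):
    """Function (1): f(d) = exp(-beta * d)."""
    return math.exp(-beta * d)

def power_decay(d, beta):
    """Function (2): f(d) = d ** (-beta); requires d > 0."""
    return d ** (-beta)

def exponential_normal_decay(d, beta):
    """Function (3): f(d) = exp(-beta * d ** 2)."""
    return math.exp(-beta * d * d)
```

For a positive beta all three functions decrease as the distance grows, but the power function is undefined at a distance of exactly zero, which is worth keeping in mind for within-zone flows.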
3.2. The Unconstrained Model
The unconstrained spatial interaction model has the following general form:
$$T_{ij} = K\,O_i\,D_j\,f(\beta, d_{ij}) \qquad (4)$$
Where $T_{ij}$ is the number of flows between each origin ($O_i$) and destination ($D_j$), in this case
the number of migration interactions. These flows are a function of the number of flows from
$O_i$ and flows to $D_j$ as well as of a distance decay function $f(\beta, d_{ij})$, which takes into account
the deterring effect of distance and is a function of distance ($d$). The beta value is the decay
parameter and is calibrated from equation (4). It determines the migration behavior and
expresses how the probability to migrate changes over distance. The constant $K$ is a result of
the calibration made on real data when estimating the model and works as a scaling factor. $K$
is set to 1 since this enables comparisons of different models (Östh et al. 2016). The decay functions
that are used with this particular SIM are the power decay (2) and the exponential decay (1)
functions.
The unconstrained model is estimated as a log-log least squares regression model, where the
dependent variable is expressed as $\ln\left(\frac{T_{ij}}{O_i D_j}\right)$ and the independent variable is expressed as $d_{ij}$ in
the exponential decay model and $\ln(d_{ij})$ in the power decay model. When the decay parameter
has been estimated the model can be used to calculate the estimated number of total interactions
$\hat{T}_{ij}$, which can be compared to the actual number of total interactions $T_{ij}$ to evaluate the
performance of the model.
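The regression step can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the SAS procedure used in the study: a simple least squares fit with an intercept is assumed, the function names `ols_slope_intercept` and `estimate_beta` are hypothetical, and the toy flows are synthetic. Since both decay functions carry a minus sign in the exponent, the decay parameter corresponds to the negative of the fitted slope.

```python
import math

def ols_slope_intercept(x, y):
    """Simple least squares fit of y = intercept + slope * x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def estimate_beta(flows, decay="power"):
    """flows: list of (T_ij, O_i, D_j, d_ij) with T_ij > 0 and d_ij > 0."""
    y = [math.log(t / (o * dd)) for t, o, dd, _ in flows]
    if decay == "power":
        x = [math.log(d) for _, _, _, d in flows]   # ln(d_ij)
    else:                                           # exponential decay
        x = [d for _, _, _, d in flows]             # d_ij
    slope, intercept = ols_slope_intercept(x, y)
    return -slope, intercept

# Synthetic flows generated exactly from a power decay with beta = 0.75.
cells = [(10.0, 20.0, 1000.0), (5.0, 8.0, 5000.0),
         (30.0, 12.0, 20000.0), (7.0, 7.0, 80000.0)]
flows = [(o * dd * d ** -0.75, o, dd, d) for o, dd, d in cells]
beta, intercept = estimate_beta(flows, decay="power")
```

On these synthetic flows the fit recovers the beta used to generate them and an intercept of approximately zero, which corresponds to K = 1.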
OLS Assumptions
Before using Ordinary Least Squares regression (OLS) the assumptions have to be checked.
The first one is that the dependent variable has a linear relationship with the independent
variable combined with some sort of error term. The second assumption is that the independent
variable has variation i.e. it is not always the same value. The independent variable also needs
to be non-stochastic, which means that it is not determined by chance. The number of
observations must exceed the number of independent variables. The last four assumptions deal
with the error terms, which need to be independently and identically normally distributed with
zero mean and a common variance (Asteriou and Hall, 2016). A few of these assumptions
are already known to be fulfilled and will therefore not be further discussed. The independent
variable is already known to be varying and not be stochastic as not all people move the exact
same distance and not completely at random. The number of observations also, clearly, exceeds
the number of independent variables. When obtaining the decay parameter through regression
for the unconstrained model, the remaining assumptions will need to be evaluated.
3.3. The Doubly Constrained Model
The doubly constrained model differs from the unconstrained SIM in that the model includes
restrictions on the origin and destination variables. The general form of the model therefore
looks very similar to the unconstrained model and is stated below:
𝑇𝑖𝑗 = 𝐴𝑖𝐵𝑗𝑂𝑖𝐷𝑗𝑓(𝛽, 𝑑𝑖𝑗) (5)
The variables are the same as in the unconstrained model except that the equation has two
additional variables $A_i$ and $B_j$ and that the scaling factor $K$ is excluded. $A_i$ and $B_j$ are the
constraints for the origins and the destinations, ensuring that the totals of the origins and
destinations are predicted correctly. One effect of this is that the balancing factors for
the origins and destinations take into account spatial autocorrelation, something the
unconstrained model does not adjust for (Griffith and Fisher, 2013). One downside of the
model is that the estimated $A_i$ and $B_j$ parameters lack any meaningful interpretation
(Yang et al. 2014). Below, the construction of a doubly constrained model is explained.
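By looking at the following sums
$$\sum_i T_{ij} = \sum_i A_i B_j O_i D_j f(\beta, d_{ij})$$
and the following constraints
$$\sum_i T_{ij} = D_j$$
and
$$\sum_j T_{ij} = O_i$$
it then follows that
$$D_j = D_j B_j \sum_i A_i O_i f(\beta, d_{ij})$$
and it is then possible to solve for $A_i$ and $B_j$:
$$A_i = \left[\sum_j B_j D_j f(\beta, d_{ij})\right]^{-1}$$
$$B_j = \left[\sum_i A_i O_i f(\beta, d_{ij})\right]^{-1}$$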
From the two previous equations, notice that $A_i$ and $B_j$ are interdependent, so the two equations
are solved through iterations. By assigning $B_j$ the value 1 it is possible to solve for $A_i$. After
$A_i$ has been solved the same procedure is repeated but for $B_j$. The process is repeated until the
errors are no longer significant. The errors are calculated by the following formula:
$$E = \sum_i |O_i - O_i^1| + \sum_j |D_j - D_j^1| \qquad (6)$$
$O_i$ is the observed number of outflows from origin $i$ while $O_i^1$ is the calculated number for the same zone.
The same holds for $D_j$ and $D_j^1$, but for destinations. These steps are then repeated until the errors
vanish, that is, until the real and calculated flows converge.
Different values of the decay parameter will result in different flow matrices and since the
parameter is unknown the beta value will be calibrated until it fits the model. This is done by
assigning the decay parameter an arbitrary value. If the iterative process demands too many
iterations it is an indication that the decay parameter needs adjusting. When the iterations are
completed the model has automatically generated the $\hat{T}_{ij}$ estimates.
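The iterative balancing for a fixed decay parameter can be sketched as follows. This is a minimal Python illustration under stated assumptions and not the implementation used in the study: the function name `balance_factors`, the tolerance, the iteration cap and the toy data are all hypothetical, the decay weights are assumed to be strictly positive, and the origin and destination totals are assumed to sum to the same number.

```python
import math

def balance_factors(O, D, f, tol=1e-9, max_iter=1000):
    """Solve for A_i and B_j iteratively and return the fitted flows.

    O, D: observed origin and destination totals (with equal sums).
    f:    matrix of decay weights f(beta, d_ij), all strictly positive.
    """
    n, m = len(O), len(D)
    B = [1.0] * m                      # start from B_j = 1
    A = [1.0] * n
    T = []
    error = float("inf")
    for _ in range(max_iter):
        A = [1.0 / sum(B[j] * D[j] * f[i][j] for j in range(m)) for i in range(n)]
        B = [1.0 / sum(A[i] * O[i] * f[i][j] for i in range(n)) for j in range(m)]
        T = [[A[i] * B[j] * O[i] * D[j] * f[i][j] for j in range(m)] for i in range(n)]
        # Equation (6): absolute deviations of the fitted margins.
        error = (sum(abs(O[i] - sum(T[i])) for i in range(n))
                 + sum(abs(D[j] - sum(T[i][j] for i in range(n))) for j in range(m)))
        if error < tol:
            break
    return T, A, B, error

# Toy example: three zones, exponential decay with an arbitrary beta.
d = [[1.0, 5.0, 3.0], [5.0, 1.0, 4.0], [3.0, 4.0, 1.0]]
beta = 0.3
f = [[math.exp(-beta * dij) for dij in row] for row in d]
O = [10.0, 20.0, 30.0]
D = [25.0, 15.0, 20.0]
T, A, B, error = balance_factors(O, D, f)
```

Calibrating beta itself would wrap this routine in an outer search over candidate values, which is left out of the sketch.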
3.4. The Half-Life Model
The half-life model has the same functional form as the unconstrained model (4), except that
the beta value is obtained through mathematical calculations instead of regressions. The
difference between half-life models and other spatial interaction models is that while most
statistical models try to reduce the deviation from the mean, the half-life models instead use the
median distance. Half of the moving population will have migrated once the median distance
is reached, i.e. the probability of having migrated among the migrating
population is 0.5 at the median distance. Mathematically this means that the median distance
splits the Area Under Curve (AUC) of an integral function describing the
probability to move into two equal halves (Östh et al. 2016).
Using the individual level median distance of migration between municipalities will lead to
relatively large systematic errors (Östh et al. 2016). This is because of the difference in median
migration distance between the individual and the municipality level datasets, as previously
mentioned in Section 2.1.
The decay functions used for the half-life model are: exponential decay and exponential normal
decay. The power decay function cannot be used with the half-life model since it is not
mathematically possible to calculate the AUC of an integral which is asymptotic on the x-axis.
When calculating a decay function for the half-life model the decay function should be
perceived as an integral function which then describes the probability to migrate over all
possible distances (Östh et al. 2016). The derivations for the exponential decay function and
exponential normal decay function below are taken from Östh et al. (2016).
Exponential decay
$$\int_0^\infty e^{-\beta x}\,dx = \frac{1}{\beta}$$
For obtaining half of the AUC we use the following expression:
$$\int_0^m e^{-\beta x}\,dx = \frac{0.5}{\beta}$$
The integral now spans from zero to the median value $m$ instead of infinity and, after normalization by $\beta$,
integrates to 0.5 instead of 1. $m$ will later be replaced by the true value of the median migration
distance. The equation can now be solved for $\beta$.
$$\beta \int_0^m e^{-\beta x}\,dx = 0.5 = 1 - e^{-\beta m}$$
$$0.5 = e^{-\beta m}$$
$$\ln(0.5) = -\beta m$$
$$\beta = -\frac{\ln(0.5)}{m} = \frac{\ln(2)}{m}$$
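Exponential Normal decay
$$\int_0^\infty e^{-\beta x^2}\,dx = \frac{\sqrt{\pi}}{2\sqrt{\beta}}$$
For obtaining half of the AUC we use the following expression:
$$\int_0^m e^{-\beta x^2}\,dx = \frac{0.5\sqrt{\pi}}{2\sqrt{\beta}}$$
$\beta$ can then be solved for in the following way:
$$0.5 = \frac{2\sqrt{\beta}}{\sqrt{\pi}} \cdot \frac{\sqrt{\pi}}{2\sqrt{\beta}}\,\operatorname{erf}(m\sqrt{\beta}) = \operatorname{erf}(m\sqrt{\beta})$$
$$\operatorname{erf}^{-1}(0.5) = \operatorname{erf}^{-1}(\operatorname{erf}(m\sqrt{\beta})) = m\sqrt{\beta}$$
$$\beta = \left(\frac{\operatorname{erf}^{-1}(0.5)}{m}\right)^2$$
$$\beta \approx \left(\frac{0.47693628}{m}\right)^2$$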
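Both closed-form expressions are easy to evaluate numerically. The following Python sketch is for illustration only and is not the calculation carried out in the study; the function names are hypothetical, the inverse error function is approximated by bisection because Python's standard `math` module provides `erf` but no inverse, and the municipality level median from Table 2.1.1 is used merely as an example input.

```python
import math

def beta_exponential(m):
    """Half-life beta for exponential decay: ln(2) / m, i.e. -ln(0.5) / m."""
    return math.log(2.0) / m

def erfinv_bisect(p, lo=0.0, hi=6.0, iters=200):
    """Approximate the inverse error function for 0 <= p < 1 by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if math.erf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def beta_exponential_normal(m):
    """Half-life beta for exponential normal decay: (erfinv(0.5) / m) ** 2."""
    return (erfinv_bisect(0.5) / m) ** 2

m = 73193.4   # median distance in meters, municipality level (Table 2.1.1)
beta_exp = beta_exponential(m)
beta_norm = beta_exponential_normal(m)
```

A quick check of the result is that `math.exp(-beta_exp * m)` and `math.erf(m * math.sqrt(beta_norm))` should both be 0.5 up to rounding.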
Using the calculated beta value, the half-life model can estimate the total number of interactions
$\hat{T}_{ij}$ in the same way as the unconstrained model.
3.5. Goodness of Fit Measures
The goodness of fit of the models is determined by studying and comparing the RMSE and
Pearson’s correlation of the estimated and observed number of total interactions. RMSE and
Pearson’s correlation could indicate different models as being the best since they measure
goodness of fit differently. Whereas RMSE measures the deviation of the estimated values from the
observed ones, correlation measures how the observed and estimated values relate. Since the
two variables can follow a similar pattern, and therefore be highly correlated, without being
close to each other, correlation and RMSE can give different results (Östh et al.
2016).
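The difference between the two measures can be made concrete with a small example. The sketch below is written in Python for illustration and is not the code used in the study; the function names and the toy numbers are hypothetical.

```python
import math

def rmse(observed, estimated):
    """Root mean squared error between observed and estimated flows."""
    n = len(observed)
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(observed, estimated)) / n)

def pearson(observed, estimated):
    """Pearson's correlation between observed and estimated flows."""
    n = len(observed)
    mo = sum(observed) / n
    me = sum(estimated) / n
    cov = sum((o - mo) * (e - me) for o, e in zip(observed, estimated))
    so = math.sqrt(sum((o - mo) ** 2 for o in observed))
    se = math.sqrt(sum((e - me) ** 2 for e in estimated))
    return cov / (so * se)

observed = [1.0, 2.0, 3.0, 4.0]
scaled = [10.0, 20.0, 30.0, 40.0]   # same pattern, but ten times too large
r = pearson(observed, scaled)
err = rmse(observed, scaled)
```

Here the estimates follow the observed pattern perfectly, so the correlation is essentially 1, while the RMSE is large because the level of the estimates is wrong.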
3.6. Limitations of Spatial Analysis
MAUP
The Modifiable Areal Unit Problem (MAUP) deals with how the units used in spatial analysis
affect the results of the analysis. MAUP can be explained as two separate problems. Firstly,
different levels of aggregation of the same data lead to different results, something called the
scale problem. Secondly, zones of equal scale that are divided into different combinations
also lead to different results, which is called the aggregation problem (Openshaw and Taylor,
1979). In this study municipalities are chosen as the spatial units. Another type of division of
space would yield different results. This means that the results cannot be generalized to units
of other sizes than municipalities or to differently combined units of the same size.
Another problem with Swedish municipalities is that they are not equal in size (SCB, 2017).
Unequal unit sizes can cause problems with the analysis as well. Differently sized zones lead
to different results in correlation analysis even though the underlying individual level data are
the same (Robinson, 1956). In the case of Swedish municipalities it is therefore important to
keep in mind these difficulties when interpreting the results. The main differences in
municipality size in Sweden are between larger urban areas and smaller rural municipalities. It
is therefore expected that these difficulties predominantly surface in bigger cities as well as the
sparsely populated areas in northern Sweden.
Spatial Autocorrelation
Spatial autocorrelation is the correlation between values of the same variable through
geographic space making the estimated dependent variable deviate from the true values of the
dependent variable. Positive spatial autocorrelation is common and indicates that units that are
close in geographic space share common properties (Griffith, 2003). Some measures that can
be used to examine the spatial association in a dataset are Global Moran’s Index
(Moran’s I) and the Local Indicators of Spatial Association (LISA). Moran’s I indicates whether there
is any statistically significant autocorrelation present in the models (Li et al. 2007). It is computed
as follows:
$$I = \frac{n}{S_0} \cdot \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{i,j} z_i z_j}{\sum_{i=1}^{n} z_i^2}$$
Where $z_i$ is the average residual for municipality $i$, $w_{i,j}$ is the spatial weight between $i$ and $j$,
which is 1 if the municipalities share a border and 0 otherwise, $S_0$ is the sum of the
spatial weights and $n$ is the number of municipalities. The expected value under the null
hypothesis that there is no spatial autocorrelation is:
$$E(I) = \frac{-1}{n - 1}$$
The z statistic is calculated by the following expression:
$$z_I = \frac{I - E(I)}{\sqrt{V(I)}}$$
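The global index can be computed directly from the formula above. The following Python sketch is an illustration under stated assumptions and not the ArcGIS procedure used in the study: the function names are hypothetical, the residuals are centred on their mean inside the function (the usual formulation, assumed here), and the variance V(I) needed for the z statistic is not reproduced.

```python
def morans_i(z, w):
    """Global Moran's I for values z and an n x n spatial weights matrix w."""
    n = len(z)
    mean = sum(z) / n
    dev = [v - mean for v in z]                    # deviations from the mean
    s0 = sum(sum(row) for row in w)                # sum of all spatial weights
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(v * v for v in dev)
    return (n / s0) * num / den

def expected_i(n):
    """Expected value of Moran's I under no spatial autocorrelation."""
    return -1.0 / (n - 1)

# Four hypothetical zones in a row; neighbours share a border (weight 1).
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
smooth = morans_i([1.0, 2.0, 3.0, 4.0], w)         # similar neighbours
alternating = morans_i([1.0, -1.0, 1.0, -1.0], w)  # dissimilar neighbours
```

In this toy example the smoothly increasing values give a positive index, while the alternating values give a negative one.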
Moran’s I does not, however, give information about where the autocorrelation is; it only
indicates that it is present. LISA is a measurement that detects local autocorrelations and
clusters (Anselin, 1995). Compared to Moran’s I, LISA can highlight the locations of
autocorrelations. It is computed as follows:
$$I_i = \frac{z_i}{S_i^2} \sum_{j=1,\, j \neq i}^{n} w_{i,j} z_j$$
$$S_i^2 = \frac{\sum_{j=1,\, j \neq i}^{n} z_j^2}{n - 1}$$
Where $I_i$ is the local Moran’s I statistic. A positive local Moran’s I value indicates spatial
autocorrelation clusters and negative values indicate that an observation is an outlier. The
expected $I_i$ and $z_{I_i}$ are calculated in the same way as for Moran’s I.
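The local statistic can be sketched in the same way. As before, this is an illustrative Python sketch and not the ArcGIS procedure used in the study; the function name `local_morans_i` and the toy data are hypothetical, and the values are assumed to be centred on their mean.

```python
def local_morans_i(z, w):
    """Local Moran's I for every zone, given values z and weights matrix w."""
    n = len(z)
    mean = sum(z) / n
    dev = [v - mean for v in z]
    result = []
    for i in range(n):
        # S_i^2: mean squared deviation of all zones except i.
        s2 = sum(dev[j] ** 2 for j in range(n) if j != i) / (n - 1)
        # Weighted sum of the neighbours' deviations.
        lag = sum(w[i][j] * dev[j] for j in range(n) if j != i)
        result.append(dev[i] / s2 * lag)
    return result

# Five hypothetical zones in a row; the middle zone is a clear outlier.
w = [[0, 1, 0, 0, 0],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 0, 0, 1, 0]]
local = local_morans_i([0.0, 0.0, 10.0, 0.0, 0.0], w)
```

The middle zone receives a negative value because it is surrounded by dissimilar neighbours, while the two end zones, whose only neighbour is similar, receive positive values.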
4. Results
The results of applying the unconstrained SIM, doubly constrained and the half-life model on
migration flows on municipality level in Sweden are presented in the following sections.
4.1. The Unconstrained Model
In order to obtain the decay parameter for the decay function used with the unconstrained
SIM, two regressions are run. For the model with exponential decay the dependent variable is
expressed as $\ln\left(\frac{T_{ij}}{O_i D_j}\right)$ and the independent variable as $d_{ij}$. For the model with power decay the
only difference is that the independent variable instead is $\ln(d_{ij})$. By running these regressions,
the decay parameters determining the effect of distance decay are obtained.
Table 4.1.1: Decay function parameter estimates for the unconstrained model

Decay function      Decay parameter estimate   Significance level
Exponential decay   -0.00000385                <.0001
Power decay         -0.74711                   <.0001
As shown in Table 4.1.1, the distance parameters, or decay parameters, for both functions are
significant at the 1 % significance level.⁴
Using the estimated decay values, the estimated numbers of total interactions $\hat{T}_{ij}$ are calculated
as shown in equation (7) for the exponential decay and as in equation (8) for the power decay
function.
$$\hat{T}_{ij} = O_i D_j e^{-\beta d_{ij}} \qquad (7)$$
$$\hat{T}_{ij} = O_i D_j d_{ij}^{-\beta} \qquad (8)$$
The evaluation measures are discussed in Section 4.4.
Regression Assumption Evaluation
The full regression diagnostics are found in the appendix in section 8.
Linear Relationship
The first assumption, that the independent and dependent variables have a linear relationship,
can be examined by studying the regression plots. On visual inspection, the scatter of
observations for the exponential decay function in Figure 8.1.1 in the appendix does not appear
clearly linear, which can be considered a problem. The power decay scatter in Figure 8.2.1 in the
appendix appears more linear than the exponential one, although it also looks somewhat
problematic: the plot appears to be divided into two separate clusters. On closer examination,
these clusters correspond to within-municipality migration and between-municipality migration.
Violations of the linearity assumption can create misspecification errors such as wrong
regressors (Asteriou and Hall, 2016). Since distance is assumed to be a deterrent to interaction,
the coefficients in both models are expected to be negative, and since distance is measured in
meters, their values are expected to be small. The obtained coefficients at least fulfill these
requirements, but for the exponential model in particular the nonlinearity should be viewed with
caution. Nevertheless, the adjusted R-square values in Tables 8.1.3 and 8.2.3 in the appendix make
clear that distance does not explain much of the variation in the dependent variables: only
13.7 % for the exponential function and 39.9 % for the power function. This could be explained by
the low degree of linearity.
4 The full regression outputs are found in the appendix.
Normality of the Residuals
The results in Tables 8.1.4 and 8.2.4 in the appendix show that the Kolmogorov-Smirnov test
of normality of the residuals was significant for both the exponential and the power decay
functions. The null hypothesis of normally distributed residuals was thus rejected in both cases,
which means that the normality assumption is violated. However, with large samples the normality
test should be interpreted with caution, since large sample sizes mean that even small
deviations from normality become significant. Therefore, the residual distribution plots and
residual quantile plots in Figures 8.1.3 and 8.2.3 in the appendix are also studied. In the
residual distribution plots, the distribution is approximately bell-shaped in both cases. In the
residual quantile plots, the observations approximately follow the diagonal line. These
observations indicate that the deviations from normality do not appear to be serious and that
the violation of the assumption should not be a problem.
Homoscedasticity
Tables 8.1.5 and 8.2.5 in the appendix display the White test results concerning
homoscedasticity, i.e. a common variance among the residuals. For both functions the test is
significant, which means that the null hypothesis of homogeneous variance of the residuals is
rejected. The assumption of homoscedasticity is therefore violated. However, judging from the
residual plots of the functions in Figures 8.1.2 and 8.2.2 in the appendix, the violation does
not seem to be of great magnitude.
Violating the homoscedasticity assumption mainly affects the distribution and variance of the
estimated coefficients but does not make them biased or inconsistent (Asteriou and Hall, 2016).
Since the exact values of the decay parameters are all that is needed in this case, this violation
is not considered a serious problem.
Spatial Autocorrelation
The spatial autocorrelation is evaluated with a Moran’s Index test. The Moran’s I value is 0.402
for the unconstrained power model (Figure 8.3.2 in the appendix) and 0.285 for the unconstrained
exponential model (Figure 8.3.1 in the appendix). Both values are significant, which means
that there is significant positive autocorrelation, i.e. clustered patterns. Figure 4.1.1
displays the residuals for the exponential and power functions, showing how they differ between
municipalities. The figure also displays spatially dependent patterns in the Local Indicator of
Spatial Association (LISA) maps. The pink clusters represent areas where residual values are high
and correlated, i.e. where the flows are overestimated. These clusters are found in areas of
Sweden where the population is relatively small. The light blue clusters represent areas where
residual values are low and correlated, i.e. where the flows are underestimated. These clusters
are found in the more densely populated urban areas. The red and dark blue areas represent
municipalities with high or low residual values that are not correlated with their neighboring
municipalities. They are to be interpreted as outliers.
The presence of autocorrelation causes the coefficient estimates to be inefficient and their
estimated variances to be biased and inconsistent, which causes problems with hypothesis testing.
R-square also becomes overestimated. However, the point estimates themselves can still be
unbiased and consistent (Asteriou and Hall, 2016). In this case, it is reasonable to assume that
the autocorrelation is a product of omitted variable bias, which means that more factors than
distance are needed to explain differences in flows between municipalities. The unconstrained
model thus suffers from some level of bias (Asteriou and Hall, 2016). The half-life model can be
assumed to suffer from the same issue, since it is constructed using the same variables as the
unconstrained model.
Figure 4.1.1: Spatial autocorrelation. Residuals and local autocorrelation clusters are shown on
municipality level.
4.2. The Doubly Constrained Model
The decay parameter is calibrated through an iterative process that also generates the balancing
factors 𝐴𝑖 and 𝐵𝑗. The resulting decay parameters are presented in Table 4.2.1 below. As can be
seen, the parameter for the exponential decay is missing: due to computational difficulties with
the iterative process, no results for the exponential decay were produced.
Table 4.2.1: Decay parameter estimates for the doubly constrained spatial interaction model
Decay function       Decay parameter
Power decay          -1.074354
Exponential decay    N/A
The estimation is evaluated and compared to the other models in section 4.4.
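The balancing step of such an iterative process can be sketched as follows. The snippet assumes the standard textbook form of the doubly constrained model, in which the estimated flow is the product of 𝐴𝑖, 𝑂𝑖, 𝐵𝑗, 𝐷𝑗 and a power deterrence term, and it holds the decay parameter fixed instead of calibrating it. It is therefore an illustration of how 𝐴𝑖 and 𝐵𝑗 can be obtained, not a reproduction of the calibration used in this study; the function name and the example figures in the usage note are hypothetical.

```python
def balance_doubly_constrained(origins, destinations, distances, beta, iterations=100):
    """Iteratively compute balancing factors A_i, B_j and flows for a fixed power decay.

    origins      -- outflow totals O_i
    destinations -- inflow totals D_j (assumed to have the same grand total as O_i)
    distances    -- nested list d[i][j] of strictly positive distances
    beta         -- positive decay magnitude, used as d_ij ** (-beta)
    """
    n, m = len(origins), len(destinations)
    deterrence = [[distances[i][j] ** (-beta) for j in range(m)] for i in range(n)]
    a = [1.0] * n
    b = [1.0] * m
    for _ in range(iterations):
        # A_i makes each row of estimated flows sum to O_i for the current B_j
        a = [1.0 / sum(b[j] * destinations[j] * deterrence[i][j] for j in range(m))
             for i in range(n)]
        # B_j makes each column sum to D_j for the updated A_i
        b = [1.0 / sum(a[i] * origins[i] * deterrence[i][j] for i in range(n))
             for j in range(m)]
    flows = [[a[i] * origins[i] * b[j] * destinations[j] * deterrence[i][j]
              for j in range(m)] for i in range(n)]
    return a, b, flows
```

Called with, for example, three municipalities whose outflow and inflow totals both sum to 600, the returned flow matrix reproduces the given row and column totals. This dependence on the observed totals is what ties the model to the dataset it is fitted on.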
4.3. The Half-Life Model
When estimating the total number of interactions between every pair of municipalities using the
half-life model, the decay parameter is calculated analytically from the integral form of the
decay function and the median migration distance. The two decay functions used are the
exponential decay function and the exponential normal decay function. The median distance is
3700 m, as displayed in Table 2.1.1. The decay parameters are:
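Exponential decay:
$$\beta = -\frac{\ln(0.5)}{m} = -\frac{\ln(0.5)}{3700} = 0.000187337$$
Exponential normal decay:
$$\beta = \left[\frac{\mathrm{erf}^{-1}(0.5)}{m}\right]^{2} = \left[\frac{\mathrm{erf}^{-1}(0.5)}{3700}\right]^{2} = 0.000000016616$$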
Table 4.3.1: Decay parameter estimates for the half-life model
Decay function              Beta value
Exponential decay           -0.000187337
Exponential normal decay    -0.000000016616
These decay parameters are used in the same way as the ones estimated by regressions for the
unconstrained SIM in order to estimate the total number of interactions as shown in equations
(9) and (10).
$$\hat{T}_{ij} = O_i D_j e^{-\beta d_{ij}} \qquad (9)$$
$$\hat{T}_{ij} = O_i D_j e^{-\beta d_{ij}^{2}} \qquad (10)$$
The evaluation measures are discussed in Section 4.4.
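The two half-life parameters can be recomputed with a few lines of Python. The sketch below uses only the standard library and obtains the inverse error function from the standard normal quantile function; the function names are illustrative, and the median distance of 3700 m is taken from the text above.

```python
import math
from statistics import NormalDist


def inverse_erf(p):
    """erf^{-1}(p) via the identity erf(x) = 2 * Phi(x * sqrt(2)) - 1."""
    return NormalDist().inv_cdf((p + 1.0) / 2.0) / math.sqrt(2.0)


def half_life_beta_exponential(median_distance):
    """Beta such that exp(-beta * m) equals 0.5 at the median distance m."""
    return -math.log(0.5) / median_distance


def half_life_beta_exponential_normal(median_distance):
    """Beta such that erf(sqrt(beta) * m) equals 0.5 at the median distance m."""
    return (inverse_erf(0.5) / median_distance) ** 2


beta_exp = half_life_beta_exponential(3700.0)
beta_norm = half_life_beta_exponential_normal(3700.0)
```

Both values agree with the figures reported above to the stated number of digits. Only the median distance is needed as input, which is why the approach requires no flow matrix.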
4.4. Decay Parameter Evaluation
The decay parameters used for the different spatial interaction models are evaluated by their
respective Pearson’s correlation and RMSE. The evaluation measurements are presented in
Table 4.4.1.
Table 4.4.1: Decay parameter estimates and evaluation measures.
*** indicates that the p-value of the correlation is less than 0.001.
Model                Decay function             Decay parameter estimate   $\mathrm{Corr}(\hat{T}_{ij}, T_{ij})$   RMSE
Unconstrained        Exponential decay          -0.00000385                0.724***                                39104.03
Unconstrained        Power decay                -0.74711                   0.734***                                1364.56
Half-life            Exponential decay          -0.000187337               0.842***                                31821.50
Half-life            Exponential normal decay   -0.000000016616            0.846***                                32464.84
Doubly constrained   Power decay                -1.074354                  0.985***                                6.99
Table 4.4.1 shows that all of the correlations are relatively high, ranging from approximately
0.724 to 0.985. This indicates that all of the spatial interaction models explain the actual
flows to some extent. According to Pearson’s correlation, both of the half-life models also
estimate the flows somewhat better than both of the unconstrained models. However, the doubly
constrained model far outperforms the other models and predicts the migration flows very well.
The other evaluation measure, RMSE, produces a slightly different outcome. Here the
unconstrained power decay model has a substantially lower value than the half-life models and
the unconstrained exponential decay model. The two half-life models perform approximately
equally, and the unconstrained exponential decay model performs slightly worse. The RMSE of the
doubly constrained power model is better than that of all other models by several orders of
magnitude.
Taking the two measures together, the unconstrained exponential decay model performed worst
overall. Interestingly, the unconstrained power decay model has a much lower RMSE than the
half-life models but at the same time a lower correlation. The doubly constrained power model
excels on both measures and is considerably better than the other models.
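The two evaluation measures can be written out as below. The sketch also illustrates a general property that may help when reading Table 4.4.1: Pearson’s correlation is unchanged when all predictions are multiplied by a positive constant, whereas RMSE is sensitive to the overall scale of the predictions. Whether this explains the differences between the models in this study is not established here; the function names and toy numbers are illustrative.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between predicted and observed flows."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def pearson_correlation(predicted, observed):
    """Pearson's correlation coefficient between predicted and observed flows."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


# Toy data: predictions with the right pattern but ten times the observed level.
observed_flows = [1.0, 2.0, 3.0, 4.0]
scaled_predictions = [10.0, 20.0, 30.0, 40.0]
toy_corr = pearson_correlation(scaled_predictions, observed_flows)
toy_rmse = rmse(scaled_predictions, observed_flows)
```

In the toy data the correlation is essentially perfect while the RMSE is large, so a model can rank well on one measure and poorly on the other.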
5. Discussion
The doubly constrained power model outperformed both the unconstrained and the half-life
models regardless of their type of decay function. The result is not surprising, since the
construction of a doubly constrained model depends on the data it will later predict, which means
that it adjusts itself through iterations to the specific dataset. The implication is that the
model is good for the data it was fitted to but not for other data. Predictions on other data
with the doubly constrained model will not be valid, since the model is adjusted to a specific
dataset. This severely limits the applicability of the model and should be taken into
consideration when using the doubly constrained model.
The issue with spatial autocorrelation that clearly affects the unconstrained models, as shown
in Figure 4.1.1, is almost certainly present in the half-life model as well by the very
construction of the model. The only difference between the models is how the decay parameters
are estimated, and thus the flaws of the unconstrained models are bound to be found in the
half-life model too. The residuals in Figure 4.1.1 show that the models over- and underestimate
the migrational flows depending on where in Sweden the municipality is situated. The models
underestimate the flows for municipalities situated around the largest urban areas in Sweden,
while they overestimate the flows for the large, sparsely populated municipalities in the north.
These findings suggest that when studying migration on municipality level in Sweden with the
unconstrained and half-life models, it is necessary to include municipality-specific parameters
that take the existing structural differences into account; otherwise the models will suffer from
omitted variable bias.
Which model works best for studying migration flows on municipality level in Sweden also
depends on what type of analysis the model is used for. While the doubly constrained model
estimates are the most accurate, its 𝐴𝑖 and 𝐵𝑗 values have no meaningful interpretation and
make the model completely adjusted to the specific dataset. The unconstrained model is more
intuitive but less accurate. Including municipality-specific variables in the model could
potentially improve the estimates substantially. The half-life model, which turned out to
perform on a similar level as the unconstrained model, is easier to calculate and can be used to
obtain a decay parameter where only small amounts of flow data are available, since all it needs
is the median travelling distance. The half-life models also had a higher correlation than the
unconstrained models. However, since the power function seemed to provide a better fit than the
exponential in general, the half-life model, which cannot use the power function, could be
considered an inferior option in the context of migration.
As mentioned in the section covering the decay parameters, according to the literature the
exponential function provides the most accurate results for short distances, while the power
function is better suited for longer distances. In this study both types of distances were used,
and the power function had the best model fit. This does not necessarily mean that it is the
preferred functional form for models that combine distance types. It might be a product of a
shortage of short distances and an overrepresentation of longer distances in the data. This
shortage can partially be explained by how the data was aggregated. It might be more prudent to
split the data in two and measure short-distance and long-distance migration separately.
6. Conclusions
When studying migrational flows in Sweden on municipality level, there is no clear answer to
which model should be used. Judging by the evaluation measures, the doubly constrained model is
the most suitable choice. However, the three model types that have been examined all have
strengths and weaknesses. Depending on the research goal and the resources available, each
researcher must consider which model is the most appropriate.
The decay function that best describes the migrational flows in this study is the power
function. This result might be misleading to some extent due to how the study was conducted and
how the data was aggregated. Further investigation of how the aggregation of the data affects
the decay parameters could clarify this question. From our findings, however, the power function
has been the best at predicting migrational flows.
7. References
Anselin, L. 1995. Local Indicators of Spatial Association-LISA. Geographical Analysis.
27(2): 93-115.
https://doi.org/10.1111/j.1538-4632.1995.tb00338.x
(Accessed 2018-05-10)
Asteriou, D. and Hall, S G. 2016. Applied Econometrics. 3rd ed. London: Palgrave.
Fotheringham, A. S. 1980. Spatial Structure, Spatial Interaction, and Distance Decay
Parameters. Diss. McMaster University.
https://macsphere.mcmaster.ca/bitstream/11375/11312/1/fulltext.pdf
(Accessed 2018-04-20)
Fotheringham, A. S. and O'Kelly, M. E. 1989. Spatial interaction models: formulations and
applications. Dordrecht: Kluwer Academic Publishers.
Griffith, D.A. 2003. Spatial Autocorrelation and Spatial Filtering: Gaining Understanding
Through Theory and Scientific Visualization. New York: Springer-Verlag Berlin Heidelberg.
https://link-springer-com.ezproxy.its.uu.se/content/pdf/10.1007%2F978-3-540-24806-4.pdf
(Accessed 2018-05-11)
Griffith, D.A. and Fischer, M.M. 2013. Constrained Variants of the Gravity
Model and Spatial Dependence: Model Specification and Estimation Issues. Journal of
Geographical Systems. 15(3): 291-317.
https://doi.org/10.1007/s10109-013-0182-7
(Accessed 2018-05-05)
Haynes, K E. and Fotheringham, A S. 1984. Gravity and Spatial Interaction Models. Beverly
Hills: Sage Publications.
Hipp, J. R. and Boessen, A. 2017. The Shape of Mobility: Measuring the Distance Decay
Function of Household Mobility. The Professional Geographer, 69 (1): 32-44.
DOI: 10.1080/00330124.2016.1157495
Lantmäteriet. RT 90.
https://www.lantmateriet.se/sv/Kartor-och-geografisk-information/GPS-och-geodetisk-matning/Referenssystem/Tvadimensionella-system/RT-90/
(Accessed 2018-05-03)
Li, H. Calder, A.K. and Cressie, N. 2007. Beyond Moran’s I: Testing for Spatial Dependence
Based on the Spatial Autoregressive Model. Geographical Analysis. 39(4). 357-375.
https://doi.org/10.1111/j.1538-4632.2007.00708.x
(Accessed 2018-05-17)
Openshaw, S. and Taylor, P J. 1979. A Million or so Correlation Coefficients: Three
Experiments on the Modifiable Areal Unit Problem. Statistical Applications in the Spatial
Sciences. Wrigley, N. London. Pion: 127-144.
http://www.csiss.org/GISPopSci/workshops/2009/UCSB/readings/Openshaw-Taylor-1979.pdf
(Accessed 2018-05-12)
Poot, J. Alimi, O. Cameron, M. P., and Maré, D. C. 2016. The Gravity Model of Migration:
The Successful Comeback of an Ageing Superstar in Regional Science.
IZA Discussion Paper No. 10329. Available at SSRN: https://ssrn.com/abstract=2864830
Ravenstein, E. G. 1885. The Laws of Migration. Journal of the Statistical Society of London.
2(48): 167-235.
http://www.jstor.org/stable/pdf/2979181.pdf?refreqid=excelsior:c68114a1919ab9acde5ba396f17c693b
(Accessed 2018-04-10)
Robinson, A H. 1956. The Necessity of Weighting Values in Correlation Analysis of Areal
Data. Annals of the Association of American Geographers. 46(2): 233-236.
http://www.jstor.org/stable/pdf/2561482.pdf?refreqid=excelsior%3A58d46761d574134843a35350fd6b8aa8
(Accessed 2018-05-10)
Rodrigue, J. Comtois, C. and Slack B. 2017. The Geography of Transport Systems 4th ed.
London: Routledge.
Statistiska centralbyrån. (2017-05-10). Folkmängd i riket, län och kommuner 31 mars 2017
och befolkningsförändringar 1 januari–31 mars 2017.
https://www.scb.se/hitta-statistik/statistik-efter-amne/befolkning/befolkningens-sammansattning/befolkningsstatistik/pong/tabell-och-diagram/kvartals--och-halvarsstatistik--kommun-lan-och-riket/kvartal-1-2017/
(Accessed 2018-05-08)
Yang, X. Herrera, C. Eagle, N. and González, C.M. 2014. Limits of Predictability in
Commuting Flows in the Absence of Data for Calibration. Scientific Reports. 4, Article
number: 5662.
doi:10.1038/srep05662
Östh, J. Lyhagen, J. and Reggiani, A. 2016. A New Way of Determining Distance Decay
Parameter in Spatial Interaction Models with Application to Job Accessibility Analysis.
European Journal of Transport and Infrastructure Research. 16(2): 344-363.
http://uu.diva-portal.org/smash/get/diva2:912729/FULLTEXT01.pdf
(Accessed 2018-03-26)
8. Appendix
8.1. The Unconstrained Exponential Decay Model
In this section, the regression outputs from the unconstrained exponential decay model are
found. Table 8.1.1 shows the parameter estimates of the model, and Tables 8.1.2 and 8.1.3
show some further evaluative measures.
Figure 8.1.1 shows the regression plot, where the y-axis represents the dependent
variable $\ln\left(\frac{T_{ij}}{O_i D_j}\right)$ and the x-axis represents the distance. Figure
8.1.2 shows the residual plot, and some fit diagnostics are shown in Figure 8.1.3.
Thereafter, the formal tests made in order to evaluate the assumptions are shown. Table
8.1.4 contains the Kolmogorov-Smirnov test of normality and Table 8.1.5 the White test.
Table 8.1.1: Parameter estimates for the unconstrained exponential decay model
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   1    -6.84205             0.05079          -134.71   <.0001
Distance    1    -0.00000385          1.919705E-7      -20.03    <.0001
Table 8.1.2: Analysis of variance for the unconstrained exponential decay model
Analysis of Variance
Source            DF     Sum of Squares   Mean Square   F Value   Pr > F
Model             1      1613.44346       1613.44346    401.33    <.0001
Error             2517   10119            4.02029
Corrected Total   2518   11733
Table 8.1.3: Evaluation measures for the unconstrained exponential decay model
Root MSE 2.00507 R-Square 0.1375
Dependent Mean -7.47037 Adj R-Sq 0.1372
Coeff Var -26.84026
Figure 8.1.1: Regression plot of the unconstrained exponential decay model, with
$\ln\left(\frac{T_{ij}}{O_i D_j}\right)$ on the y-axis and distance on the x-axis.
Figure 8.1.2: Residual plot of the unconstrained exponential decay model
Figure 8.1.3: Fit diagnostics for the unconstrained exponential decay model
Table 8.1.4: Test of normality of residuals for the unconstrained exponential decay model
Test for normality    Statistic      p Value
Kolmogorov-Smirnov    D 0.055644     Pr > D <0.0100
Table 8.1.5: White test of heteroscedasticity of residuals for the unconstrained exponential
decay model
Test of First and Second Moment Specification
DF   Chi-Square   Pr > ChiSq
2    37.39        <.0001
8.2. The Unconstrained Power Decay Model
In this section, the regression outputs from the unconstrained power decay model are found.
Table 8.2.1 shows the parameter estimates of the model, and Tables 8.2.2 and 8.2.3 show
some further evaluative measures.
Figure 8.2.1 shows the regression plot, where the y-axis represents the dependent
variable $\ln\left(\frac{T_{ij}}{O_i D_j}\right)$ and the x-axis represents the natural logarithm
of distance. Figure 8.2.2 shows the residual plot, and some fit diagnostics are shown in
Figure 8.2.3.
Thereafter, the formal tests made in order to evaluate the assumptions are shown. Table
8.2.4 contains the Kolmogorov-Smirnov test of normality and Table 8.2.5 the White test.
Table 8.2.1: Parameter estimates for the unconstrained power decay model
Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   1    0.72810              0.20314          3.58      0.0003
dlog        1    -0.74711             0.01826          -40.91    <.0001
Table 8.2.2: Analysis of variance for the unconstrained power decay model
Analysis of Variance
Source            DF     Sum of Squares   Mean Square   F Value   Pr > F
Model             1      4686.20765       4686.20765    1673.95   <.0001
Error             2517   7046.29940       2.79948
Corrected Total   2518   11733
Table 8.2.3: Evaluation measures for the unconstrained power decay model
Root MSE 1.67317 R-Square 0.3994
Dependent Mean -7.47037 Adj R-Sq 0.3992
Coeff Var -22.39737
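The evaluation measures in Table 8.2.3 follow from the sums of squares in Table 8.2.2 through standard regression identities. The sketch below recomputes them as a consistency check; it assumes one predictor and 2519 observations (implied by the 2518 total degrees of freedom), and the function name is illustrative.

```python
import math


def regression_fit_summary(ss_model, ss_error, n_observations, n_predictors=1):
    """Recompute Root MSE, R-square, adjusted R-square and F from an ANOVA table."""
    ss_total = ss_model + ss_error
    df_error = n_observations - n_predictors - 1
    mean_square_error = ss_error / df_error
    r_square = ss_model / ss_total
    adj_r_square = 1.0 - (1.0 - r_square) * (n_observations - 1) / df_error
    f_value = (ss_model / n_predictors) / mean_square_error
    return {
        "root_mse": math.sqrt(mean_square_error),
        "r_square": r_square,
        "adj_r_square": adj_r_square,
        "f_value": f_value,
    }


# Sums of squares copied from Table 8.2.2 (unconstrained power decay model).
power_fit = regression_fit_summary(4686.20765, 7046.29940, 2519)
```

The recomputed values agree with the reported Root MSE, R-Square, Adj R-Sq and F Value up to rounding.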
Figure 8.2.1: Regression plot for the unconstrained power decay model, with
$\ln\left(\frac{T_{ij}}{O_i D_j}\right)$ on the y-axis and the natural logarithm of distance on
the x-axis.
Figure 8.2.2: Residual plot for the unconstrained power decay model
Figure 8.2.3: Fit diagnostics for the unconstrained power decay model
Table 8.2.4: Test of normality of residuals for the unconstrained power decay model
Test for normality    Statistic      p Value
Kolmogorov-Smirnov    D 0.035495     Pr > D <0.0100
Table 8.2.5: White test of heteroscedasticity of residuals for the unconstrained power
decay model
Test of First and Second Moment Specification
DF   Chi-Square   Pr > ChiSq
2    8.34         0.0154
8.3. Moran’s Index for the Unconstrained Models
In the following two figures the results of the global Moran’s Index test are found. Figure
8.3.1. shows the results for the unconstrained exponential decay model and Figure 8.3.2.
shows the results for the unconstrained power decay model.
Figure 8.3.1: Moran’s Index for the unconstrained exponential decay model
Figure 8.3.2: Moran’s Index for the unconstrained power decay model