
Trip distribution forecasting with multilayer perceptron neural networks: A critical evaluation

M. Mozolin a, J.-C. Thill b,*, E. Lynn Usery c

a ESRI, Inc., Redlands, CA, USA

b Department of Geography and National Center for Geographic Information and Analysis, State University of New York at Buffalo, Amherst, NY, USA

c Department of Geography, University of Georgia, Athens, GA, USA

Received 17 February 1998; received in revised form 15 April 1999; accepted 19 April 1999

Abstract

This study compares the performance of multilayer perceptron neural networks and maximum-likelihood doubly-constrained models for commuter trip distribution. Our experiments produce overwhelming evidence at variance with the existing literature that the predictive accuracy of neural network spatial interaction models is inferior to that of maximum-likelihood doubly-constrained models with an exponential function of distance decay. The study points to several likely causes of neural network underperformance, including model non-transferability, insufficient ability to generalize, and reliance on sigmoid activation functions, and their inductive nature. It is concluded that current perceptron neural networks do not provide an appropriate modeling approach to forecasting trip distribution over a planning horizon for which distribution predictors (number of workers, number of residents, commuting distance) are beyond their base-year domain of definition. © 1999 Elsevier Science Ltd. All rights reserved.

1. Introduction

A number of modeling approaches have been put forward over the years to distribute trips, freight or information among origins and destinations. One of the more successful ones is the spatial interaction (or gravity) model (Ortuzar and Willumsen, 1994). This model relates the matrix of flows to a matrix of interzonal impedance. Traditionally, the spatial interaction model is calibrated by one of several well known techniques, including regression, maximum likelihood, or by numerical heuristics. Several recent studies (Black, 1995; Fischer and Gopal, 1994; Gopal and Fischer, 1996; Openshaw, 1993) have proposed the neural network architecture as a means to model the distributed complexity of spatial interaction.¹ This line of research has shown that neural networks generally outperform classical calibration and estimation approaches.

Transportation Research Part B 34 (2000) 53–73, www.elsevier.com/locate/trb

* Corresponding author. Tel.: +716 645 2722; fax: +716 645 2329; e-mail: [email protected]ffalo.edu

0191-2615/99/$ - see front matter © 1999 Elsevier Science Ltd. All rights reserved.

PII: S0191-2615(99)00014-4

At first, this conclusion should not come as much of a surprise to many modelers given the wide success experienced by neural networks in pattern recognition and classification (Bishop, 1995; Smetanin, 1995; Ripley, 1996), as well as in various application fields of transportation engineering and planning (Dougherty, 1995; Himanen et al., 1998; Hua and Faghri, 1994). After all, neural networks impose fewer constraints on the form of the functional relationship between inputs and outputs than conventional fitting techniques. This paper revisits this conclusion by paying attention to several aspects of spatial interaction modeling that have not been addressed so far.

Our aim is to compare the performance of a perceptron neural network (NN) spatial interaction model to that of a baseline, conventionally estimated spatial interaction model beyond the comparative work done previously. The comparison is conducted empirically on journey-to-work patterns in the Atlanta metropolitan area. Our approach differs drastically from others in several respects.

Firstly, we evaluate the models in a predictive mode. In other words, calibration is done on observed, base-year data, while testing is conducted on data for the projection year. To the best of our knowledge, all other NN studies of trip distribution have used the same origin-destination matrix for training and testing, thus allowing the network to learn the noise in the training data (Black, 1995). Incidentally, NN applications to traffic data and other transportation problems also use hold-out samples for testing. Secondly, our baseline model is a doubly-constrained model estimated by maximum likelihood. This is a departure from Fischer and Gopal (1994), and Gopal and Fischer (1996), who chose the less accurate unconstrained spatial interaction model as a benchmark, and estimated model parameters by ordinary least squares regression, a method considered less precise than maximum likelihood (Fotheringham and O'Kelly, 1989).

Thirdly, we evaluate the models on origin-destination matrices of different sizes (from hundreds of origin/destination zones down to a dozen) to test the sensitivity of our conclusions to the size of the interaction system being modeled. Finally, we apply an adjustment factor to flows predicted by the NN output to satisfy production and attraction constraints, and thus make it possible to unambiguously interpret any discrepancy with flows predicted by the baseline doubly-constrained model in terms of relative performance of the models.

The paper presents a case where a conventional spatial interaction model outperforms a multilayer perceptron NN model of spatial interaction. The predictive mode of the analysis replicates the process by which trip distribution is realized in transportation planning, and thus helps to compare the merits of the conventional and NN approaches for practical applications of spatial interaction modeling.

The remainder of the paper is organized as follows. The next two sections present an overview of the conventional spatial interaction model of journey-to-work, and of the multilayer perceptron NN model. In the following section, we describe the setup of the empirical test of the latter model against the former, as well as the data used in the test. Next, results under different modeling configurations are detailed. We conclude with a discussion of possible explanations for the better performance of the conventional spatial interaction model.

¹ Some work has also been done on estimating origin-destination matrices from traffic counts or cordon intercept survey data with neural networks (for instance, Kikushi et al., 1993). This type of problem is out of the scope of this paper.

2. Journey to work problem and its conventional solution

Spatial interaction may be defined in general terms as any flow of commodity, people, capital, or information over space resulting from some explicit or implicit decision process (Fotheringham and O'Kelly, 1989). Journey to work is one kind of spatial interaction. Other kinds of spatial interaction include journey to school, shopping trips, non-home based intraurban trips, intercity population migration, choice of college or university by students, intercity freight movement, telephone calls, Internet access, and many others.

Spatial interaction models are often classified on the basis of the number and character of constraints imposed on the predicted trip matrix. Constraints represent a priori knowledge about the total interaction flows entering and/or exiting a particular zone. For example, if the total number of employed residents in each zone i ($O_i$) is known exogenously, then the sum of predicted flows leaving each zone is equal to $O_i$:

\[ \sum_j T_{ij} = O_i, \quad \forall i, \tag{1} \]

where $T_{ij}$ is the flow of commuters from zone i to zone j predicted by the model. Similarly, if one knows total employment in each zone ($D_j$), one can impose that the sum of predicted commuter flows ending in each zone is equal to $D_j$:

\[ \sum_i T_{ij} = D_j, \quad \forall j. \tag{2} \]

If Eq. (1) holds for each origin zone then the model is said to be production constrained; if Eq. (2) holds for each destination zone then the model is referred to as attraction constrained. If both Eqs. (1) and (2) hold, the model is doubly constrained, while if neither of the two holds, the model is unconstrained.
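The classification above can be illustrated with a short sketch that checks a predicted flow matrix against Eqs. (1) and (2). The function name `constraint_type` and the tolerance `tol` are hypothetical choices made for this illustration; they do not come from the paper.

```python
def constraint_type(T, O, D, tol=1e-9):
    """Classify a predicted flow matrix T (list of rows) by the marginal
    constraints it satisfies. O and D are the known origin and destination
    totals; tol is an assumed numerical tolerance."""
    n_orig, n_dest = len(T), len(T[0])
    # Eq. (1): every row sum must equal the corresponding O_i.
    rows_ok = all(abs(sum(T[i]) - O[i]) <= tol for i in range(n_orig))
    # Eq. (2): every column sum must equal the corresponding D_j.
    cols_ok = all(
        abs(sum(T[i][j] for i in range(n_orig)) - D[j]) <= tol
        for j in range(n_dest)
    )
    if rows_ok and cols_ok:
        return "doubly constrained"
    if rows_ok:
        return "production constrained"
    if cols_ok:
        return "attraction constrained"
    return "unconstrained"
```

For a 2 x 2 matrix with rows [10, 20] and [30, 40], the call returns "doubly constrained" only when O = [30, 70] and D = [40, 60].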

Trip distribution may be modeled with any number of constraints. Implementing the additional constraints requires more a priori information. In turn, the reduction of degrees of freedom leads to a more accurately predicted flow matrix. It has been shown empirically that estimating spatial interaction with a doubly-constrained model yields the most accurate results. See, for example, Fotheringham and O'Kelly (1989) on interregional migration among the nine major census divisions in the United States, or Mozolin (1997) on commuting trips within metropolitan Atlanta. Because of its higher accuracy in modeling trip distribution, the doubly-constrained model is a proper baseline against which to evaluate neural networks.

The doubly-constrained spatial interaction model of journey to work can be formulated mathematically as:

\[ T_{ij} = A_i O_i B_j D_j f(c_{ij}), \tag{3} \]

where $c_{ij}$ is the travel impedance (distance) from zone i to zone j, $f(c_{ij})$ is a distance decay function, $O_i$ is the number of workers resident in zone i (production), $D_j$ is the number of workers in zone j (attraction), and $A_i$ and $B_j$ are balancing coefficients ensuring that Eqs. (1) and (2) are satisfied:

\[ A_i = \frac{1}{\sum_j B_j D_j f(c_{ij})} \tag{4} \]

and

\[ B_j = \frac{1}{\sum_i A_i O_i f(c_{ij})}. \tag{5} \]
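Because the balancing coefficients in Eqs. (4) and (5) depend on each other, they are usually obtained by alternating between the two equations until the flows stop changing. The sketch below illustrates that idea under stated assumptions: the function name `balance`, the stopping rule, and the iteration cap are hypothetical and are not taken from SIMODEL. `F[i][j]` holds precomputed values of the distance decay function, and the totals of `O` and `D` are assumed to be equal.

```python
def balance(O, D, F, tol=1e-10, max_iter=1000):
    """Alternate Eqs. (4) and (5), then return the flows of Eq. (3).

    O, D : origin and destination totals (sum(O) is assumed equal to sum(D)).
    F    : matrix of distance decay values f(c_ij), all positive.
    """
    n, m = len(O), len(D)
    B = [1.0] * m  # arbitrary starting values for the B_j
    for _ in range(max_iter):
        # Eq. (4): update every A_i from the current B_j.
        A = [1.0 / sum(B[j] * D[j] * F[i][j] for j in range(m)) for i in range(n)]
        # Eq. (5): update every B_j from the new A_i.
        B = [1.0 / sum(A[i] * O[i] * F[i][j] for i in range(n)) for j in range(m)]
        # Eq. (3): flows implied by the current coefficients.
        T = [[A[i] * O[i] * B[j] * D[j] * F[i][j] for j in range(m)] for i in range(n)]
        # Column sums match D after the B update; stop once row sums match O.
        if all(abs(sum(T[i]) - O[i]) <= tol * O[i] for i in range(n)):
            break
    return T, A, B
```

With positive decay values the loop typically settles within a few dozen passes for small matrices; the relative tolerance is an illustrative choice.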

Two alternative specifications of the distance decay function $f(c_{ij})$ will be used here: the negative power function $c_{ij}^{-\beta}$ ($\beta \geq 0$), and the negative exponential function $\exp(-\beta c_{ij})$ ($\beta \geq 0$).

Of all the methods suggested to calibrate spatial interaction models (see, for instance, Bacharach (1970), Batty (1976), Evans (1971), Fotheringham and O'Kelly (1989), Ortuzar and Willumsen (1994), Wilson (1970)), we choose to use a maximum-likelihood estimation (MLE) approach. Batty and Mackie (1972) have shown that likelihood maximization boils down to solving a non-linear equation. With a power distance function, this equation is given by

\[ \sum_i \sum_j T_{ij} \ln(c_{ij}) = \sum_i \sum_j T^{*}_{ij} \ln(c_{ij}), \tag{6} \]

where $T^{*}_{ij}$ is the actual number of commuters from zone i to zone j, and $T_{ij}$ is estimated by Eqs. (3)–(5). The same holds, mutatis mutandis, with an exponential distance function. The SIMODEL computer code (Williams and Fotheringham, 1984) is used to derive parameter estimates.
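For the exponential function, the counterpart of Eq. (6) replaces $\ln(c_{ij})$ with $c_{ij}$, so calibration amounts to matching the predicted and observed total travel impedance. The sketch below solves that condition by bisection. It is not the SIMODEL procedure: the function names, the search bracket `[lo, hi]`, and the fixed number of balancing passes are assumptions made for illustration, and the bisection relies on total predicted impedance decreasing as $\beta$ grows.

```python
import math


def predict(O, D, c, beta, passes=200):
    """Doubly-constrained flows (Eqs. 3-5) with f(c) = exp(-beta * c)."""
    n, m = len(O), len(D)
    F = [[math.exp(-beta * c[i][j]) for j in range(m)] for i in range(n)]
    B = [1.0] * m
    for _ in range(passes):  # fixed number of balancing passes (assumption)
        A = [1.0 / sum(B[j] * D[j] * F[i][j] for j in range(m)) for i in range(n)]
        B = [1.0 / sum(A[i] * O[i] * F[i][j] for i in range(n)) for j in range(m)]
    return [[A[i] * O[i] * B[j] * D[j] * F[i][j] for j in range(m)] for i in range(n)]


def total_impedance(T, c):
    """Sum of T_ij * c_ij over all origin-destination pairs."""
    return sum(T[i][j] * c[i][j] for i in range(len(T)) for j in range(len(T[0])))


def calibrate_beta(T_obs, c, lo=0.0, hi=5.0, tol=1e-8):
    """Find beta such that predicted and observed total impedance agree."""
    O = [sum(row) for row in T_obs]
    D = [sum(T_obs[i][j] for i in range(len(T_obs))) for j in range(len(T_obs[0]))]
    target = total_impedance(T_obs, c)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total_impedance(predict(O, D, c, mid), c) > target:
            lo = mid  # predicted trips are too long: decay must be stronger
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A quick self-check is to generate flows with a known parameter via `predict` and confirm that `calibrate_beta` recovers it.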

3. Multilayer perceptron neural networks applied to the journey to work problem

Background of multilayer perceptron neural networks is presented below before we proceed with their application to the journey to work problem. The multilayer perceptron neural network is one of a variety of parallel computing techniques that conceptually mimic structures and functions of human central neural systems. The model used in this study is a three-layer fully-connected feedforward NN which consists of input nodes representing independent variables (the productions, the attractions, and the travel impedances), hidden nodes, and one output node for the dependent variable, namely the flow $T_{ij}$ from zone i to zone j. See Fig. 1 for the architecture of a NN with four hidden nodes. Each input node corresponds to an independent variable in the flow model while the dependent variable $T_{ij}$ is the output node. The network output (activation) z is obtained by a double logistic transform of the weighted sum of inputs. The reader is referred to Haykin (1998), Rojas (1996), or Smith (1993) for an in-depth coverage of the NN methodology.
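The "double logistic transform" can be read as one logistic function applied at the hidden layer and a second one applied at the output node. The sketch below shows a forward pass for three inputs and an arbitrary number of hidden nodes. The bias terms and the names `logistic` and `forward` are assumptions made for this illustration; the paper does not spell them out.

```python
import math


def logistic(a):
    """Logistic (sigmoid) activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass of a three-layer perceptron with one output node.

    x        : scaled inputs (production, attraction, impedance).
    W_hidden : one weight row per hidden node.
    b_hidden : one bias per hidden node (assumed, not stated in the paper).
    w_out    : weights from hidden nodes to the output node.
    b_out    : output bias (assumed).
    """
    # First logistic transform: hidden-node activations.
    hidden = [
        logistic(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_hidden, b_hidden)
    ]
    # Second logistic transform: network output z, the scaled flow.
    z = logistic(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return z, hidden
```

Because the output passes through a logistic function, z always lies strictly between 0 and 1, which is why the observed flows must be rescaled before training.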

The most valuable property of multilayer feedforward NNs is their ability to approximate a desired function from training examples. In fact, a three-layer fully connected feedforward NN with n input nodes, a sufficient number of hidden nodes, and one output node can be trained to approximate an n to 1 mapping function of arbitrary complexity (Kreinovich and Sirisaengtaskin, 1993). Learning of network weights often proceeds by backpropagation of errors (Rumelhart et al., 1986) so as to minimize the total error for all examples in a training set.

The backward pass is a recursive procedure (Rumelhart et al., 1986) during which the partial derivatives of the error with respect to all network weights are computed and used to adjust the weights up or down to reduce total error. In this research, we use off-line, or epoch-based, learning: network weights are adjusted only after all examples in the training set have been processed. Several non-linear optimization methods are available to find a set of weights that minimizes the error on all examples in the training set. This study uses the Quickprop algorithm developed by Fahlman (1989). Though it does not guarantee a global optimum, its quick convergence dramatically increases the speed of NN training. The gradient descent method (Rumelhart et al., 1986) is applied in those rare instances where Quickprop cannot be used.
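To make the epoch-based update concrete, the sketch below fits a single weight with a simplified Quickprop-style rule: the gradient is accumulated over the whole training set, the step is a secant jump based on the current and previous gradients, its size is capped by a maximum growth factor, and plain gradient descent is used when no previous step is available. This is a deliberately reduced illustration, not NevProp's implementation; Fahlman's full algorithm contains further rules that are omitted here, and the one-weight model y = w * x and all names are hypothetical.

```python
def batch_gradient(w, examples):
    """Derivative of the total error 0.5 * sum((w*x - y)^2) with respect to w,
    accumulated over every example (one epoch)."""
    return sum((w * x - y) * x for x, y in examples)


def quickprop_fit(examples, w=0.0, lr=0.1, max_growth=1.75, epochs=50):
    """Simplified Quickprop-style training of a single weight."""
    prev_step, prev_grad = 0.0, 0.0
    for _ in range(epochs):
        grad = batch_gradient(w, examples)  # weights change only after a full pass
        denom = prev_grad - grad
        if prev_step != 0.0 and denom != 0.0:
            # Secant jump toward the minimum of the parabola fitted through
            # the last two gradient values.
            step = prev_step * grad / denom
            # Cap the step at max_growth times the previous step size.
            limit = max_growth * abs(prev_step)
            step = max(-limit, min(limit, step))
        else:
            # Fallback: ordinary gradient descent with learning rate lr.
            step = -lr * grad
        w += step
        prev_step, prev_grad = step, grad
    return w
```

On a purely quadratic error surface such as this one, the secant jump lands on the minimum almost immediately, which gives some intuition for the fast convergence noted above.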

Backpropagation neural networks easily fit into the framework of the doubly-constrained spatial interaction model. The network learns the mapping function that best fits the relationship between the independent variables (production, attraction, and travel impedance) and the dependent variable (flows). Interestingly, the mapping function is no longer restricted to either power or exponential functional form as in the conventional models. Nor is it explicitly specified as a linear or non-linear regression model. The major advantage of the NN approach is that it is flexible enough to model non-linear relationships of arbitrary complexity in an automated fashion.

As noted by Black (1995), Fischer and Gopal (1994), and Gopal and Fischer (1996), a NN may perform well enough to estimate actual spatial interaction flows, but small deviations are bound to remain. Furthermore, the network itself does not contain any mechanism to enforce the origin and destination constraints. Consequently, the origin and destination totals derived by summing the flows predicted by the model are usually not equal to the actual origin and destination totals.

Fig. 1. Fully-connected three-layer perceptron neural network.


We use a standard iterative proportional fitting procedure (Slater, 1976) to enforce these constraints. After this post-processing, NN flows give predictions comparable to those of a doubly-constrained model.
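Iterative proportional fitting alternately rescales the rows and the columns of the predicted matrix until both sets of totals are met. The sketch below shows the general technique; it is not the authors' post-processing code, and the function name `ipf`, the tolerance, and the iteration cap are assumptions.

```python
def ipf(T, O, D, tol=1e-9, max_iter=1000):
    """Rescale a non-negative flow matrix so that row sums match O and
    column sums match D (sum(O) is assumed equal to sum(D))."""
    T = [row[:] for row in T]  # work on a copy, leave the input untouched
    n, m = len(O), len(D)
    for _ in range(max_iter):
        # Row step: scale each row to its production total O_i.
        for i in range(n):
            s = sum(T[i])
            if s > 0:
                T[i] = [t * O[i] / s for t in T[i]]
        # Column step: scale each column to its attraction total D_j.
        for j in range(m):
            s = sum(T[i][j] for i in range(n))
            if s > 0:
                for i in range(n):
                    T[i][j] *= D[j] / s
        # Columns are exact after the column step; stop when rows also match.
        if all(abs(sum(T[i]) - O[i]) <= tol for i in range(n)):
            break
    return T
```

Applied to raw network outputs, this keeps the relative pattern of the predicted flows while forcing the marginal totals to agree with the known productions and attractions.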

Network training is realized with NevProp 1.16 (Goodman, 1996). The Quickprop algorithm is embedded in NevProp, but pre- and post-processing (including scaling and enforcing production and attraction constraints) are part of separate applications written by the authors.

4. Empirical analysis

4.1. Study Area and Data Sources

We use 1980 and 1990 journey-to-work commuter flows in the Atlanta Metropolitan Area. Commuter flows among the 15 counties of Atlanta SMSA for 1980 are available from the 1980 U.S. Census (Bureau of the Census, 1983). Commuter flows among the 20 counties of the Atlanta MSA (Fig. 2) for 1990 are available in the Census Transportation Planning Package (CTPP) (Bureau of Transportation Statistics, 1993). Data sets on commuting flows between census tracts in 1980 and 1990 were kindly made available to the authors by the Atlanta Regional Commission. There were 345 census tracts in 1980, and 507 census tracts in 1990 in the study area.

Fig. 2. Counties of Atlanta Metropolitan Statistical Area (MSA) in 1990.


The logistics of spatial interaction modeling requires a clearly defined region with no, or small, flows across its border. In the case of Atlanta, this assumption is not grossly violated. In 1980, slightly more than 90% of working residents of the Atlanta SMSA also worked inside the SMSA. In 1990, 95.3% of working residents of the Atlanta MSA worked inside the MSA. Also, 92.9% of those employed in the Atlanta MSA lived within the region in 1990.

Spatial separation between commuting zones (counties or census tracts) is measured by the straight-line (Euclidean) distance between zone centroids in the metropolitan area. Setting intrazone distance to zero is known to generate systematic measurement errors. It is common practice to correct for this error by defining the distance from a zone to itself as a quarter of the distance from the zone centroid to the centroid of its nearest neighbor (Thomas and Hugget, 1980). The Euclidean distance is only an approximation of the perceived impedance between home and work locations. We recognize that road distance, travel time, or a generalized travel cost function may differently affect the predictive accuracy of the MLE and NN models. However, since the major thrust of this study is to compare and evaluate two forecasting techniques, the relative accuracy of their estimates matters more than their absolute accuracy and the approximation given by Euclidean distance is acceptable.²
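The distance rule described above can be sketched as follows: interzonal distances are Euclidean distances between centroids, and each intrazonal distance is set to a quarter of the distance to the nearest neighboring centroid. The function name `distance_matrix` and the coordinate format (a list of planar x, y pairs) are assumptions made for illustration.

```python
import math


def distance_matrix(centroids):
    """Euclidean centroid-to-centroid distances, with the intrazonal distance
    set to one quarter of the distance to the nearest neighboring centroid."""
    n = len(centroids)
    d = [
        [
            math.hypot(centroids[i][0] - centroids[j][0],
                       centroids[i][1] - centroids[j][1])
            for j in range(n)
        ]
        for i in range(n)
    ]
    for i in range(n):
        # Nearest neighbor excludes the zone itself.
        nearest = min(d[i][j] for j in range(n) if j != i)
        d[i][i] = 0.25 * nearest
    return d
```

For centroids at (0, 0), (4, 0) and (0, 3), the first zone's nearest neighbor is 3 units away, so its intrazonal distance becomes 0.75.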

4.2. Implementation issues

The journey-to-work analysis is structured as follows. First, two doubly-constrained spatial interaction models are calibrated by MLE on 1980 travel data. One model uses county-level data, the second uses census tract-level data. Each model is then employed to forecast interzonal commuter flows (intercounty or intertract) for year 1990 with the calibrated distance decay parameter β and 1990 working population and employment data at the corresponding geographic resolution ($O_i$ and $D_j$ marginal totals) for production and attraction, respectively. Forecasted flows are compared to actual 1990 trip data using four goodness-of-fit measures: the absolute error (AE), the standardized root mean square error (SRMSE), Kullback's ψ statistic, and the R-square. See Fotheringham and Knudsen (1987), Weiss (1995), and others for a description of the statistics. In most cases, these measures are highly consistent. Hence, only the AE and SRMSE measures are reported for the tested models hereunder.
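The two reported measures can be sketched as below. The exact formulas are not given in this section, so the definitions used here are assumptions based on common usage in the spatial interaction literature: AE as the sum of absolute deviations expressed as a percentage of total observed flow, and SRMSE as the root mean square error divided by the mean observed flow. Inputs are flattened lists of predicted and observed flows.

```python
import math


def absolute_error_pct(pred, obs):
    """Sum of absolute deviations as a percentage of total observed flow
    (assumed definition of AE)."""
    return 100.0 * sum(abs(p - o) for p, o in zip(pred, obs)) / sum(obs)


def srmse(pred, obs):
    """Root mean square error divided by the mean observed flow
    (assumed definition of SRMSE)."""
    n = len(obs)
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / n)
    return rmse / (sum(obs) / n)
```

Under these definitions, both measures are zero for a perfect forecast and grow without an upper bound as predictions deteriorate, so an AE above 100% is possible.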

In parallel, two sets of NN spatial interaction models are trained and validated on the same 1980 travel data: one on county-level flows, the other on census tract flows. With the network weights for which the validation error is minimum and 1990 population and employment data, the NNs predict 1990 interzonal commuter flows at the county and census tract levels respectively (test sets). Goodness-of-fit of these forecasts to actual flows is once again measured. Finally, the relative performance of MLE and NN spatial interaction models in predictive mode is assessed by comparing goodness-of-fit measures.

² In fact, preliminary test results in Mozolin (1997) indicate that the goodness-of-fit of an MLE doubly-constrained model is enhanced by substituting average reported travel time for Euclidean distance, and vice versa for the NN formulation. Consequently, if the use of the Euclidean distance as a proxy for spatial impedance biased our test, it is most likely to be in favor of the NN model, which places our conclusions on the conservative side.


Selecting a NN configuration and parameters suitable for a certain problem is often a challenging task. The first stage of backpropagation feedforward NN model design entails setting the topology of the model. A natural topology for a doubly-constrained spatial interaction problem involves three inputs (the number of resident workers in the zone of origin, employment in the destination zone, and the spatial impedance between the zones) and one output (the number of commuters). It is common practice to proceed by trial and error to select the number of hidden nodes, and to test networks with hidden layers of varying size. Networks with 5, 20, and 50 hidden nodes are tested in this study. Networks of larger sizes are impractical due to the excessive computational requirement of their training. Each network configuration is processed five times, each run starting with a random set of initial weights and a training set drawn randomly from the full data set. We report results of experiments with different partitions of the full data set into training and validation sets.

The NN model is further specified as follows. Since, in most instances, weights are changed according to the Quickprop rule, no momentum term is needed, and the learning rate must be specified only for use with the gradient descent method.³ A 0.1 learning rate is used throughout the analysis. Experiments with different rates lead to remarkably similar weight estimates and learning speeds. Initial weights are randomly drawn from a uniform distribution within the range [−0.01, +0.01].

All three network inputs are scaled by dividing the value observed for each example by the input's maximum value in the set. Whereas input scaling is optional, scaling of the output is required for successful learning. Scaling to fit the output within the [0.1, 0.9] range is usually used. However, because the networks are tested on data other than those used for training and validation, and because total flows have increased between base and prediction years, the interval is narrowed, with its upper bound lowered to 0.75. Therefore, the output (the number of commuters) is scaled using a linear transformation so that 1980 flows fit the [0.25, 0.75] range. This transformation is given by

\[ T^{\text{network}}_{ij} = 0.25 + 0.5 \, \frac{T_{ij}}{T^{\max}_{1980}}, \tag{7} \]

where $T^{\text{network}}_{ij}$ is the output as seen by the network, and $T^{\max}_{1980}$ is the maximum commuter flow in 1980. At the testing stage, Eq. (7) is used in reverse.
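The scaling steps can be sketched directly from Eq. (7) and its inverse. The function names below are hypothetical; `t_max_1980` stands for the maximum commuter flow in the 1980 data.

```python
def scale_inputs(values):
    """Divide each input value by the maximum value observed in the set."""
    peak = max(values)
    return [v / peak for v in values]


def scale_output(flow, t_max_1980):
    """Eq. (7): map a flow to the value seen by the network, so that 1980
    flows fall within [0.25, 0.75]."""
    return 0.25 + 0.5 * flow / t_max_1980


def unscale_output(z, t_max_1980):
    """Eq. (7) in reverse: convert a network output back to a flow."""
    return (z - 0.25) * t_max_1980 / 0.5
```

Since a logistic output stays below 1, the inverse transformation implies that a prediction can never reach 1.5 times the largest 1980 flow, which is one way to see the headroom left for growth between the base and projection years.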

All networks are trained for a maximum of 100 000 iterations. Many neural network practitioners allow for an early stopping of the feedforward backpropagation algorithm (Sarle, 1996) in order to prevent overfitting. It is critical to realize that the error on the validation set is not a good estimate of network generalization. A stopped network is tested on an independent test set that has never been used for training to give an unbiased estimate of the network performance. We train and validate all networks on the 1980 data, while the testing is accomplished on the 1990 data.
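Early stopping of this kind can be sketched as a loop that trains up to the maximum number of epochs, checks the validation error at regular intervals, and keeps the weights with the lowest validation error seen. The sketch is generic rather than a description of NevProp: `step` and `validation_error` are hypothetical callables supplied by the caller, and the checking interval of 500 epochs is an assumption.

```python
def train_with_early_stopping(step, validation_error, weights,
                              max_epochs=100_000, check_every=500):
    """Run training epochs and keep the weights that minimize validation error.

    step(weights)             -> new weights after one epoch (hypothetical).
    validation_error(weights) -> error on the validation set (hypothetical).
    """
    best_error = validation_error(weights)
    best_weights, best_epoch = list(weights), 0
    for epoch in range(1, max_epochs + 1):
        weights = step(weights)
        if epoch % check_every == 0:
            err = validation_error(weights)
            if err < best_error:
                # Remember the best checkpoint; later epochs may overfit.
                best_error, best_weights, best_epoch = err, list(weights), epoch
    return best_weights, best_epoch, best_error
```

The returned weights are then evaluated once on data never used in training or validation, which in this study is the 1990 trip matrix.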

At the county level, a total of 225 data vectors are available. For each network processed, the training set is formed by randomly selecting 112 vectors without replacement, while the remaining 113 vectors are used for validation. In one experiment, the full set of vectors is used both for training and validation. The network weights that minimize validation error serve to test the model on the 400 interactions from the 1990 trip matrix.

³ Fahlman (1989) also suggests adding a small constant (known as the sigmoid prime offset) to the error derivative to avoid very slow learning when z is close to 0 or 1. The recommended value is 0.1. The Quickprop Maximum Growth Factor is set to 1.75, as suggested by Fahlman (1989).

At the census-tract level, the training set is selected by simple random sampling without replacement of 200 examples from the 121 104 origin-destination pairs in the 1980 tract-to-tract trip matrix.⁴ Similarly for the validation set. The optimal set of network weights is then tested on all 257 049 vectors from the 1990 tract-level trip matrix.
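The sampling schemes described above can be sketched with the standard library. The function names and the `seed` arguments are hypothetical. Whether the tract-level validation sample was kept disjoint from the training sample is not stated in the text, so `sample_pairs` simply draws one sample without replacement.

```python
import random


def split_vectors(vectors, n_train=112, seed=None):
    """Randomly split data vectors into a training set of n_train vectors
    and a validation set holding the remainder (county-level scheme)."""
    rng = random.Random(seed)
    shuffled = rng.sample(vectors, len(vectors))  # random order, no repeats
    return shuffled[:n_train], shuffled[n_train:]


def sample_pairs(n_zones, k=200, seed=None):
    """Draw k origin-destination pairs without replacement from the
    n_zones x n_zones trip matrix (tract-level scheme)."""
    rng = random.Random(seed)
    cells = rng.sample(range(n_zones * n_zones), k)
    return [(cell // n_zones, cell % n_zones) for cell in cells]
```

With the 15 x 15 county matrix, `split_vectors` yields 112 training and 113 validation vectors that together cover all 225 interactions.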

5. Results of performance comparison

5.1. Baseline spatial interaction models

The results of calibrating and testing maximum likelihood doubly-constrained models of journey to work at the county and census-tract levels are presented in Table 1. The overall performance of these models with an exponential function of spatial deterrence is sufficient to provide a benchmark against which to evaluate the performance of NN models.⁵

Of the two distance decay functions, better performance is exhibited by the negative exponential function in terms of all four goodness-of-fit measures, and at both aggregation levels (county and census tract). This is consistent with the widespread consensus that the exponential function is more appropriate for analyzing short distance interactions, such as those that take place within an urban area, while the power function is more appropriate for analyzing longer distance interactions such as interstate migration flows (Fotheringham and O'Kelly, 1989).

Table 1 clearly shows that county-level models are more accurate than models applied at the census-tract level. Lower model performance or fit at a more detailed geographic scale is not unusual. This phenomenon is known in the spatial-analytic literature as the modifiable areal unit problem (Openshaw, 1984). Treatment of the issue in the context of spatial interaction modeling can be found in Amrhein and Flowerdew (1989), Batty and Sikdar (1982), and others. In substance, if a simple functional relationship with a single parameter, like the doubly-constrained spatial interaction model, presents some difficulties in accounting for all subtleties of 400 interactions in the 20-county trip matrix, these difficulties are exacerbated with the 507-tract trip matrix. Consequently, if NNs truly outperform MLE spatial interaction models, one would anticipate the advantage to be much larger with census tracts than with county data. Along the same line, in case NNs were not to perform as well as MLE models with the latter data, the reverse statement could reasonably be expected for commuter flows between census tracts.

5.2. Neural network models

The results of the NN training and testing on county-level data are presented in Table 2. All five sets of networks exhibit good to very good ability to predict 1990 commuter flows in Atlanta.

⁴ Tests with training sets of 1000 cases do not show better predictive accuracy than with 200 cases, while computational time becomes prohibitively high.

⁵ Similar, though slightly less good results are obtained with a power function of distance.


Comparison of the first four sets of results reported in Table 2 indicates that, except for a slight tendency for performance to drop as more hidden nodes are used, there is no significant impact of the number of hidden nodes in a network on goodness-of-fit. It is also noteworthy that all better performing networks were stopped after fewer than 10 000 epochs.

Table 2

Neural network models trained using 1980 county-to-county commuter flows, and tested on 1990 commuter flows

Instance   Absolute error (%)   SRMSE   Epoch network was stopped

(a) Five-node networks
1          42.6                 1.877   2500
2          39.5                 1.182   1000
3          41.0                 1.466   3500
4          44.0                 1.762   13 500
5          40.2                 1.642   8000
Average    41.5                 1.586

(b) Twenty-node networks
1          47.0                 1.890   500
2          41.8                 1.806   4500
3          50.6                 1.788   21 000
4          32.7                 1.077   9500
5          52.6                 1.655   500
Average    44.9                 1.643

(c) Fifty-node networks
1          51.2                 1.936   31 500
2          50.5                 1.773   500
3          28.8                 0.847   1500
4          56.6                 2.396   40 500
5          42.1                 1.683   9000
Average    43.8                 1.722

(d) Twenty-node networks trained on full 1980 data
1          30.9                 0.920   100 000
2          30.1                 0.878   100 000
3          36.2                 0.962   100 000
4          41.9                 1.151   100 000
5          39.7                 1.080   100 000
Average    35.8                 0.998

Table 1

Maximum likelihood doubly-constrained models calibrated using 1980 flows, and tested on 1990 commuter flows

                                         Distance decay parameter (β)   Absolute error (AE) (%)   SRMSE

County level
Exponential function of distance decay   −8.43 × 10⁻⁵                   24.7                      0.866

Census-tract level



The bottom part of Table 2 displays the results of testing five 20-node NNs trained on the entire 1980 data set (225 training cases). As expected, they perform rather consistently and somewhat better than networks trained on half the available interaction pairs. Training a neural network on the full set of cases is usually not a recommended practice because it promotes overfitting, a point evidenced here by the failure of all five networks to converge before 100 000 training epochs. On the other hand, training data is less sparse than with a sample of cases, and the network's power to generalize input–output relationships is enhanced.

Neural networks trained and tested on tract-level flows perform significantly worse than those trained on county-level data (Table 3). The best NN model produces errors that are 83.6% of 1990 commuter flows, while the worst model hits a whopping 119.1% absolute error. The faint inverse relationship between model performance and number of hidden nodes detected above is now clearly marked. Average goodness-of-fit measures drop with the increase of hidden nodes from 5 to 50.

5.3. Neural Network versus MLE Models

None of the NN models tested on county commuter flows outperforms the corresponding MLE doubly-constrained model. The only partial exception to this rule is the case of one 50-node NN model (#3 in Table 2(c)) which shows a better performance as measured by SRMSE and R-square, but not according to the other two statistics. Even more remarkable, the five NNs trained on the entire 1980 data set (Table 2(d)) still fail to surpass the MLE model in any run, though they come closer to challenging its superiority.

Table 3

Neural network models trained using 1980 tract-to-tract commuter flows, and tested on 1990 commuter flows with training and validation sets selected randomly

Instance   Absolute error (%)   SRMSE   Epoch network was stopped

(a) Five-node networks
1          90.4                 3.265   12 000
2          83.6                 3.226   6500
3          89.0                 3.280   5500
4          108.3                4.072   500
5          96.6                 3.701   1200
Average    93.6                 3.509

(b) Twenty-node networks
1          92.4                 3.451   3000
2          105.3                4.000   1000
3          119.1                5.328   10 500
4          105.9                4.009   98 500
5          97.5                 3.760   4000
Average    104.0                4.110

(c) Fifty-node networks
1          109.0                4.048   500
2          109.8                4.106   500
3          111.0                4.218   79 000
4          90.2                 3.418   16 000
5          101.6                3.845   4000
Average    104.3                3.927

At the census-tract level, the comparison is even more favorable for the MLE model. All runs of the NN models (Table 3) trail far behind the MLE model (Table 1). The best of twelve NN models misallocates 83.6% of all commuter flows, while the conventional doubly-constrained model with the negative exponential function of distance misallocates "only" 68.7% of flows.

It is appropriate to stress here again that model performance is evaluated in a predictive mode, that is by the capacity of a model to predict interaction flows for a horizon other than the base year used in training and validation. In fact, performance measured on base-year data would lead to opposite conclusions, thus supporting the existing literature in the matter. For information purposes, performances of MLE and NN models trained and tested on the 1990 county-to-county flows are reproduced in Appendices A and B, respectively.

By all accounts, the evidence reviewed above that neural networks show inferior predictive performance relative to conventional statistical models is quite puzzling and unexpected. Neural networks are indeed regarded as good approximators (Kreinovich and Sirisaengtaskin, 1993). The data analysis calls for further research to pinpoint the causes of their underperformance. In order to trace potential patterns of consistent underprediction or overprediction by NN models, we use three-dimensional plots of observed and predicted data. Each plot displays flows originating from a given county against distance and number of workers at destinations. Such plots for a sample of four counties, namely Clayton, Cobb, DeKalb, and Fulton counties, are depicted in Fig. 3 for 1990. Corresponding flow surfaces generated by the five tested instances of 20-node neural networks (see Table 2(b)) are given in Fig. 4.

On examination, the predicted surfaces in Fig. 4 reveal unsuspected structures dominated by a wavy pattern of troughs and ridges. These structures are particularly pronounced in instance three (Fig. 4(c)), which also happens to be among the instances that predict 1990 flows with the least overall accuracy. This pattern is often symptomatic of overfitting due to excessive training of the network. That this network was trained longer than any other 20-node network suggests that it learned the noise in the training set in addition to the underlying function we want it to find. As a result, its ability to generalize is rather poor and its prediction accuracy is low, particularly where training data are sparse (interpolation problem). Another feature common to several underperforming network instances in Fig. 4 is the consistent underestimation of the largest 1990 expected flows (Figs. 3 and 4).⁶ Networks fail to extrapolate around and beyond the limits of the training sample. Possible explanations for interpolation and extrapolation errors are now pursued.

6 Most remarkable in this matter are instances 2 and 3, for which the training set does not include the Fulton-Fulton flow.

The spurious ridges and troughs that show throughout predicted surfaces suggest that overfitting may be occurring, in spite of the early stopping mechanism put in place to prevent it. In our commuter trip distribution problem, we can postulate that the occurrence of overfitting is tied to an excessive number of hidden nodes in the network. Our problem may in fact be simple enough to require fewer than five hidden nodes. After all, conventional spatial interaction models perform well with a single parameter. Such a neural network could be devoid of spurious ridges and troughs, and generalize just right.
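The early stopping mechanism mentioned above can be illustrated with a minimal sketch. It trains a toy one-hidden-layer sigmoid network by batch gradient descent and keeps the weights from the epoch at which the validation error was lowest. Everything in it is an assumption made for illustration (the synthetic data, the five hidden nodes, the learning rate, the check interval and the patience); it is not the software configuration used in the study.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_with_early_stopping(x_tr, y_tr, x_va, y_va, hidden=5, lr=0.5,
                              max_epochs=5000, check_every=100, patience=5,
                              seed=0):
    """Return (best_epoch, best_validation_mse, best_weights)."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 0.5, (x_tr.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.5, (hidden, 1))
    b2 = np.zeros(1)

    best_epoch, best_mse, best_weights = 0, np.inf, None
    checks_without_gain = 0
    for epoch in range(1, max_epochs + 1):
        # Forward pass on the training set.
        h = sigmoid(x_tr @ w1 + b1)
        out = sigmoid(h @ w2 + b2)
        # Gradient of half the mean squared error, backpropagated.
        d_out = (out - y_tr) * out * (1.0 - out) / len(x_tr)
        d_h = (d_out @ w2.T) * h * (1.0 - h)
        w2 -= lr * (h.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        w1 -= lr * (x_tr.T @ d_h)
        b1 -= lr * d_h.sum(axis=0)

        # Periodically score the held-out validation set.
        if epoch % check_every == 0:
            val_out = sigmoid(sigmoid(x_va @ w1 + b1) @ w2 + b2)
            val_mse = float(np.mean((val_out - y_va) ** 2))
            if val_mse < best_mse:
                best_epoch, best_mse = epoch, val_mse
                best_weights = (w1.copy(), b1.copy(), w2.copy(), b2.copy())
                checks_without_gain = 0
            else:
                checks_without_gain += 1
                if checks_without_gain >= patience:
                    break  # validation error stopped improving
    return best_epoch, best_mse, best_weights


# Hypothetical toy data: a noisy decaying response, scaled to [0, 1].
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, (300, 2))
y = np.clip(0.8 * x[:, 1:2] * np.exp(-3.0 * x[:, 0:1])
            + rng.normal(0.0, 0.02, (300, 1)), 0.0, 1.0)
best_epoch, best_mse, _ = train_with_early_stopping(x[:200], y[:200],
                                                    x[200:], y[200:])
print(best_epoch, round(best_mse, 5))
```

The returned epoch plays the role of the "epoch network was stopped" column in the tables. Early stopping of this kind limits how long a network trains, but it does not limit how complex a surface a network with many hidden nodes can produce.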

Fig. 3. Actual number of commuters in 1990 as a function of distance between county of residence and county of work, and the number of jobs in the county of work. All variables are measured as proportions of maximum 1980 values. The "Number of workers" axis is scaled logarithmically. Four counties of residence: a, Clayton County; b, Cobb County; c, DeKalb County; d, Fulton County.

Fig. 4. 1990 commuter flows from Fulton County predicted by the 20-node neural networks listed in Table 2; a, first instance; b, second instance; c, third instance; d, fourth instance; e, fifth instance. All variables are measured as proportions of their maximum 1980 values. The "Number of workers" axis is scaled logarithmically.

To test the proposition that the predictive performance of NN models is improved by reducing the number of hidden nodes, various neural networks with one to three hidden nodes are trained on 1980 county-level commuter flows and tested on 1990 flows. Results are summarized in Table 4. Networks with fewer hidden nodes suffer less from spurious troughs and ridges on their prediction plots, and therefore are less prone to overfitting. In fact, surfaces generated by one-hidden-node networks do not exhibit any spurious feature (Fig. 5). Such networks no longer model the noise in the training data because they are unable to produce complex surfaces. This does not, however, translate into unequivocally better goodness-of-fit with validation data for sparser networks. Furthermore, none of the networks tested with a reduced number of hidden nodes (Table 4) succeeds in outperforming the MLE doubly-constrained model with exponential function of distance (Table 1). A straightforward consequence is that the lower performance of neural networks cannot be imputed to overfitting and cannot be remedied easily by modifying the topology of the networks.
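For readers who wish to reproduce comparisons of this kind, the two goodness-of-fit statistics reported in the tables can be sketched as below. The forms used are assumptions based on common usage in the spatial interaction literature (see Fotheringham and Knudsen, 1987): the absolute error is taken as the sum of absolute deviations expressed as a percentage of total observed flow, and SRMSE as the root mean squared error divided by the mean observed flow. The function names and the four-flow example are hypothetical.

```python
import numpy as np


def absolute_error_pct(observed, predicted):
    """Sum of absolute deviations as a percentage of total observed flow."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.abs(observed - predicted).sum() / observed.sum()


def srmse(observed, predicted):
    """Root mean squared error divided by the mean observed flow."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((observed - predicted) ** 2))
    return rmse / observed.mean()


# Hypothetical observed and predicted flows for four origin-destination pairs.
obs = [10, 20, 30, 40]
pred = [12, 18, 33, 37]
print(absolute_error_pct(obs, pred))   # 10.0
print(round(srmse(obs, pred), 3))      # 0.102
```

Both statistics equal zero for a perfect prediction and have no upper bound, which is why SRMSE values above 1 can appear for poorly performing models.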

The fact remains that neural networks have a limited ability to interpolate spatial interaction data in a predictive mode. Paradoxically, the cause of this weakness may also be the essence of their strength in validation on contemporary data, namely the inherent flexibility to approximate complex data structures with great accuracy. In short, the poor fit of neural networks on prediction-year data (1990) can be blamed on their unrivaled fit to base-year data (1980). According to this view, neural networks are such good approximators that they model not only interaction data structures, but also the context of the transportation systems within which commuter patterns take place. By design, spatial interaction neural networks are context-dependent models whose parameters do not transfer well to other contexts. The extent of NN context sensitivity remains a subject for future study. A solution to this problem may come from the explicit incorporation of context dependencies in the network representation. Evidence in Table 4 suggests that model transferability is problematic even for sparse model topologies.

Table 4

Neural network models with few hidden nodes. Trained using 1980 county-to-county commuter flows, and tested using 1990 commuter flows

Instance    Absolute error (%)    SRMSE    Epoch network was stopped

(a) One-node networks
1           38.1                  1.592    2500
2           38.1                  1.245    1500
3           35.6                  1.666    27 000
4           41.8                  1.434    500
5           42.9                  1.480    1000
Average     39.3                  1.483

(b) Two-node networks
1           32.5                  1.276    3000
2           27.7                  0.962    37 000
3           64.8                  2.387    51 000
4           46.0                  2.002    1000
5           60.6                  2.262    20 000
Average     46.3                  1.778

(c) Three-node networks
1           31.1                  1.248    500
2           51.9                  2.143    500
3           41.9                  1.904    2000
4           43.6                  2.021    1000
5           38.0                  1.272    1000
Average     41.3                  1.718

It is our contention that the sigmoid form of network output limits the ability of neural networks to extrapolate interaction data in a meaningful way. Sigmoid output nodes tend indeed to generate S-shaped predicted surfaces that are ill-suited to model spatial interaction behavior. For illustration purposes, let us compare how flows predicted by the NN and conventional gravity models respond to distance as the other two input variables are held constant. Most NN flow surfaces (Figs. 4 and 5) have in common an S-shaped profile of dependence between flow volume and distance. This profile implies that, all other things being equal, the marginal flow increase with respect to distance is small and declining, sometimes even negative. On the contrary, observed patterns (Fig. 3) show no tapering in the relationship between flow volume and distance. Consequently, flow extrapolation on the S-shaped profile is highly inaccurate. Because enough of the 1990 flow data fall outside of the range of the 1980 training data, the overall performance of the network is generally poor. A significant implication is that conventional feedforward backpropagation NNs may not exhibit the right properties for use in the application domain of trip distribution. Other NN models that do not assume a sigmoidal activation function, such as the Gaussian Radial Basis Function model (Verleysen and Hlavackova, 1994), may prove better suited for spatial interaction problems.
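The saturation argument can be made concrete with a small numerical sketch. Since all variables are expressed as proportions of their maximum 1980 values, a 1990 flow larger than the largest 1980 flow has a scaled value above 1, which a sigmoid output node cannot reach. The target value of 1.3, the function gravity_flow and its parameter values are hypothetical and serve only as an illustration; the gravity expression shown is an unconstrained simplification, not the doubly-constrained model calibrated in the study.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gravity_flow(workers, residents, distance, beta, scale=1.0):
    """Unconstrained gravity-type flow with exponential distance decay."""
    return scale * workers * residents * np.exp(-beta * distance)


# A sigmoid output node stays below 1 whatever its net input.
net_inputs = np.linspace(-10.0, 10.0, 201)
sigmoid_ceiling = sigmoid(net_inputs).max()

# Hypothetical 1990 flow equal to 1.3 times the largest 1980 flow.
scaled_target = 1.3
smallest_possible_error = scaled_target - sigmoid_ceiling

# The gravity form is not bounded in this way: doubling the number of
# workers at the destination doubles the predicted flow.
base = gravity_flow(workers=0.8, residents=0.5, distance=0.2, beta=3.0)
grown = gravity_flow(workers=1.6, residents=0.5, distance=0.2, beta=3.0)

print(sigmoid_ceiling < 1.0, smallest_possible_error > 0.3, grown / base)
```

Whatever weights the network learns, its error on such a flow is at least the gap between the scaled target and 1, whereas the exponential form keeps responding proportionally to growth in its inputs.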

Fig. 5. 1990 commuter flows from Fulton County predicted by a single-node neural network. All variables are measured as proportions of maximum 1980 values.

In contrast to neural networks, the smooth surface generated by the MLE model with a negative exponential function of distance decay provides a better fit to the empirical data. A good fit is achieved not only for the data on which the model was calibrated, but also for the unseen data beyond the training range. This indicates that the maximum likelihood model is a better extrapolator than the neural network, and a better tool for urban and regional planning. A fundamental reason for better performance of the maximum-likelihood model is that, being a one-parameter model, it generalizes more than neural networks and, consequently, is more context independent. Also contributing towards better performance is its derivation from first principles, whereas the NN approach is purely data-driven. Wilson (1970) showed in his seminal work how the exponential distance decay function is derived from the entropy principle, by finding the most likely trip matrix given the origin and destination totals and the total distance traveled in the system. The principle of maximum likelihood applies to all trip matrices, regardless of their use for model calibration or model testing; hence the better extrapolation capability of the maximum-likelihood model.
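To make the structure of the doubly-constrained model concrete, the sketch below computes a trip matrix of the standard form T_ij = A_i O_i B_j D_j exp(-beta c_ij), where the balancing factors A_i and B_j are found by iterative proportional adjustment so that row sums match the origin totals O_i and column sums match the destination totals D_j. The distance decay parameter beta is taken as given here; its maximum likelihood calibration is not shown. The function name, the tolerance and the three-zone example are assumptions made for illustration.

```python
import numpy as np


def doubly_constrained(origins, destinations, cost, beta,
                       rtol=1e-10, max_iter=1000):
    """Trip matrix T_ij = A_i O_i B_j D_j exp(-beta * c_ij)."""
    origins = np.asarray(origins, dtype=float)
    destinations = np.asarray(destinations, dtype=float)
    decay = np.exp(-beta * np.asarray(cost, dtype=float))
    a = np.ones_like(origins)
    b = np.ones_like(destinations)
    for _ in range(max_iter):
        # Update each set of balancing factors given the other set.
        a_new = 1.0 / (decay @ (b * destinations))
        b_new = 1.0 / (decay.T @ (a_new * origins))
        converged = (np.allclose(a_new, a, rtol=rtol, atol=0.0)
                     and np.allclose(b_new, b, rtol=rtol, atol=0.0))
        a, b = a_new, b_new
        if converged:
            break
    return (a * origins)[:, None] * decay * (b * destinations)[None, :]


# Hypothetical three-zone system; origin and destination totals must agree.
resident_workers = np.array([100.0, 200.0, 300.0])
jobs = np.array([150.0, 250.0, 200.0])
distance = np.array([[1.0, 5.0, 9.0],
                     [5.0, 1.0, 4.0],
                     [9.0, 4.0, 1.0]])
trips = doubly_constrained(resident_workers, jobs, distance, beta=0.3)
print(trips.round(1))
print(trips.sum(axis=1).round(3), trips.sum(axis=0).round(3))
```

Because the marginal totals are imposed exactly, any forecast-year growth in workers or jobs is absorbed by the balancing factors, and a single parameter governs the response to distance.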

5.4. Geographic scale problem

The dramatically lower performance of neural networks on tract-level data suggests that additional factors are at work at this scale. The vast majority of commuter flows in tract-level trip matrices are zero (82.9% of all flows in the 1980 matrix, and 82.5% in the 1990 matrix), while most non-zero flows are fairly small. With only a small fraction of flows significantly larger than the rest, small random samples of training examples have little chance to include large flows. As a result, networks trained on random samples of examples primarily learn how to predict zero and very small flows. Since we established earlier that neural networks are rather poor extrapolators of spatial interaction flows, their predictions of larger flows are highly inaccurate. Hence the low overall performance of neural networks on small analysis zones.

Resorting to larger samples (say, more than 1000 cases), or even to the entire population of samples, is not a practical solution because it leads to unacceptably long training. An appealing alternative consists in using stratified random sampling instead of uniform random sampling in order to represent flows of all sizes in the training set. The effectiveness of this strategy is now assessed with two distinct stratified sampling schemes.

In strategy I, 20 examples of zero flows and 180 examples of non-zero flows are selected randomly without replacement from 121,104 interactions in 1980. In strategy II, we select 10 examples of zero flows, and 10 randomly-selected examples from each bin of origin-destination pairs defined by 10-unit increments on the flows. In both strategies, validation sets are selected similarly. The testing results for a 5-node network are presented in Table 5.

Table 5

Neural network models trained using 1980 tract-to-tract commuter flows, and tested using 1990 commuter flows with training and validation sets selected using stratified random sampling

Instance    Absolute error (%)    SRMSE    Epoch network was stopped

Sampling strategy I; Five-node networks
1           90.5                  3.407    8000
2           91.0                  3.673    1000
3           94.7                  3.766    500
4           90.8                  3.416    1500
5           94.6                  3.502    2000
Average     92.3                  3.553

Sampling strategy II; Five-node networks
1           93.4                  3.611    1000
2           91.0                  3.565    500
3           90.3                  3.540    500
4           90.9                  3.564    500
5           93.4                  3.613    1000
Average     91.8                  3.578

Comparison of these goodness-of-fit results to those of the five-node network trained on a simple random sample (Table 3) reveals no significant improvement. The five-node networks with training and validation sets selected using stratified random sampling have an average absolute error of 92.3% for sampling strategy I, and 91.8% for sampling strategy II, against 93.6% with uniform random sampling. This piece of evidence suggests that using stratified random sampling instead of uniform random sampling to select the training set does not improve the accuracy of NN spatial interaction models. More complex stratification strategies may produce better results, but we leave this investigation for the future.
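As an illustration of how a scheme such as sampling strategy II might be implemented, the sketch below draws a fixed number of zero flows and a fixed number of flows from each 10-unit bin, without replacement. The bin boundaries (1 to 10, 11 to 20, and so on), the handling of bins with fewer than 10 members, the function name and the synthetic flow vector are all assumptions; the study does not spell out these details in this section.

```python
import numpy as np


def stratified_sample(flows, n_zero=10, per_bin=10, bin_width=10, seed=0):
    """Return indices of a stratified sample of origin-destination flows."""
    rng = np.random.default_rng(seed)
    flows = np.asarray(flows)
    chosen = []

    # Stratum of zero flows.
    zero_idx = np.flatnonzero(flows == 0)
    if zero_idx.size:
        chosen.append(rng.choice(zero_idx, size=min(n_zero, zero_idx.size),
                                 replace=False))

    # Strata of non-zero flows: 1-10 -> bin 0, 11-20 -> bin 1, and so on.
    nonzero_idx = np.flatnonzero(flows > 0)
    bins = (flows[nonzero_idx] - 1) // bin_width
    for bin_id in np.unique(bins):
        members = nonzero_idx[bins == bin_id]
        chosen.append(rng.choice(members, size=min(per_bin, members.size),
                                 replace=False))

    if not chosen:
        return np.array([], dtype=int)
    return np.concatenate(chosen)


# Hypothetical flow vector dominated by zeros and small flows.
rng = np.random.default_rng(1)
flows = np.concatenate([np.zeros(800, dtype=int),
                        rng.integers(1, 11, 150),
                        rng.integers(11, 21, 40),
                        np.array([25, 37, 120])])
sample_idx = stratified_sample(flows)
print(len(sample_idx), sorted(flows[sample_idx])[-3:])
```

In this toy example the three large flows are always selected because their bins contain a single member each, whereas a uniform random sample of the same size would usually miss them.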

6. Conclusions

This study compared the performance of multilayer perceptron neural networks and maximum-likelihood doubly-constrained models for commuter trip distribution. Our experiments produced overwhelming evidence that NN models may fit data better but their predictive accuracy is poor in comparison to that of maximum-likelihood doubly-constrained models. What our thorough study failed to identify are perceptron model configurations that consistently exhibit a predictive performance that surpasses that of maximum-likelihood doubly-constrained models. The study points to several likely causes of neural network underperformance, including model non-transferability, insufficient ability to generalize, reliance on sigmoid activation functions, and their essence as data-driven techniques. An agenda for future research is also proposed to explore the potential for other perceptron formulations (i.e., spatial structure as NN input) and other neural networks (RBF, for instance) to predict spatial interaction flows with greater accuracy.

This conclusion is at variance with the existing literature, which has been overly optimistic about the advantages of modeling trip distribution by spatial interaction with backpropagation neural networks. While neural networks may perform better than conventional models in modeling spatial interaction for the base year, they fail to outperform the MLE doubly-constrained model for forecasting purposes, which is the motivation behind these modeling efforts in the first place. Therefore, current perceptron neural networks do not provide an appropriate modeling approach to forecasting trip distribution over a planning horizon for which distribution predictors (number of workers, number of residents, commuting distance) are well beyond their base-year domain of definition.

Acknowledgements

The authors are grateful to Dr. Frank Koppelman. His insightful comments on an earlier version of the manuscript were instrumental in enhancing its quality.

Appendix A

Maximum likelihood doubly-constrained models calibrated and tested using 1990 commuter flows among the counties of the Atlanta MSA.

Distance decay parameter (b)    Absolute error (AE) (%)    SRMSE

Appendix B

Neural network models trained and tested using 1990 county-to-county commuter flows in Atlanta

Instance    Absolute error (%)    SRMSE    Epoch network was stopped

Five-node networks
1           27.1                  0.723    100 000
2           18.2                  0.470    100 000
3           23.3                  0.634    100 000
4           18.3                  0.463    100 000
5           21.1                  0.520    100 000
Average     21.6                  0.562

Twenty-node networks
1           24.3                  0.585    100 000
2           15.2                  0.379    100 000
3           21.4                  0.554    100 000
4           27.3                  0.637    100 000
5           24.1                  0.636    100 000
Average     22.5                  0.558

Fifty-node networks
1           8.6                   0.169    100 000
2           8.4                   0.168    100 000
3           8.6                   0.166    100 000
4           8.7                   0.168    100 000
5           10.7                  0.212    100 000
Average     9.0                   0.176

References

Amrhein, C.G., Flowerdew, R., 1989. The effect of data aggregation on a Poisson regression model of Canadian migration. In: Goodchild, M., Gopal, S. (Eds.), The Accuracy of Spatial Databases. Taylor and Francis, London, pp. 229-238.

Bacharach, M., 1970. Biproportional Matrices and Input-Output Change. Cambridge University Press, Cambridge.

Batty, M., 1976. Urban Modeling: Algorithms, Calibrations, Predictions. Cambridge University Press, Cambridge.


Batty, M., Mackie, S., 1972. The calibration of gravity, entropy, and related models of spatial interaction. Environment and Planning A 4, 205-233.

Batty, M., Sikdar, P.K., 1982. Spatial aggregation in gravity models: 1. An information-theoretic framework. Environment and Planning A 14, 377-405.

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.

Black, W.R., 1995. Spatial interaction modeling using artificial neural networks. J. Transport Geography 3 (3), 159-166.

Bureau of the Census, 1983. 1980 Census of Population and Housing, Census Tracts, Atlanta, GA. PHC80-2-77. US Department of Commerce, Bureau of the Census, Washington.

Bureau of Transportation Statistics, 1993. 1990 Census Transportation Planning Package. US Department of Transportation, Bureau of Transportation Statistics, Washington. CD-ROM.

Dougherty, M., 1995. A review of neural networks applied to transport. Transportation Research C 3, 247-260.

Evans, A.W., 1971. The calibration of trip distribution models with exponential or similar cost functions. Transportation Research 5, 15-38.

Fahlman, S.E., 1989. Faster-learning variations on back-propagation: An empirical study. In: Touretzky, D., Hinton, G., Sejnowski, T. (Eds.), Proceedings of the 1988 Connectionist Models Summer School. Morgan Kaufmann, San Mateo, pp. 38-51.

Fischer, M.M., Gopal, S., 1994. Artificial neural networks: A new approach to modeling interregional telecommunication flows. J. Regional Science 34, 503-527.

Fotheringham, A.S., Knudsen, D.C., 1987. Goodness-of-fit Statistics. CATMOG series. Geo Abstracts, Norwich.

Fotheringham, A.S., O'Kelly, M.E., 1989. Spatial Interaction Models: Formulations and Applications. Kluwer, London.

Goodman, P.H., 1996. NevProp software, ver. 3. University of Nevada, Reno, NV. URL: http://www.scs.unr.edu/nevprop/.

Gopal, S., Fischer, M.M., 1996. Learning in single hidden-layer feedforward network: Backpropagation in a spatial interaction modeling context. Geographical Analysis 28, 38-55.

Haykin, S.S., 1998. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River.

Himanen, V., Nijkamp, P., Reggiani, A., 1998. Neural Networks in Transport Applications. Ashgate, Brookfield.

Hua, J., Faghri, A., 1994. Applications of artificial neural networks to intelligent vehicle-highway systems. Transportation Research Record 1453, 83-90.

Kikushi, S., Nanda, R., Perincherry, V., 1993. A method to estimate trip-O-D patterns using a neural network approach. Transportation Planning and Technology 17, 51-65.

Kreinovich, V., Sirisaengtaskin, O., 1993. Universal approximators for functions and for control strategies. Neural, Parallel, and Scientific Computations 1, 325-346.

Mozolin, M.V., 1997. Spatial interaction modeling with an artificial neural network. Discussion Paper Series 97-1, Department of Geography, University of Georgia, Athens, GA.

Openshaw, S., 1984. The Modifiable Areal Unit Problem. CATMOG 38. Geo Abstracts, Norwich.

Openshaw, S., 1993. Modeling spatial interaction using a neural net. In: Fischer, M.M., Nijkamp, P. (Eds.), Geographic Information Systems, Spatial Modeling and Policy Evaluation. Springer, Berlin, pp. 147-164.

Ortuzar, J. de Dios, Willumsen, L.G., 1994. Modelling Transport. Wiley, Chichester.

Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.

Rojas, P., 1996. Neural Networks: A Systematic Introduction. Springer, New York.

Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning representations by back-propagating errors. Nature 323 (9 October), 533-536.

Sarle, W.S. (Ed.), 1996. A list of frequently asked questions (FAQ), USENET: comp.ai.neural-nets. Available via anonymous FTP from ftp.sas.com/pub/neural/FAQ.html.

Slater, P.B., 1976. Hierarchical internal migration regions of France. IEEE Transactions on Systems, Man, and Cybernetics 6 (4), 321-324.

Smetanin, Y.G., 1995. Neural networks as systems for pattern recognition: A review. Pattern Recognition and Image Analysis 5, 254-293.

Smith, M., 1993. Neural Networks for Statistical Modeling. Van Nostrand Reinhold, New York.

Thomas, R.W., Hugget, R.J., 1980. Modeling in Geography: A Mathematical Approach. Barnes and Noble, Totowa.

Verleysen, M., Hlavackova, K., 1994. An optimized RBF network for approximation of functions. Proceedings of the European Symposium on Artificial Neural Networks, ESANN'94.

Weiss, N.A., 1995. Introductory Statistics. Fourth Edition. Addison-Wesley, Reading.

Williams, P.A., Fotheringham, A.S., 1984. The Calibration of Spatial Interaction Models by Maximum Likelihood Estimation with Program SIMODEL. Geographic Monograph Series, vol. 7, Department of Geography, Indiana University, Bloomington, IN.

Wilson, A.G., 1970. Entropy in Urban and Regional Modeling. Pion, London.

