1 Automated Transportation Mode Detection Using …docs.trb.org/prp/15-5826.pdf · 1 Automated Transportation Mode Detection Using Smart Phone Applications via ... This model not

Ansari Lari and Golroo 1

Automated Transportation Mode Detection Using Smart Phone Applications via 1

Machine Learning: Case Study Mega City of Tehran 2

3

4

Zahra Ansari Lari 5 Graduate Student 6

Department of Civil and Environment Engineering 7

Amirkabir University of Technology 8

Hafez Avenue, Tehran, Iran 9

Tel: +98 (21) 88336480 10

Email: [email protected] 11

12

Amir Golroo 13 Assistant Professor 14

Department of Civil and Environment Engineering 15

Amirkabir University of Technology 16

Hafez Avenue, Tehran, Iran 17

Tel: +98 (21) 64543010 18

Fax: +98 (21) 66414213 19

Email: [email protected] 20

21

22

23

24

25

Paper submitted for: 26 Presentation at 94th Annual Transportation Research Board Meeting, January 2015 and 27

Publication in the Transportation Research Record: Journal of Transportation Research Board 28

29

30

Word Count: 6147 + 1,750 = 7,897 words 31

(Text) (4 Tables + 3 Figures) 32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

mailto:[email protected]

mailto:[email protected]


ABSTRACT 1

Through the past few decades, travel behaviors have become more complicated especially in 2

mega cities such as Tehran, the capital of Iran. Decision makers require more accurate and 3

comprehensive information to plan for city transportation. As opposed to traditional paper-4

based and telephone-based surveys, a new efficient and effective data collection method has 5

been recently applied using information technology such as the GPS-based data collection 6

method which can track passengers’ trips. Having utilized this new method, the main aim of 7

this study is to analyze the collected data in order to distinguish transportation modes used by 8

passengers using a novel machine learning method called random forest. This model not only 9

classifies transportation modes i.e., car, bus, and walking at a high accuracy of almost 96%, 10

but also determines the most influential attributes in the process of classification based on two 11

importance indices: mean decrease accuracy and Gini index. Results show that instant speed 12

and accuracy of GPS track are the most influential attributes in the transportation modes 13

classification. Transportation planners benefit a lot from this accurate and comprehensive 14

travel behaviors data (used modes) for policy making. 15

16

17

18

19

Keywords: GPS-Based Data collection, Mode Recognition, Random Forest, Feature Selection. 20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51


INTRODUCTION 1

Over the past few decades, there is a growing attention toward transportation modes share, the 2

sequential of used modes in a trip, origin-destination travelers’ information, and travelers’ 3

characteristics, amongst different groups of policymakers. Urban planners and city officials, 4

on one hand, have found this issue as a useful tool to conduct their programs in a way of 5

providing the appropriate and adequate transport services for future travel demands. Health 6

officials, on the other hand, use this information to take feasible and effective actions according 7

to the global problem of obesity among people. 8

One of the essential information required for transportation planning is to understand 9

passengers’ travel behavior in cities. In other words, understanding the modes used, origins, 10

and destinations of daily trips could lead to comprehensive transportation planning. This 11

planning needs a data base with a large number of observations at a high quality. One of the 12

major methods of acquiring data is to conduct a survey. Travel surveys have been undertaken 13

through different methods by various researchers to collect required data (1-3). The first 14

method of collecting travel data was paper-based questionnaires followed up computer-assisted 15

telephone interview (CATI) which reached the survey participants who submitted the paper 16

survey. This method was dependent on respondents’ memory and contained indispensable 17

errors. Also, in the best condition a respondent was able to only describe his/her trips for one 18

or two days without even an acceptable accuracy. 19

The recent progress in information technology gives researchers an excellent 20

opportunity to gather the travel information more efficiently and effectively i.e., not only time 21

and budget are saved, but significantly more data is collected. One of the useful tools which 22

receives attention by transportation planners is GPS-based technology data collection methods. 23

GPS traces are used widely for collecting activity-based travel diary data during recent years 24

(4-7). 25

This study focuses on analyzing data through presenting a novel machine learning 26

modeling formulation to use raw GPS data for mode recognition. Machine learning methods 27

have performed well in both complex regression and classification problems. The formulation 28

used in this study is a random forest which was proposed for the first time in 2001 by Breiman 29

(8). Machine learning methods have been well performed in classification problems. Random 30

forest is an ensemble classifier which is combination of N tree predictors (8). It is generally 31

found that ensemble methods are more accurate than any of the individual classifiers creating 32

the ensemble (9). An ensemble could be aggregate weighted or unweighted votes from each 33

individual classifier to classify an unknown observation. In addition, feature selection aspect is 34

examined based on a random forest model in this study. Three modes are examined in this 35

study: car, bus, and walking. Subway and commuted rails trips are not being included in the 36

model, because almost all the lanes of this mode in Tehran are underground so GPS records 37

just showed the origin and destination stations which cannot result in mode recognition 38

modeling process. One way to classify these trips is to utilize GIS maps including origin and 39

destination stations to determine the associated mode and path. 40

41

LITERATURE REVIEW AND BACKGROUND 42

Extensive studies have conducted on travel behavior using GPS-based data owing to a high 43

level precision, reliability, efficiency, effectiveness, time-saving, and capability of providing a 44

large potential sample size (4-7 and 10-12). This is mainly because of significant advancements 45

in information technology. 46

Vehicle-based GPS data collection was the first devices used to record GPS traces (13 47

and 14). This method had many limitations e.g., not covering other frequently used modes such 48

as walking or public transport (15). To overcome this problem, passengers were equipped with 49

a GPS device to collect data (4 and 16-19). However, the investment for this method was too 50


high (15). It also was a heavy device to be carried by a passenger. Besides, it was a chance that 1

a passenger forgets to take the device with him/ her during daily trips (15). 2

By the advent of smart phones, the GPS-based data collection methods enormously 3

shifted to a more convenient and comprehensive way. Although, it overcomes some problems 4

of previous methods (20 and 21), it has its own limitations at the beginning. For instance, the 5

first smart phone applications were very battery consuming and they were just capable to record 6

data about 5 to 6 hours a day (20). Battery life is still a big issue impacting the more widespread 7

use of smartphones in GPS data collection. Also, these first generation applications did not 8

allow passengers to use the smart phones while it was capturing data (20). This issue might 9

lead to some changes in him/ her plans which could provide not realistic data base and 10

discourage people to take part in the survey. Afterwards, some researches (15 and 17) have 11

been conducted to optimize the process of smart phone-based data collection using GPS. 12

Despite these limitations, this method is still widely used and has several benefits such as being 13

able to use a consistent data collection approach across the sample and not allowing participant 14

smartphone biases interfere with data availability. 15

In terms of accuracy and prediction power, numerous machine learning methods have 16

been applied. Machine learning algorithms are well known in mining data and finding 17

underlying patterns across attributes in order to gain the ability of modeling and predicting. 18

Data mining is the process of data collection, data preparation, data exploration, data analysis, 19

and final inferences. So, the main task after data collection is how to analyze the data in a 20

general term. Several methodologies have been developed to process GPS traces for mode 21

recognition (10-12 and 22-23). The methodologies are divided in to two main categories: 22

procedural and machine learning methods (10). In procedural methods, the mode identification 23

is based on logical assumptions about passengers’ travel behavior such as using more public 24

transportation if a passenger is close to the stations (24). Or, according to passengers’ safety 25

concerns, they might less likely travel after mid-night (25) especially walking. In 2008, Stopher 26

et al. (26) presented a framework based on speed, acceleration, and some additional hierarchical 27

assumptions to distinguish between different modes. For example, availability of a bicycle at 28

home and low speed, and acceleration rate are hints to infer that the passenger used a bicycle 29

to complete his/her trip. This method overlooks the chance of presence of same attributes in 30

different modes such as velocity distribution and also needs more than just raw GPS (Global 31

Positioning System) data to assign a proper mode to a trip. Similarly, Chung and Shalaby in 32

2005 (24), and Liao, Fox, and Kautz in 2007 (25) investigated rule-based methods especially 33

with regards to environmental and situational factors for mode recognition. 34

Machine learning methods have been applied widely in this field of study because of 35

their extensive ability to analyze data through complex approaches. Machine learning methods 36

have the opportunity to find underlying rules in recondite data and utilize them to forecast new 37

observations with a high accuracy. 38

Other studies related to mode recognition have focused on using various available sensors and 39

methods to investigate the most conforming pattern for mode (12, 27-30). Each sensor has its’ 40

own property that could lead to an acceptable outcomes. For example, GIS (Geographic 41

Information System) data are very useful information especially for some modes such as buses 42

or subways. Moreover, using GSM (Global System for Mobile) data need dense GSM cell 43

towers to have reliable data which they are not always available (12). Table 1 shows the list of 44

latest studies in which various machine learning methods have been used. This table also 45

expresses methodology accuracy, different modes detected, and applied sensors. 46

47

48

49

50

51


Table 1: Summary of Studies Related to Sensor-Based Data Collection 1

First Author

Yea

r

Methodology

Acc

ura

cy (

%)

Modes Sensors

Car

Wal

k

Bik

e

Bu

s

Tra

in \

sub

way

GP

S

GIS

GS

M

Wi-

Fi

Acc

eler

om

eter

Feng (22) 2013 Bayesian Belief Network 96 × × × × × ×

Feng (22) 2013 Bayesian Belief Network 81 × × × × ×

Feng (22) 2013 Bayesian Belief Network 96 × × × × ×

Bolbol (10) 2012 Support Vector Machine 88 × × × × × ×

Stenneth (23) 2011 Bayesian Network 74.9 × × × × × ×

Stenneth (23) 2011 Decision Tree 66.9 × × × × × ×

Stenneth (23) 2011 Random Forest 75.4 × × × × × ×

Stenneth (23) 2011 Naïve Bayesian 71.8 × × × × × ×

Stenneth (23) 2011 Multilayer Perceptron 59.1 × × × × × ×

Stenneth (23) 2011 Bayesian Network 92.5 × × × × × × ×

Stenneth (23) 2011 Decision Tree 92.2 × × × × × × ×

Stenneth (23) 2011 Random Forest 93.7 × × × × × × ×

Stenneth (23) 2011 Naïve Bayesian 91.6 × × × × × × ×

Stenneth (23) 2011 Multilayer Perceptron 83.3 × × × × × × ×

Manzoni (27) 2011 Decision Tree 82.14 × × × × × × × ×

Zhang (31) 2011 Support Vector Machine 93 × × × ×

Xu (32) 2010 Fuzzy Logic 93.8 × × × × ×

Reddy (12) 2010

Decision Tree followed

by first-order discrete

Hidden Markov Model

93.6 × × × ×

Zheng (33) 2008 Support Vector Machine 51.7 × × × × ×

Zheng (33) 2008 Decision Tree 72.1 × × × × ×

Zheng (33) 2008 Bayesian Net 57.7 × × × × ×

Zheng (33) 2008 Conditional Random

Field 61.7 × × × × ×

Gonzalez

(34) 2008 Neural Network 91.23 × × × ×

Mun (35) 2008 Decision Tree 83 × × × × ×

2

Table 1 shows the variety of methods used in the field of mode recognition. Most of the 3

researchers have examined at least three different modes of walking, car, and bus which are 4

more frequently used. Different levels of accuracy have been achieved which the highest one 5

by using only a raw GPS data is 93.8%. The level of accuracy depends on the associated applied 6

algorithm, the quality and the size of the data base, the defined attributes, and assumptions 7

made in each methodology. 8

Another aspect which has not received enough attention to date is the concept of 9

variable importance. Identifying influential variables assists to investigate more underlying 10

factors which leads to interpret data comprehensively and to forecast new observation with 11

higher accuracy. 12

Bolbol et al. (10) investigated important variables with analysis of variance (ANOVA) test to 13

select the best features which separately describe each mode. They examined the role of four 14

main attributes for each mode: speed, acceleration, distance, and heading differences. They 15

stated that acceleration and heading changes are the most important attributes in the modes of 16


car and train, respectively. Also, they generally declared that speed and acceleration are the 1

best attributes for distinguishing between transportation modes. 2

Stenneth et al. (23) applied the algorithms of Chi squared and information gain. They 3

concluded that the first ranking attributes were: average speed, average rail line closeness, 4

average bus closeness, and average acceleration. The researchers used GIS information in 5

addition to GPS data. The most important variables according to GPS were average speed and 6

average acceleration. 7

8

OBJECTIVE AND SCOPE 9

This study aims to introduce a novel machine learning method called random forest to not only 10

provide higher accuracy in mode classification but also to introduce two common indexes for 11

feature selection based on a random forest model: a mean decrease index, and a Gini index 12

which specifically determine important attributes in each mode separately. This study only uses 13

data collected using smart phones via GPS and accelerometer sensors. It also focuses on 14

detecting three transportation modes including car, bus, and walking. 15

16

METHODOLGY 17

After a detailed literature review, the framework was created for this study. The first step was 18

to collect travel data. For this aim, a group of participants were trained to record their travel 19

data (GPS tracks) via an appropriate application. Then, data pre-processing was carried out 20

leading to define attributes for each GPS track based on collected data such as acceleration and 21

delta accuracy. After that, different machine learning methods were examined to accomplish 22

the model specification phase in order to find an appropriate model resulting in mode 23

classification and feature selection at a high level of performance. Finally, the accuracy of mode 24

classification and attribute effectiveness for each mode were discussed. 25

In this study, a random forest formulation is applied to investigate underlying factors in a GPS 26

data to not only classify each mode at a high accuracy but also determine influential attributes 27

in describing each class. 28

29

Random Forest Technique Review 30

Random forest consists of N tree classifiers. Creating each tree required two types of 31

information: a subset of training data and the related attributes. The procedure in building a 32

tree is to split the attributes in each node and continue to a next level until a final leaf consists 33

of an ultimate answer generated. In each node the best attribute is selected for decision tree 34

induction based on the entropy formula (Eq. 1) (36). 35

𝐸𝑛𝑡𝑟𝑜𝑝𝑦 (𝑆) = ∑ −𝑝𝑖 log2 𝑝𝑖

𝑐

𝑖=1

(1) 36

Where S denotes the entropy for each attribute, c is the number of different values in 37

each attribute, and 𝑝𝑖 is the proportion of S belonging to class i. 38

The novelty of ensemble methods is that they aggregate number of individual classifiers 39

outcomes and report the most popular result as a final vote which is the category of the input 40

variables (37). This approach is well performed in a random forest model based on 41

randomization which makes the model more powerful for classification (38). The fundamental 42

of random forest is based on two randomization approach: bagging and random attribute 43

selections (38). 44

In the first step, for developing each tree in a forest a subset of data is randomly drawn 45

with replacement from the original data to be used as a training data which is called 46

bootstrapping. The portion of the data which is not selected is called out-of-bag data and used 47

as a testing set. Similarly, the number of attributes is extracted randomly with replacement from 48

all the existing ones. Then, a tree is generated using these training set and attribute selection. 49


Also, each tree is evaluated by its testing data. The number of misclassified observations is 1

reported as the out-of-bag error rate and is used to evaluate variable importance in the final 2

result. This type of testing provides fair estimation of accuracy of the model compared to the 3

usual test approaches (39). Next, this procedure is undertaken to building N different trees in a 4

forest. 5

In addition to the excellent ability of a random forest in classification and prediction, 6

the model reports the measurement of influential attributes based on permutation process in the 7

original data (8). Subsequently, to compute the effectiveness of each attribute in each class 8

separately and measure the importance level of attributes in a whole data, the quantity of 9

associated attribute changes randomly to incorrect values in the data. Then, the model is 10

developed again using the permuted data set (8). After that, the assigned class to each 11

observation is compared with the correct one. If the number of error in classification of the data 12

is considerable, it is concluded that this attribute has a major effect in the process of modeling 13

and predicting (8). The average importance of each attribute is calculated both in each class 14

and for a whole data called mean decrease accuracy index. Variable importance of 𝑋𝑗 is 15

calculated using Equation2. (40): 16

𝑉𝐼(𝑋𝑗) =1

𝑛𝑡𝑟𝑒𝑒∑(𝑒𝑟𝑟𝑜𝑟𝑂𝑂𝐵𝑓 − 𝑒𝑟𝑟𝑜𝑟𝑂𝑂𝐵𝑓𝑗

)

𝑓

(2) 17

Where VI(Xj) is the measurement of importance of attribute 𝑋𝑗 , the sigma is over all 18

trees 𝑓 of a forest and 𝑛𝑡𝑟𝑒𝑒 is the number of built trees in a forest. 𝑒𝑟𝑟𝑜𝑟𝑂𝑂𝐵𝑓 is the error 19

rate before permutation and 𝑒𝑟𝑟𝑜𝑟𝑂𝑂𝐵𝑓𝑗 is the error rate after the permutation of variable 𝑋𝑗. 20

Another outcome of a random forest model is called mean Gini index (8). According to related 21

studies (41-45), this measure recommended when all attributes are either numerical or 22

categorical because it is sensitive to the kinds of attributes. This index is also an attribute 23

selection measure similar to the mean decrease accuracy index. However, its amount calculated 24

the impurity of an attribute regarding to the classes (8). In this procedure, at each node, the 25

reduction in Gini impurity is reported for each variable which is used for splitting (40). Gini 26

impurity ∆𝐺𝐼(𝑡) is calculated using Equation 3 (40): 27

∆𝐺𝐼(𝑡) = 𝑃𝑡𝐺𝐼(𝑡) − 𝑃𝐿𝐺𝐼(𝑡𝐿) − 𝑃𝑅𝐺𝐼(𝑡𝑅) (3) 28

Where, 𝐺𝐼(𝑡𝐿) is a Gini index on the left side of the node, 𝐺𝐼(𝑡𝑅) is a Gini index on the 29

right side of the node, 𝑃𝑡 is the number of observations before the split, 𝑃𝐿 is the number of 30

observations on the left side after the split, and 𝑃𝑅 is the number of observations on the right 31

side after the split and GI(t) is the Gini index defined by Equation 4 : 32

𝐺𝐼(𝑡) = 1 − ∑ 𝑝(𝑘|𝑡)2𝑘 (4) 33

Where 𝑝(𝑘|𝑡) is the rate at which class k is distinguished correctly at node t. The 34

average of all reductions in the Gini index impurity used to calculate the Gini importance 35

measure. 36

It should be noted that the domain of feature selection and variable importance measurement 37

is a wide subject in machine learning studies (46-48).The methods are sensitive to the number 38

of attributes and observations which is out of scope of this paper (49 and 50). 39

40

DATA COLLECTION 41

Data collection was conducted in December 2013 over 2 weeks in Tehran, the capital of Iran. 42

GPS traces were collected using smart phones equipped with GPS through an appropriate 43

application. Participants were employed to run the application for all day long, between 6 a.m. 44

till 9 p.m., and collect their trip data with their consent. A sample data collected by a participant 45

from home to work in the morning and back from work to home in the afternoon is plotted on 46

the map as shown in Figure 1. This sample travel tracking data encompasses all three desired 47

modes including walking, bus, and car. The participants are 25 males and ten females. All of 48


them are students between 19 and 25 years of age. Prior to starting data collection, participants 1

were trained. In the training session, the objective of the study, method of working with the 2

smart phone application, procedure of filling complement data form for validation (data 3

labeling) were discussed in details. In order to label data with correct classes, participants were 4

asked to report the used modes in complement data form and attach to each GPS file. Finally, 5

more than 245,418 lines of GPS traces were collected that contains three modes: Car, bus, and 6

walking. The segments, time, date, instant speed, accuracy, bearing, altitude, latitude, and 7

longitude were recorded per GPS track in each trip. Table 2 provides a descriptive statistics for 8

each attribute in each class. As shown in Table 2, the bus average speed is larger than that of 9

cars because especial separated lanes are dedicated to a large number of buses in Tehran called 10

Bus Rapid Transit (BRT) moving without traffic interference. However, the car maximum 11

speed is larger than that of buses. The amount of standard deviation in delta bearing for walking 12

is much more than delta bearing of the other modes, since in walking a person has more 13

freedom to change his/her direction. Moreover, the average accuracy in walking is less than 14

the accuracy of other two modes. The reason could be because of the walking low speed which 15

results in capturing more data point in a unit of distance than the other modes. So, the deviation 16

(in a unit of distance) of GPS traces from the main path for walking is more than that of others. 17

18

TABLE 2 Descriptive Statistics of Used Attributes in the Model 19

Attribute Average Std.Dev. Min Max

Delta Bearing (deg)

Bus

-0.447 E-01 29.324 -357 358

Accuracy (m) 6.610 2.201 3.194 47.994

Speed (m/s) 23.346 10.535 0.251 31.618

Delta Speed (m/s) -0.670 E-01 1.187 0 12.005

Acceleration (m/s2) -0.257 E-01 0.535 -19.78 4.729

Delta Acceleration (m/s2) -0.359 E-02 2.257 -12.244 126.960

Delta Bearing (deg)

Car

-0.661 E-01 43.316 -359 359

Accuracy (m) 6.413 4.913 1.478 49.999

Speed (m/s) 10.376 6.181 0.250 35.237

Delta Speed (m/s) -0.503 E-02 0.968 0 29.088

Acceleration (m/s2) 0.395 0.718 -13.258 28.690

Delta Acceleration (m/s2) -0.107 E-01 0.820 -39.120 28.762

Delta Bearing (deg)

Wal

kin

g

-0.537 75.086 -359 359

Accuracy (m) 12.805 10.398 1.706 49.999

Speed (m/s) 1.572 0.933 0.25 5.490

Delta Speed (m/s) -0.514 E-01 1.085 0 4.954

Acceleration (m/s2) 0.596 0.589 -8.435 4.408

Delta Acceleration (m/s2) 0.251 E-01 0.881 -19.539 10.561

20

21


1 FIGURE 1 Plotted GPS Traces for a Sample Work Trip 2

3

MODELING 4

The first step in all modeling procedures is to clean the data set called data preparation. In the 5

case of this study, especially, for the different ranges in accuracy of recording data and also 6

because speed, which is an important attribute, is calculated based on the three geographical 7

parameters: altitude, longitude, and latitude there are an indispensable errors. To overcome this 8

problem, speed limit individually defined for different modes. For motorized modes this 9

amount is assumed to be equal to 35 meter per second (125 kilometer per hour).For the mode 10

of walking, the maximum speed reported 5 meter per second based on fuzzy member ship 11

modeling studied by Zhang et al (11). However, in order to cover fast walking (based on local 12

observations in this study), the maximum speed is assumed to be equal to 5.5 meter per second 13

in the modelling process. 14

As mentioned before, a random forest model is applied in this study to present a new 15

and advanced machine learning method in order to be used for mode classification from raw 16

GPS data. New attributes are defined herein based on the raw data including acceleration (speed 17

changes per second), acceleration differences per second, bearing changes per track, and 18

accuracy changes per track. This new attributes are defined based on some logical rules for 19

each mode. For example, acceleration and its' changes per second cannot be very large for 20

walking. Or, bearing changes in motorized modes is not the same as walking i.e.in walking a 21

passenger has more freedom to suddenly change his/ her direction than a passenger in a 22

motorized mode does. However, the attributes which are going to be used in a data mining 23

should have same properties. For instance, the variables which make an observation very 24

unique such as time, longitude, altitude, and latitude should not be used as an attribute in a 25

model. 26

In developing a random forest model there are some parameters which should be 27

rigorously selected in order to gain reliable and acceptable results (8). One of these parameters 28

is the number of trees that should be built in a forest. The second parameter is the number of 29

random variables used at each node. There has not paid enough attention to the best and 30

efficient number of trees by researchers. Breiman claimed that the best number of trees is the 31

number at which the out-of-bag error is almost stable (8). Moreover, the amount of this 32

parameter is more relied on the number of observations, number of variables, and the desired 33

accuracy which is unique in each problem (51). In this study, four different forests examined. 34

Forest with 120, 140, 160, and 180 trees, and the forest with 160 trees was selected as a final 35

model. The error rates for four abovementioned modeling were 4.18%, 4.19%, 4.16%, and 36

4.17%, respectively, (as the amount of the splitting attributes at each node was equal to three). 37

Since almost the four applied models has equal error rate, the amount with the minimum error 38

was selected. 39

The next parameter is the number of attribute which should be selected at each node for 40

splitting. In classification problems, it is recommended to use the square root of numbers of 41

attributes for this parameter (51). However, both numbers of second and third parameters are 42

modeled to find the best fit. The error rates when two attributes used at each node are 4.46%, 43

Walking Bus

Car

Subway Car


4.43%, 4.43%, and 4.45% for 120, 140, 160, and 180 trees in a forest, respectively. Therefore, 1

the smallest out-of-bag error is preferred and it was fixed to 3 in the final model. 2

3

RESULTS AND DISCUSSION 4

This section devotes to the results from the applied random forest model. As noted before, the 5

main findings are divided in to two parts: the accuracy of mode detection and the feature 6

selection for each mode presented in the following section. 7

8

Mode Classification 9

The first outcome a random forest model is a confusion matrix (Table 3). In this matrix, the 10

number of misclassified observations from out-of-bag data is reported across the forest in each 11

class. Breiman (8) stated that in a random forest model it is not necessary to test the model on 12

a separate test subset, instead the out-of-bag error rate could be reported. This amount is equal 13

to 4.16% in this study. In other word, the model accuracy is about 95.84%. As shown in Table 14

3, the accuracies in bus, car, and walking separately are 87.93%, 97.68%, and 90.33%, 15

respectively. As discussed before, the accuracies depends on the applied algorithm, quality and 16

quantity of the sample size in each class, and the defined attributes (52). Since, the number of 17

observations in mode of bus was less than the other classes, the process of learning performs 18

weaker than the other mode which leads to the smaller accuracy rate. Overall, the average 19

accuracy rate over all modes is higher compared to previous studies. The highest achieved 20

accuracy in the recent studies was 93.8 for modes of walking, bike, and bus through fuzzy logic 21

method conducted by Xu et al. (32) (without considering car as a mode). 22

23

Table 3 Confusion Matrix 24

Mode Bus Car Walk Accuracy (%)

Bus 2967 367 40 87.93

Car 12 37176 871 97.68

Walk 3 757 7100 90.33

25

However, to assure the accuracy of modelling, original data randomly divided into two 26

separate subsets consisting of 30% and 70% of data used for testing and training, respectively. 27

The results are consistent with the random forest out-of-bag error rate. Table 4 shows the error 28

rate in each class for the test data. In total the accuracy is equal to 96.91%. As presented in 29

Table 4, the accuracy associated with each mode is almost similar to what achieved from the 30

results of out-of-bag error rates. The more misclassified observations in the mode of bus are 31

due to presence of cars. The reason is that cars and buses have similar attributes such as speed 32

and acceleration. The same issue occurs for the mode of car owing to presence of the walking 33

mode. The reason may probably be the existence of similar attributes in walking and driving 34

cars during peak hours at very low speed. 35

36

Table 4 Test Results 37

Mode Bus Car Walk Accuracy (%)

Bus 870 124 16 86.14

Car 6 11200 295 97.38

Walk 1 210 2071 90.79

38

Attribute Effectiveness 39

As discussed before, one of the important results of a random forest model is the measure of 40

attribute effectiveness in describing each class in the both process of modeling and predicting. 41

The calculated measures are sorted in a decent order for each class separately and for the whole 42


data. Figure 2 depicts the mean decrease accuracy index. The attribute that have the largest 1

quantity of the index is the most influential factor in describing the data and in the process of 2

prediction. 3

4

5 (a) (b) 6

7 (c) (d) 8

FIGURE 2 Results of Mean Decrease in Accuracy (a) All modes (b)Walking (c)Car 9

(d)Bus 10

11

As expected, in consistent with last studies (10 and 23), speed has the most important 12

role in each class. The second effective factor is the accuracy of tacking GPS data. This attribute 13

has not received enough attention in recent studies (10 and 23). Figure 1, also, illustrates the 14

comparative relationships between attributes in each class. The effect of speed is much higher 15

than other attributes in using bus. On the other hand, the next attribute (accuracy of tracking 16

GPS) in other modes has a relatively large impact on describing the data. 17

Another attribute importance measurement which is mentioned before is Gini index. Similar to 18

mean decrease accuracy, the attribute which has the highest amount of the Gini index is 19

reported in Figure 3. 20

050

100150200250300

Am

ou

nt

of

Mea

n D

ecre

ase

Acc

ura

cy

Variables

All Modes

050

100150200250300

Am

ou

nt

of

Mea

n D

eacr

ease

Acc

ura

cy

Variables

Walking

020406080

100120140

Am

ou

nt

of

Mea

n D

ecre

ase

Acc

ura

cy

Varibales

Car

050

100150200250

Am

ou

nt

of

Mea

n D

ecre

ase

Acc

ura

cy

Variables

Bus


1 FIGURE 3 Results of Mean Decrease Gini for All Modes. 2

3

Kawakubo et al. (40) claimed that the Gini index and mean decrease accuracy is not 4

significantly different, whereas the first important factors are almost the same. Similarly, in our 5

model both attribute importance indexes state that speed is the most influential attribute in the 6

modeling procedure. And, the second influential attribute is the accuracy of tracking GPS. 7

8

CONCLUSION 9

The major aim of this study was to present a new machine learning method and the critical role 10

of defining attributes of classification algorithms for mode recognition and influential factors 11

determination using GPS data. Three modes were considered including car, bus, and walking. 12

A random forest model was applied. The outcomes of this model, not only provided higher 13

accuracy in classification compared to previous studies, but also determined the most important 14

feature in classifying each mode. The following conclusions have been derived: 15

1. The transportation mode detection was accomplished using a machine learning method 16

(forest model) with an overall accuracy of almost 96% (more than previous studies). 17

2. In consistency with last studies, the first most important attribute to detect modes was 18

instant speed in the process of modeling and prediction. 19

3. The second most effective attribute was related to the accuracy in GPS tracking data. 20

Although GPS tracking accuracy has not been examined enough to date, it performed well in 21

the random forest model. 22

FUTURE WORK 23

The future studies shall be concentrated on the modes which have the same attributes. For 24

example, the speed ranges are very similar in walking and car in peak hours. Another 25

investigation should be conducted on recognizing different motorized modes. 26

27

REFERENCES 28

1. Schlich, R., Axhausen, K.W. (2003). Habitual travel behaviour: evidence from a six-29

week travel diary. Transportation 30(1), pp.13–36. 30

2. Samimi, A., & Ermagun, A. (2012). Students' Tendency to Walk to School: Case Study 31

of Tehran. Journal of Urban Planning and Development. 32

3. Liu, S., Murray-Tuite, P., & Scweitzer, L. (2014). Incorporating Household Gathering 33

and Mode Decisions in Large-Scale No-Notice Evacuation Modeling. Computer- 34

Aided Civil and Infrastructure Engineering 29 (2014) pp.107-122. 35

4. Adul, J., Williams, C., Mohammadian, A., & Nelson, P. (2009). An automated 36

GPS_based prompted recall survey with learning algorithms. Vol. 1, Issue 1, pp. 59-37

79. 38

0100020003000400050006000700080009000

10000A

mo

un

t o

f M

ean

Dec

reas

e G

ini

Attributes

All Modes


5. Shoval, N., Wahl, H.W., Auslander, G., Isaacson, M., Oswald, F., Edry, T., Landau, R., 1

& Heinik, J. (2011). Use of the global positioning system to measure the out-of-home 2

mobility of older adults with differing cognitive functioning. Ageing and Society, 3

Vol.31, Issue 05, pp 849-869. 4

6. Ohmori, N., Muromachi, Y., Harata, N., Ohta, K. (2000). Travel Behavior Data 5

Collected Using GPS and PHS. Traffic and Transportation Studies (2000): pp. 851-858. 6

7. Doherty, S.T., Noel, N., Gosselin, M-L, Sirois, C., & Ueno, M. (2001). Moving beyond 7

observed outcomes: Integrating global positioning system in interactive computer-8

based travel behavior surveys. Transportation Research Board, p.p. 449-466. 9

8. Breiman, L. (2001). Random forests. Machine Learning 2001, 45:5-32. 10

9. DIETTERICH, T.G., 2002, Ensemble Learning. The handbook of brain theory and 11

neural networks, M.A. Arbib (Ed.) (Cambridge, MA: The MIT Press). 12

10. Bolbol, A., Cheng, T., Tsapakis, I., & Haworth, J. (2012). Inferring hybrid 13

transportation modes from sparse GPS data using a moving window SVM 14

classification. Computers, Environmental and Urban Systems 36 (2012), p.p.526-537. 15

11. Zhang, L., Dalyot, S., Eggert, D., Sester, M. (2011). Multi-stage approach to travel-16

mode segmentation and classification of GPS traces. International Archives of the 17

Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume 18

XXXVIII-4/W25, 2011 19

ISPRS Guilin 2011 Workshop, 20-21 October 2011, Guilin, China 20

12. Reddy, S., Mun, M., Burke, J., Estrin, D., Hansen, M., & Srivastava, M .(2010). Using 21

Mobile Phones to Determine Transportation Modes. ACM Transportations on Sensor 22

Networks, Vol.6, No.2, Article 13, 23

13. Wolf, J., Guensler, R., & Bachman, W. (2001). Elimination of the Travel Diary: 24

experiment to Derive Trip purpose from Global Positioning System Travel Data. 25

Transportation Research Record, (1768), pp. 125-134 26

14. L.Forrest, T., & F.Pearson, D. (2005). Comparison of Trip Determination Method in 27

Household Travel Surveys Enhanced by Global Positioning system. Transportation 28

Research Record, (1917), pp.63-71. 29

15. Safi, H., Mesbah, M., Ferreira, L. (2013). ATLAS Project- Developing a mobile-based 30

travel survey. Australasian Transport Research Forum 2013 Proceedings. 31

16. Elango, VV & Guensler, R (2010), An Automated Activity Identification Method for 32

Passively Collected GPS Data, paper presented to The 3rd Conference on Innovations 33

in Travel Modeling, Tempe, Arizona, USA. 34

17. Shen, L & Stopher, P (2011), In-Depth Comparison of Global Positioning System and 35

Diary Records, Transportation Research Record, (2246), pp. 32-37. 36

18. Roorda, MJ, Shalaby, A & Saneinejad, S (2011), Comprehensive Transportation Data 37

Collection: Case Study in the Greater Golden Horseshoe, Canada, Journal of Urban 38

Planning and Development, vol. 137, p. 193. 39

19. Gong, H, Chen, C, Bialostozky, E & Lawson, CT (2012), A GPS/GIS method for travel 40

mode detection in New York City, Computers, Environment and Urban Systems, vol. 41

36, no. 2, pp. 131-9. 42

20. Ohmori, N, Nakazato, M & Harata, N (2005), GPS mobile phone-based activity diary 43

survey, in Proceedings of the Eastern Asia Society for Transportation Studies, vol. 5, 44

pp. 1104-15. 45

21. Itsubo, S & Hato, E (2006), A Study of the Effectiveness of a Household Travel Survey 46

Using GPS-equipped Cell Phones and a WEB Diary Through a Comparative Study with 47

a Paper-based Travel Survey, in. 48

22. Feng, T., & Timmermans, H.J.P. (2013). Transportation mode recognition using GPS 49

and accelerometer. Transportation Research Part C. 37 (2013), p.p. 118-130. 50


23. Stenneth, L., Wolfson, O., Yu, P.S., Xu, B. (2012). Transportation Mode Detection 1

using Mobile Phones and GIS Information. GIS: Proceedings of the ACM International 2

Symposium on Advances in Geographic Information Systems, p.p. 54-63. 3

24. Chung, E., & Shalaby, A. (2005). A trip reconstruction tool for GPS-based personal 4

travel surveys. Transportation Planning and Technology, 28(5), p.p.381–401. 5

25. Liao, L., Fox, D., & Kautz, H. (2007). Extracting places and activities from GPS traces 6

using hierarchical conditional random fields. The International Journal of Robotics 7

Research, 26(1), 119. 8

26. Stopher, P., Clifford, E, Zhang, J., & FitzGerald, C. (2008a). Deducing mode and 9

purpose from GPS data. In The 87th annual meeting of the transportation research 10

board, Washington, DC, United States. 11

27. Manzoni, V., Maniloff, D., Kloeckl, K., & Ratti, C. (2011). Transportation mode 12

identification and real-time co2 emission estimation using smartphones, SENSEable 13

City Lab. Massachusetts Institute of Technology, Cambridge, Massachusetts, USA, 14

Technical Report. 15

28. Miluzzo, E., Lane, N., Fodor, K., Peterson, R., Lu, H., Musolesi, M., Eisenman, S., 16

Zheng, X., & Campbell, A. (2008). Sensing meets mobile social networks: The design 17

implementation and evaluation of the cenceMe application. In Proceedings of the 6th 18

ACM Conference on Embedded Network Sensor Systems. ACM, New York. P.p. 337–19

350. 20

29. Anderson, I., & Muller, H. (2006). Practical activity recognition using GSM data. 21

Technical Report CSTR-06-016, Department of Computer Science, University of 22

Bristol. 23

30. Sohn, T., Varshavsky, A., Lamarca, A., Chein, M., Choudhury, T., Smith, I., Consolvo, 24

S.,Hightower, J., Grisworld, W., & Delara, E. (2006). Mobility detection using every 25

day GSM traces. In Lecture Notes in Computer Science, vol. 4206, Springer-Verlag, 26

Berlin, Germany, 212. 27

31. Zhang, L., Dalyot, S., Eggert, D., & Sester, M. (2011). Multi-stage approach to travel-28

mode segmentation and classification of GPS traces. ISPRS workshop on geospatial 29

data infrastructure: From data acquisition and updating to smarter services, pp. 87-93. 30

32. Xu, C., Ji, M., & Zhang, Z. (2010). Identifying Travel Mode from GPS Trajectories 31

through Fuzzy Pattern Recognition. Seventh International Conference on Fuzzy 32

systems and Knowledge Discovery (FSKD 2010). 978-1-4244-5934-6/10/$26.00 33

©2010 IEEE 34

33. Zheng, Y., Liu, L., Wang, L., & Xie, X. (2008) Learning Transportation Mode from 35

Raw GPS Data for Geographic Applications on the Web. 2008, April 21-25, 2008, 36

Beijing, China. ACM 978-1-60558-085-2/08/04. 37

34. Gonzalez, P.A., Weinstein, J.S., Barbeau, S.J., Labrador, M.A., Winters, P.L., Georggi, 38

N.L., & Perez, R. (2008). Automating mode detection using neural networks and 39

assisted GPS data collection using GPS-enabled mobile phones. 40

35. Mun, M.Y., Estrin, D., Burke, J., Hansaen, M. (2008). Parsimonious Mobility 41

Classification using GPS and WiFi Traces. HotEmNets'08, June 2-3, 2008, 42

Charlottesville, Virginia, USA Copyright 2008 ACM ISBN 978-1-60558-209-43

2/08/0006 44

36. Mitchell, T. (1997). Machine learning. McGraw-Hill. Ortúzar, J. D., & Willumsen, L. G. 45

(2011). Modelling transport (4th ed.). London, England: John Wiley & Sons Inc.. 46

37. Dietterich, T. G. Ensemble learning. In The Handbook of Brain Theory and Neural Networks, 47

2nd ed.; Arbib, M. A., Ed.; The MIT Press: Cambridge, 2002. 48

38. Kulkarni, V.Y., & Sinha, P.K. (2013). Random Forest Classifiers: A Survey and Future 49

Research Directions. Proceedings - 2012 International Conference on Data Science and 50

Engineering, ICDSE 2012, art. no. 6282329, pp. 64-68. 51


39. Genuer, R., Poggi, J.M., & Tuleau-Malot, C. (2010). Variable selection using random forests. 1

Pattern Recognition Letter 31, 14 (2010) 2225-2236. 2

40. Kawakubo, H., & Yoshida, H. Rapid Feature Selection Based on random Forests for High-3

Dimentional Data. (2012) Expert Syst Appl, 40, pp. 6241-6252. 4

41. C. Strobl, A. Boulesteix, A. Zeileis, and T. Hothorn, Bias in random forest variable importance 5

measures: Illustrations, sources and a solution, BMC bioinformatics, vol. 8, no. 1, p. 25, 2007. 6

42. Nicodemus, K. (2011). Letter to the editor: On the stability and ranking of predictors from 7

random forest variable importance measures. Briefings in Bioinformatics. 8

43. Strobl, C., A. Boulesteix, A. Zeileis and T. Hothorn. 2007. Bias in Random Forest Variable 9

Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 25. 10

44. Strobl, C. and A. Zeileis. 2008. ‘Danger: High Power! – Exploring the Statistical Properties of 11

a Test for Random Forest Variable Importance,’ Technical Report Number 017, Department of 12

Statistics, University of Munich. 13

45. Tuv, E., Borisov, A, Runger, G., & Torkkola, K. (2009). Feature Selection with 14

Ensembles, Artificial Variables, and Redundancy Elimination. Journal of Machine 15

Learning Research 10 (2009) 1341-1366. 16

46. Hall, M. A. (2000). Correlation-based feature selection for discrete and numeric class 17

machine learning. In Proceedings of the 17th International Conference on Machine 18

Learning, pp. 359–366, 2000. 19

47. Rakotomamonjy, A. (2003). "Variable selection using SVM-based criteria. Journal of 20

Machine Learning Research 3: 1357-1370. 21

48. Nicodemus, K., J. Malley, C. Strobl and A. Ziegler (2010). The behaviour of random 22

forest permutation-based variable importance measures under predictor correlation. 23

BMC Bioinformatics 11: 110. 24

49. Archer, K., & Kimes,R. (2008). Empirical characterization of random forest variable 25

importance measures. Computational Statistics & Data Analysis, vol. 52, no. 4, pp. 26

2249–2260, 2008. 27

50. Hua, J., Xiong Z., Lowey, J., Suh, E. (2005). Dougherty ER: Optimal number of 28

features as a function of sample size for various classification rules. Bioinformatics 29

2005, 21:1509-1515. 30

51. Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R 31

News, 2(3), pp. 18-22. 32

52. Putri, S. (2013). Identification of affecting factors on the GPA of first year students at 33

bogor agricultural university using random forest. 34

35

36

Documents

1 Automated Transportation Mode Detection Using …docs.trb.org/prp/15-5826.pdf · 1 Automated Transportation Mode Detection Using Smart Phone Applications via ... This model not