The Pennsylvania State University

The Graduate School

Department of Industrial and Manufacturing Engineering

A PROPOSED DATA MINING DRIVEN METHDOLOGY FOR MODELING HUMAN

GAIT AND GEOSPATIAL TRAJECTORIES

A Thesis in

Industrial Engineering

by

Yixiang Han

© 2013 Yixiang Han

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Master of Science

August 2013

The thesis of Yixiang Han was reviewed and approved* by the following:

Conrad S. Tucker

Assistant Professor of Industrial Engineering

Thesis Advisor

Timothy W. Simpson

Professor of Industrial Engineering

Paul Griffin

Professor of Industrial Engineering

Head of the Department of Industrial Engineering

*Signatures are on file in the Graduate School

ABSTRACT

Less than 35% of human communication is verbal (hearing, listening, etc.), whereas

greater than 65% of human communication is nonverbal (body posture, facial expressions, etc.).

By analyzing nonverbal human communication instead of just verbal communication, researchers

may be able to perceive latent human features such as body language, neurological patterns, etc.,

otherwise missed through verbal communication alone. In this thesis, human kinematics (i.e.,

human gait and geospatial trajectory) is modeled and analyzed so as to perceive and predict

human behavior and kinematic patterns. A data mining driven methodology is proposed for

modeling and predicting both human gait (i.e., human walking posture) and human geospatial

trajectory (i.e., a sequence of geospatial locations from a moving individual in an indoor space).

The human gait mining component of the proposed methodology captures multimodal gait data in

order to model and predict neurological patterns that influence human gaits. The human trajectory

mining component of the methodology aims to predict common regions of interest (CRI) in

indoor design spaces by modeling geospatial trajectory patterns. A Parkinson’s disease (PD)

detection case study is used to validate the human gait component of the methodology, and an

engineering design case study involving students working in teams is used to validate the human

trajectory methodology. Analyzing human gait and geospatial trajectories helps reduce the effect of

human variation and recognize patterns of interest in both, so as to

evaluate human movement characteristics and understand human movement dynamics.

TABLE OF CONTENTS

List of Figures

List of Tables

Acknowledgements

Chapter 1 Introduction

Chapter 2 Literature Review

2.1 Existing Techniques for Modeling Human Movement

2.1.1 Existing Human Gait Modeling

2.1.2 Existing Human Geospatial Trajectory Modeling

2.2 Data Mining based Human Movement Modeling

2.2.1 Data Mining based Human Gait Modeling

2.2.2 Data Mining based Human Geospatial Trajectory Modeling

Chapter 3 Methodology

3.1 Human Gait Modeling Methodology

3.1.1 Step 1: Sensor Data Acquisition

3.1.2 Step 2: Data Preprocessing

3.1.3 Step 3: Data Mining Knowledge Discovery

3.1.4 Step 4: Model Performance Evaluation and Application

3.2 Geospatial Trajectory based Human Motion Modeling Methodology

3.2.1 Step 1: Data Acquisition

3.2.2 Step 2: Data Transfer

3.2.3 Step 3: Data Mining Knowledge Discovery

3.2.4 Step 4: Model Visualization

Chapter 4 Case Studies and Discussion

4.1 Parkinson’s disease detection based Case Study

4.1.1 PD Data Acquisition and Preprocessing

4.1.2 PD-based Data Mining Knowledge Discovery and Evaluation

4.2 Geospatial Trajectory Clustering

4.2.1 Geospatial Trajectory Data Acquisition and Preprocessing

4.2.2 Geospatial Trajectory based Knowledge Discovery and Explanation

Chapter 5 Conclusions and Future Work

References

List of Figures

Figure 3-1. Framework of the proposed human gait based methodology.

Figure 3-2. Skeletal image with 20 nodes and example data from Shoulder_Center node.

Figure 3-3. Framework of geospatial trajectory modeling.

Figure 4-1. PD forward walking experiment overhead view.

Figure 4-2. The learning factory layout.

Figure 4-3. Extracted characteristic points of User 1.

Figure 4-4. Visualization of trajectory partitioning for User 1.

Figure 4-5. Clustering visualization.

Figure 4-6. Clustering visualizations in the first period.

Figure 4-7. Clustering visualizations in the second period.

Figure 4-8. Clustering visualizations in the third period.

List of Tables

Table 3-1. Confusion matrix example.

Table 4-1. Algorithms performances in walking experiment.

Table 4-2. Other evaluations among multiple algorithms in walking experiment.

Table 4-3. Original trajectory of User 1.

Table 4-4. Example result based on clustering algorithm.

Table 4-5. Result of clustering algorithm.

Acknowledgements

It is my pleasure to thank everyone that helped make my thesis possible.

I would like to express the deepest appreciation to my advisor, Dr. Conrad S. Tucker. He

patiently provided the guidance, motivation, remarks and useful comments for me to proceed

through not just my Master's study and the learning process of this master's thesis but my overall

academic and professional career as well. He has shaped my growth and development regarding

my research and scholarship. Without his tremendous mentorship and persistent help this thesis

would not be possible.

I would also like to offer my special thanks to my thesis committee member, Dr. Timothy

W. Simpson, for his guidance, encouragement, insightful comments, and immensely helpful

suggestions. His guidance has served me well and I owe him my heartfelt appreciation for taking

the time to advise me through the process of writing this thesis.

I thank my fellow lab mates in the Design Analysis Technology Advancement

(D.A.T.A.) Lab in the School of Engineering Design, Technology and Professional Programs for

their great support and enlightenment. Their friendship and assistance have meant more to me than

I can express in words. Thank you all for your patience and friendly assistance.

It has been a great pleasure to be a student in the Harold and Inge Marcus Department of

Industrial and Manufacturing Engineering at the Pennsylvania State University at University

Park. I deeply thank Dr. Paul Griffin, the department head, Dr. M. Jeya Chandra, the graduate

program coordinator, and all other members of the department and the university.

Last but not least, I would like to thank my family, Xiaokao Han and Fengmei Li, for

giving birth to me and supporting me throughout my life. I love you all my life.

Chapter 1

Introduction

Research has shown that more than 65% of human communication is non-verbal (e.g.,

posture, gesture) while about 35% is considered verbal (e.g., speech, discussion) [1]. The verbal

human communication component conveys a large volume of information, but may miss latent

aspects such as body posture, facial gestures, etc. that may provide researchers with added

dimensions of knowledge. Within research pertaining to nonverbal communication, human

movement behavior modeling is gaining significant interest across research domains ranging

from public security surveillance to human movement-based disease diagnosis [2–4]. By

analyzing human motion, researchers are able to compare and evaluate human movement

characteristics in order to capture movement patterns and understand human dynamics.

The objectives of analyzing human gait behavior and geospatial trajectory behavior are

both important and complementary. Human gait is defined as the act of self-propulsion achieved

by using the human extremities [5]. Human geospatial trajectory is defined as a sequence of geospatial

locations from a moving individual in order to recover human motion in a given space [6].

Human gait behavior analysis focuses on human motion analysis including human body segments

(e.g., posture detection) while geospatial trajectory behavior analysis focuses on human trajectory

movement analysis in a given space without considering specific parts of the human body

structure.

There is already a wide spectrum of applications based on human gait modeling such as

athletic performance evaluation, medical diagnosis, public security surveillance and video

conferencing [7-8]. Analyzing human gait behaviors helps capture and recognize human

movement characteristics and interesting gait patterns that could be taken as evidence for

different targets and applications. For example, potential Parkinson’s disease (PD) patients are

typically diagnosed from specific symptoms such as muscle rigidity, vocal problems, and gait

disorders, assessed through several kinematic experiments based on published criteria [8–10]. Another

example is that swimmers analyze video tapes of other swimmers in order to learn about

performance indicators such as basic speed, stroke mechanics, and starting and turning abilities, which

could be helpful in their personal training [12-13]. Other similar case studies could be found in

other domains such as tennis, basketball and airport security surveillance [13–15].

In addition to human gait analysis, human geospatial trajectory methodologies focus on

human movement within a given space to capture geospatial position, velocity, time, acceleration,

etc., without considering human body segments (i.e., the human body is considered as one point, and

different segments are assumed to have the same movement status). The objective is to detect

geospatial movement patterns such as the trajectory shape, common regions of interest

(CRI) and density of multiple trajectories in a specific setting. For example, researchers have

proposed methodologies to detect overcrowded situations in an indoor space (e.g., shopping malls,

career fairs, railway stations, etc.) so as to provide event alarms and better reorganized layouts

[16]. Furthermore, this tracking strategy could also contribute to traffic control in order to

relocate different facilities and promote user experience [17]. Other similar examples can be

found in [19-20].

The aforementioned methodologies for modeling human gait and geospatial trajectories

usually rely on one of several human body models, which range from stick

figures and ribbon-based 2-D contours to 3-D volumetric human models. However, there are some

limitations that must be addressed [2], [20–22]. First, the standard 2-D and 3-D models do not

account for variation between individuals (e.g., body size). For example, different

individuals may have different heights, weights, and other parameters which could lead to

variations in human motion modeling and consequently, affect the predictive accuracy. In this

thesis, the human variation is addressed by introducing ratio components for position, velocity,

and acceleration between each pair of joints. In addition, joint correspondence (i.e., identifying

every joint in successive frames) is required in certain 2-D and 3-D models, which may restrict

the modeling flexibility and make it only applicable to some motion types (e.g., simple walking).

In this thesis, each joint in the 3-D model is detected automatically by using a multimodal sensor.

In order to mitigate these challenges, a data mining driven methodology is proposed to model

human gait and geospatial trajectory in order to normalize and categorize various human gaits and

understand utilization density in an indoor space.

The rest of this thesis is organized as follows. This chapter provides an introduction and

background relating to human motion analysis. Chapter 2 describes previous work related to the

research topic, discusses pros and cons of these methodologies, and contrasts them to the

methodology proposed in this thesis. The human gait and geospatial trajectory components of the

methodology are introduced in Chapter 3 with results and discussions presented in Chapter 4.

Chapter 5 concludes the thesis and identifies future research expansions.

Chapter 2

Literature Review

This chapter discusses the past research that is closely related to the topic in this thesis.

The literature review begins by discussing the performances of various existing human motion

modeling methodologies in section 2.1. Data mining based human motion modeling

methodologies are then discussed in section 2.2. As part of this section, various classification

algorithms that are most relevant to this study topic are discussed and compared with emphasis on

extracting significant motion features in both human gait and trajectory classification and

prediction.

2.1 Existing Techniques for Modeling Human Movement

Existing methodologies proposed to model human movement focus on human motion

tracking without system-based automatic motion recognition and classification (e.g., public

security surveillance systems) [23–25]. For example, existing passive surveillance systems (e.g.,

Closed Circuit Television (CCTV) cameras) can only track human movements and require

well-trained camera operators to manually view the video feed so as to recognize any suspicious act. In

this section, multiple existing modeling methodologies are discussed relating to both human gait

modeling (see Section 2.1.1) and geospatial trajectory modeling (see Section 2.1.2).

2.1.1 Existing Human Gait Modeling

One approach to modeling human gait motion is based on 2-D human stick figures, which

consist of multiple joints or nodes connected by multiple line segments. This “skeletal”

representation of the human body could be a significant aid to help track and estimate human gait.

One example is to model human gait based on moving light display (MLD) [23]. In this

methodology, the human body kinematics is modeled using 12 MLD lights representing the head,

shoulders, hips, elbows, wrists, knees, and ankles. This MLD-based model can help translate 3-D

human gait into 2-D projections during different human motion experiments. Other similar

examples can be found in [25], [27-28]. However, the joint correspondence required in human

gait modeling is the most challenging and complex part since each joint requires node-to-node

correspondence between successive frames. In addition, the 2-D projection provided by these

methodologies cannot provide depth data (i.e., only X and Y) for each joint and is not capable of

describing real-world 3-D human movements. Depth data is needed for more accurate 3-D

modeling. These shortcomings are addressed in the proposed methodology by collecting 3-D joint

data (i.e., X, Y and Z coordinates) with a multimodal sensor and storing them

in a database (the data structure is discussed in more detail in Chapter 3). Once stored in a

database, researchers are able to extract information or run predictive models on this data, thus

making it possible to model and predict human gait, as demonstrated later in this thesis.

Another approach to tracking and recognizing human motion is based on the application

of 2-D contour modeling. The objective in this methodology is to model human gait by adding

human outlines, which is more precise than just applying “skeletal” models in previously

discussed approaches [27-28]. For example, human gait could be modeled based on a ribbon-

based 2-D model without putting markers on the human body [28]. In this model, there are eight

joints included in the 2-D model, which represent the shoulders, elbows, hips, and knees.

By comparing the difference of moving ribbons between two frames, the moving ribbons could

be extracted. The resulting parameter curves recover motion characteristics in different body parts.

Other similar 2-D ribbon-based examples could be found in [26,30]. However, human size

variation during gait analysis is not considered. In addition, there are body structure constraints

that may restrict the modeling flexibility and make it only applicable to some simple motion types

(e.g., simple walking). In order to reduce human size variation and model other human motion

types, the proposed data mining driven methodology is introduced in Chapter 3.

Compared with the previous 2-D human gait modeling, 3-D modeling has several

advantages, such as viewing each joint independently based on the 3-D angle parameters and

modeling other complex and unconstrained human movements [30]. There are usually two

components included in 3-D models, classified as “skeletal” figures and surrounding tissues [30].

For instance, a 22-DOF model is applied to construct the skeleton of the human body (e.g., arm,

leg and torso). Cylinders, spheres and other primitives are then applied to generate the

3-D model [31]. However, the 3-D model is still incapable of addressing human gait variations

since the parameters in the 3-D model are fixed. In addition, the included body constraints may

reduce the model's flexibility for complex motions beyond simple walking. Other

similar 3-D models could be found in [32–34].

2.1.2 Existing Human Geospatial Trajectory Modeling

Human trajectory analysis addresses geospatial positions without considering the human

body structure. Some studies have considered qualitative methodologies to collect geospatial

trajectory patterns such as visual observations, interviews, and questionnaires [35], [37-38]. For

instance, human geospatial trajectory analysis has been conducted based on a questionnaire at Osaka

Science Museum in Japan [35]. In the experiment, each volunteer was asked to fill out a

questionnaire after touring about their interactions with different robots. By analyzing popular

types of robots and the amount of time spent on each one based on the feedback provided in the

questionnaires, researchers were able to discover that there was no preference between males and

females and no correlation between the age of visitors and the popularity of robots. However,

volunteers introduce subjective biases, since different people may give different feedback on the

same robot. In addition, this methodology requires a lot of time for human collaboration. Finally,

there could be potential privacy problems during the data collection process. For example,

confidential information such as name, age and phone number might be collected and accessed.

Other examples can be found in [37-38].

Other approaches to modeling human geospatial movement are based on technologies such

as Bluetooth, Wi-Fi, GPS and video camera recording, which can record precise time, location

and trajectory data. For example, a Bluetooth sensor was installed in one of the busiest regions

inside the Louvre museum in order to record the number of visitors and collect geospatial

position with corresponding time [38]. Finally, researchers concluded that there are strong

connections between Samothrace and Hall access, and between Hall access and big gallery,

which could help explain the most frequent trajectory patterns [34]. In their work, visitors’

trajectories could be clearly described, and correlations between different nodes could also be

discovered. However, the main limitation is that the busiest nodes are predetermined by officers

(i.e., domain knowledge is included). In addition, this methodology may only be able to

model one-directional trajectories, while other types (e.g., circular or back-and-forth

trajectories) may not be supported. Other similar examples could be found in [40-41].

2.2 Data Mining based Human Movement Modeling

Instead of just tracking and modeling individual gait and geospatial trajectories, high-

level objectives about comparing and recognizing both different human gait features and

geospatial movement patterns could provide additional insights and discoveries. The

methodology aims to generate clusters for similar features that would finally provide reliable

guidance to recognize human activities for new untested people. In order to achieve this goal,

data mining based methodologies are discussed in this section for both human gait analysis and

human geospatial trajectory analysis.

2.2.1 Data Mining based Human Gait Modeling

Charayaphan and Marble proposed a data mining based methodology to detect hand

motion and classify different hand signs [41]. In the methodology, a frame grabber with an IBM PC

was applied to help extract multiple frames of hand motions without applying any 2-D or 3-D

models. The hand detection is accomplished by comparing the grey scale difference between two

successive frames followed by hand sign classification based on stop position. Another similar

example can be found in [42]. Polana and Nelson proposed a methodology that does not require

joint correspondence or tracking of specific parts of an individual, as mentioned in Section 2.1 [43].

Instead, human motion is tracked from the moving pixels followed by spatial translation where

the image frames are reduced to the same size as the object. Finally, the spatial

gray-valued frame set generated at each time point t is taken as the feature vector and compared with

the reference motion set based on the K nearest neighbors (KNN) algorithm [43]. Heisele

proposed another methodology to model human gait based on Color Cluster Flow (CCF) [44]. In

this methodology, the pedestrian is represented by two initial color-based clusters (i.e., lilac

jacket and blue pants). The trajectories of these two clusters are considered as the approximation

of real human motion. Comparing the cluster trajectories to the reference trajectories helps

identify the real human motion. However, the main limitation of these methodologies is that they

can only be applied to periodic and simple motion types (e.g., parallel walking), and have lower

predictive accuracies in complex motions such as rotation motion of arms or legs. In addition,

there is usually a predefined reference motion set required before the motion modeling.
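
The nearest-neighbor comparison against a reference motion set described above can be illustrated with a minimal sketch. It is not the implementation used in [43]: the two-dimensional feature vectors, the motion labels, the Euclidean distance and the value of k are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def knn_classify(query, references, k=3):
    """Label a query feature vector by majority vote of its k nearest references.

    references is a list of (feature_vector, label) pairs; Euclidean distance
    is an assumption, not necessarily the metric used in the cited work.
    """
    nearest = sorted(references, key=lambda ref: math.dist(query, ref[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical reference motion set: (feature vector, motion label).
reference_set = [
    ((0.0, 0.0), "walk"),
    ((0.1, 0.2), "walk"),
    ((1.0, 1.0), "run"),
    ((0.9, 1.1), "run"),
    ((1.2, 0.8), "run"),
]

print(knn_classify((0.2, 0.1), reference_set, k=3))
```

The sketch also makes the limitation noted above concrete: a labeled reference set has to exist before any new motion can be classified.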

2.2.2 Data Mining based Human Geospatial Trajectory Modeling

Johnson [45] proposed a methodology to model human trajectory based on probability

density function (pdf) modeling. In the methodology, the human trajectory is described in a

sequence of flow vectors denoted by $Q = \{f_1, f_2, \ldots, f_n\}$, where n is the total number of images

captured for one subject. Then a learning network is applied to model the pdf to classify n input

data nodes into k output nodes (k and n are predetermined parameters) based on nearest distance.

Fu [46] models and predicts human geospatial trajectories by measuring the similarity between two

individual trajectories based on a similarity matrix. Then a two-layer clustering algorithm is

employed where the dominant paths and routes are generated in the first layer. Then the Tightness

& Separation Criterion (TSC) is applied to quantitatively evaluate the clustering results.

However, the main shortcoming of these methodologies is that they only deal with trajectory

clustering without detecting common regions of interest (CRI) to help explain the different

motivations behind each individual’s activities. In addition, they may only perform well on

single-directional trajectories and cannot be applied in other cases such as circular or rectangular

trajectories.
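
To make the idea of a trajectory similarity matrix concrete, the following is a rough sketch under stated assumptions. The similarity measure used in [46] is not given here, so the mean point-wise Euclidean distance and its mapping to a similarity in (0, 1] are stand-ins, and the trajectories are assumed to have been resampled to the same number of points.

```python
import math


def trajectory_distance(a, b):
    """Mean Euclidean distance between corresponding points of two trajectories.

    Assumes both trajectories were resampled to the same number of points.
    """
    if len(a) != len(b):
        raise ValueError("trajectories must have the same number of points")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def similarity_matrix(trajectories):
    """Pairwise similarity 1 / (1 + distance); 1.0 means identical paths."""
    n = len(trajectories)
    return [
        [1.0 / (1.0 + trajectory_distance(trajectories[i], trajectories[j]))
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical (x, y) trajectories of three individuals.
paths = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
]

for row in similarity_matrix(paths):
    print(row)
```

A matrix of this kind only says which trajectories resemble each other; it does not locate regions that many individuals visit, which is the gap the CRI detection in this thesis targets.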

Having reviewed the research problem, motivation, previous related methodologies and

corresponding pros and cons, a data mining driven methodology is proposed to overcome the

aforementioned limitations in both human gait analysis and human geospatial

trajectory analysis and extract significant movement patterns to compare and understand human

gait and geospatial activities. This methodology is introduced in the next chapter.

Chapter 3

Methodology

Based on the background and explanation established, the proposed data mining driven

methodology for modeling human gait and geospatial trajectories is presented in detail in this

chapter. The proposed methodology aims to overcome human motion variation by adding the

ratios of position, velocity and acceleration between each pair of joints. There are four steps

included in the methodology. In the first step, data acquisition is conducted in order to collect

human gait and geospatial trajectory data. In the second step, a data preprocessing technique is

proposed for data cleaning and transfer. In Step 3, data mining algorithms are employed to

model and extract human movement features so as to explore common movement patterns. Step 4

of the proposed methodology outlines a validation and evaluation framework that helps determine

the robustness of the proposed methodology. The human gait based methodology is discussed in

Section 3.1 followed by the human geospatial trajectory based methodology in Section 3.2.

3.1 Human Gait Modeling Methodology

The human gait modeling methodology aims to capture multimodal gait data in order to

model and predict neurological patterns that influence human gait. The Human Gait Modeling

component of the methodology is partitioned into four steps: data acquisition (Step 1), data

preprocessing (Step 2), data mining knowledge discovery (Step 3), and model evaluation and

application (Step 4), as shown in Fig. 3-1. Step 1 discusses how to set up experiments and collect human

gait data based on the multimodal sensor employed in this work. Step 2 discusses how to

preprocess and store the collected data on a server. Step 3 discusses how to extract correlated

features from the generated data set and apply these features to model, recognize and evaluate

human gait patterns. In Step 4, the trained model could be evaluated and extended into different

domains to help recognize and compare human gait patterns. For example, Step 4 in Fig. 3-1

could apply the proposed human gait modeling to human movement related disease detection (e.g.,

Parkinson’s disease detection). Similar applications would be threat detection and athletic

performance evaluation.

Figure 3-1. Framework of the proposed human gait based methodology.

3.1.1 Step 1: Sensor Data Acquisition

The first step of human gait modeling outlines the experiment setup for collecting human

gait data. In this step, the overall body movement (i.e., human gait) is captured through a

multimodal sensor system including an RGB video camera and an infrared depth sensor. In the sensor

data acquisition, the human body gait is modeled and captured based on the “skeletal” model

shown in Fig. 3-2, where a total of twenty joints are represented by the black circles. Compared with

the modeling methodologies mentioned in the Literature Review, this multimodal sensor is able

to automatically recognize each of the twenty joints without placing sensors on the human body. In

addition, position, velocity and acceleration ratios are created between each pair of joints to

help normalize the shape variation existing in a population. By utilizing the multimodal sensor, this

virtual skeletal model is able to capture movements of joints in a 3-D environment (i.e., X, Y and Z

coordinates) in real time while preserving privacy, which is sometimes a desirable feature in

human gait modeling (e.g., human gait based Parkinson’s disease detection).

In this research study, the Microsoft Kinect sensor, which tracks human motion using a virtual skeletal image similar to the one shown in Fig. 3-2, is used for data collection. The hardware captures one frame of human gait approximately every 33 ms and generates about 3 MB of data every 4 s.


Figure 3-2. Skeletal image with 20 nodes and example data from Shoulder_Center node.

3.1.2 Step 2: Data Preprocessing

In addition to the initial X, Y, and Z position data, the velocity and acceleration of each joint are calculated by taking the time derivatives of position and velocity, respectively, and are added to the raw data as additional features. To reduce inter-subject gait variation (e.g., longer legs tend to produce a longer stride), the ratios of position, velocity, and acceleration between each pair of joints are also generated.
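The derivative and ratio features described above could be computed along the following lines. This is only a minimal sketch: the function names, the forward-difference scheme, the zero-denominator guard, and the choice to form ratios only along the same axis are all assumptions made for illustration, not details taken from the thesis.

```python
from itertools import combinations


def derivative(series, dt):
    """Forward finite difference; the result is one sample shorter than the input."""
    return [(b - a) / dt for a, b in zip(series, series[1:])]


def safe_ratio(a, b, eps=1e-9):
    """a / b, or 0.0 when the denominator is (near) zero; this guard is an assumption."""
    return a / b if abs(b) > eps else 0.0


def build_features(positions, dt=0.033):
    """positions maps '<joint>_<axis>' (e.g. 'Head_Y') to a list of per-frame coordinates."""
    features = {}
    for name, pos in positions.items():
        vel = derivative(pos, dt)
        acc = derivative(vel, dt)
        n = len(acc)  # trim every series to the common (shortest) length
        features[f"{name}_pos"] = pos[:n]
        features[f"{name}_vel"] = vel[:n]
        features[f"{name}_acc"] = acc
    for a, b in combinations(sorted(positions), 2):
        if a.rsplit("_", 1)[1] != b.rsplit("_", 1)[1]:
            continue  # assumption: ratios are only formed along the same axis
        for kind in ("pos", "vel", "acc"):
            fa, fb = features[f"{a}_{kind}"], features[f"{b}_{kind}"]
            features[f"{a}/{b}_{kind}"] = [safe_ratio(x, y) for x, y in zip(fa, fb)]
    return features
```

The default `dt` of 0.033 s mirrors the approximate frame interval stated in Step 1; any other sampling interval can be passed explicitly.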

Since not all of the generated features are expected to have the same predictive power for the response variable, only the features most relevant to the response variable should be selected in order to capture and distinguish human gait patterns, which leads to feature space reduction [52]. Since multiple data mining algorithms are applied in this study, a candidate feature selection algorithm should be independent of the


multiple classifiers while maintaining good performance [52]. In this thesis, Correlation-based Feature Selection (CFS) is used, which selects the feature set that is most relevant to the output variable while keeping the correlation among the selected features to a minimum [48]. In the CFS methodology, the correlation between a candidate feature set and the outside (output) variable is a function of the number of features in the set, the average inter-correlation among the features inside the set, and the average correlation between the features inside the set and the outside variable, as shown in Equation 3.1. More technical details can be found in [48].

$$r_{zc} = \frac{k\,\overline{r_{zi}}}{\sqrt{k + k(k-1)\,\overline{r_{ii}}}}$$ (Equation 3.1)

where,

$r_{zc}$: is the correlation between the candidate feature set and the outside (output) variable.

k: is the number of features in the set.

$\overline{r_{zi}}$: is the average correlation between the features in the set and the outside (output) variable.

$\overline{r_{ii}}$: is the average inter-correlation between pairs of features in the set.
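As an illustration of how the merit in Equation 3.1 behaves, the sketch below scores a candidate feature subset. It assumes Pearson correlation as the correlation measure, which is a simplification chosen for this example; the measure used in [48] may differ, and the function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def cfs_merit(features, target):
    """Equation 3.1 for a candidate subset given as a list of feature columns."""
    k = len(features)
    # Average feature-to-output correlation (numerator term).
    r_zi = sum(abs(pearson(f, target)) for f in features) / k
    # Average feature-to-feature correlation (redundancy term).
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ii = (sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
            if pairs else 0.0)
    return k * r_zi / math.sqrt(k + k * (k - 1) * r_ii)
```

Adding a feature that is uncorrelated with the output lowers the merit, and adding an exact duplicate of a selected feature does not raise it, which is the behavior the selection criterion relies on.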

3.1.3 Step 3: Data Mining Knowledge Discovery

After data preprocessing is completed, the aim is to develop a function $f(X) \rightarrow Y$ that maps the selected features $X = (x_1, x_2, \ldots, x_n)$, where n is the total number of features selected, to the class variable Y. From a theoretical point of view, there are two types of data mining learning methodologies: supervised learning and unsupervised learning. Supervised learning is


the machine learning task of inferring a function from labeled training data, while unsupervised learning refers to the problem of finding hidden structure in unlabeled data [49]. Since the observations are labeled in human gait modeling, supervised learning is selected. In addition, multiple data mining algorithms, including Binary Logistic Regression, Support Vector Machine, C4.5, Random Forest, and IBK, are employed since they have been shown to perform well in human gait classification problems [50–55]. Based on the performance of the different algorithms, the most accurate and reliable model and partitioning criteria are identified.

Binary Logistic Regression

In Binary Logistic Regression, each selected input variable is given a coefficient in order to formulate a function that maps the input variables to the output variable. Here the coefficients can be considered indicators of predictive power. The equations are shown in Eq. 3.2 and Eq. 3.3. Given the feature values of a new observation as input, the model estimates the probability that the observation falls into one category. For example, in the Parkinson’s disease (PD) case study presented in Chapter 4, one category is PD patients and the other is controls. More information about logistic regression can be found in [50][51]. Such a linear model may help with the classification problem; however, its accuracy is sometimes inconsistent since a linear combination of features may not be able to explain all the variation in the response variable. In view of these limitations, the support vector machine is introduced.

$$f(x) = \beta^{T} x$$ (Equation 3.2)

$$\beta^{*} = \arg\min_{\beta} \sum_{i=1}^{n} \left( f(x_i) - y \right)^2$$ (Equation 3.3)


where,

β: is the vector of coefficients, and f(x) is the linear predictor used by the logistic model.

$x_i$: is the value of the ith feature for one observation.

y: is the value of the output variable.
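To make the probability estimate concrete, the following sketch fits a binary logistic model by batch gradient descent. Note the assumptions: it passes the linear predictor of Eq. 3.2 through the standard sigmoid function and minimizes the log-loss, which is the usual fitting criterion for logistic regression and differs from the squared-error form written in Eq. 3.3; the learning rate, epoch count, and function names are illustrative only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit coefficients by batch gradient descent on the log-loss.

    X: list of feature vectors, y: list of 0/1 labels.
    A leading 1.0 is prepended to each row so that beta[0] is the intercept.
    """
    rows = [[1.0] + list(x) for x in X]
    beta = [0.0] * len(rows[0])
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for row, label in zip(rows, y):
            err = sigmoid(sum(b * v for b, v in zip(beta, row))) - label
            for j, v in enumerate(row):
                grad[j] += err * v
        beta = [b - lr * g / len(rows) for b, g in zip(beta, grad)]
    return beta


def predict_proba(beta, x):
    """Estimated probability that x belongs to the positive class (e.g., PD)."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))
```

An observation would then be assigned to the positive class when `predict_proba` exceeds a chosen threshold such as 0.5.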

Support Vector Machine (SVM)

In addition to the logistic regression model, SVM is another available classifier, which works by maximizing the margin between two classes. In contrast to other data mining algorithms, SVM pays the most attention to the observations that lie close to the partition boundary between the classes, and it generates the separating boundary based on the function shown in Eq. 3.4. In practice, SVMs are made robust by adding “slack variables” that allow the training error to be non-zero. In addition, SVM can transform the data into a higher-dimensional space through a kernel function and construct the decision boundary there. Specific technical details can be found in [52][53]. SVM may achieve higher accuracy than logistic regression; however, it is sometimes difficult to interpret the kernel function and the results of the algorithm, since SVM is a non-parametric technique that lacks transparency and the kernel function cannot be represented as a simple parametric function of the input variables [56]. In order to overcome these limitations while maintaining modeling accuracy and robustness, the C4.5 decision tree algorithm is discussed.

$$f(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} w_i x_i + b$$ (Equation 3.4)

where,


$w_i$: is the coefficient (weight) of the ith feature, and f(x) is the decision function.

b: is the bias (offset) term of the separating boundary.

$x_i$: is the value of the ith feature for one observation.
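The linear decision function of Eq. 3.4 and the role of the slack variables can be sketched as follows. This is a simplified illustration, assuming a linear (no kernel) soft-margin SVM trained by sub-gradient descent on the hinge loss with labels coded as +1 and -1; the hyperparameters and function names are assumptions and do not describe the solver used in the thesis.

```python
def decision(w, b, x):
    """Eq. 3.4: weighted sum of the features plus the offset b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on lam/2 * ||w||^2 + mean(max(0, 1 - y_i * f(x_i)))."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        gw, gb = [lam * wi for wi in w], 0.0
        for x, label in zip(X, y):
            if label * decision(w, b, x) < 1:  # inside the margin or misclassified
                gw = [g - label * xi / n for g, xi in zip(gw, x)]
                gb -= label / n
        w = [wi - lr * g for wi, g in zip(w, gw)]
        b -= lr * gb
    return w, b


def slack(w, b, x, label):
    """Slack of one observation: zero when it lies on the correct side, outside the margin."""
    return max(0.0, 1.0 - label * decision(w, b, x))
```

Only observations with non-zero slack contribute to the update, which reflects the statement that observations near the boundary receive the most attention.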

C4.5

C4.5 is a well-established classification algorithm developed by Quinlan, extending the ID3 algorithm he proposed in 1986 [54][55]. C4.5 is often employed to classify patterns in binary classification problems [54][55]. The algorithm comprises two main steps: (1) best attribute evaluation and (2) splitting point selection. The attribute evaluation step attempts to select the most informative attribute for each subset of the training data set (the whole training data set for the root node) based on the maximum value of the gain ratio, which is calculated using Eq. 3.5 to Eq. 3.8. The splitting point selection step attempts to find the best numerical split point, i.e., the one with the minimum misclassification error, based on Eq. 3.6. More information about decision trees can be found in [50], [59-60].

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$ (Equation 3.5)

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$ (Equation 3.6)

$$Gain(A) = Info(D) - Info_A(D)$$ (Equation 3.7)

$$GR(A) = \frac{Gain(A)}{SplitInfo(A)} = \frac{Info(D) - Info_A(D)}{-\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right)}$$ (Equation 3.8)

where,


Info(D): is the expected information needed to classify a tuple in D.

D: is the data set.

m: is the total number of classes.

$p_i$: is the probability that an arbitrary sample in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.

$Info_A(D)$: is the information still needed to classify a tuple after D is split into v partitions on attribute A.

Gain(A): is the information gained by branching on attribute A.

GR(A): is the gain ratio of attribute A.

SplitInfo(A): is the split information of attribute A, i.e., the information generated by splitting D into v partitions on A.

$|D_j|$: is the number of instances in D that belong to the jth partition.
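The quantities in Eq. 3.5 to Eq. 3.8 can be computed directly from class counts. The sketch below does so for a single nominal attribute; the function names are hypothetical, and a numeric attribute would first need to be turned into two partitions by a threshold, which is not shown here.

```python
import math
from collections import Counter


def info(labels):
    """Eq. 3.5: entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Eq. 3.6 to 3.8 for one nominal attribute; values[i] is the attribute value of tuple i."""
    total = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    # Eq. 3.6: weighted entropy of the partitions induced by the attribute.
    info_a = sum(len(p) / total * info(p) for p in partitions.values())
    # Eq. 3.7: information gain.
    gain = info(labels) - info_a
    # Denominator of Eq. 3.8: split information.
    split_info = -sum(len(p) / total * math.log2(len(p) / total)
                      for p in partitions.values())
    return gain / split_info if split_info > 0 else 0.0
```

The attribute with the largest gain ratio would be chosen for the current node.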

Random Forest (RF)

Random Forest retains many benefits of decision tree classification algorithms (such as C4.5) while achieving better results through the use of bagging, random subsets of variables, and a voting scheme [57]. First, for each tree, M cases are randomly sampled with replacement from the training data set, and N features are randomly sampled to help construct the tree (M and N are predetermined parameters). Second, the sampled cases and features are used to grow a single tree, following a procedure similar to C4.5. Finally, a large number of trees are generated, and they vote for the most popular class. This entire procedure is called a random forest (RF). More details can be found in [61-62].
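The bagging, feature sampling, and voting steps can be illustrated with a deliberately small sketch. As a loudly stated simplification, each "tree" here is a one-level threshold split (a decision stump) rather than a full C4.5-style tree, and all names and default parameters are assumptions for illustration.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Stand-in for a full tree: best single threshold split over the sampled features."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_cls = Counter(left).most_common(1)[0][0]
            r_cls = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_cls for y in left) + sum(y != r_cls for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_cls, r_cls)
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def train_forest(rows, labels, n_trees=25, m_cases=None, n_features=1, seed=0):
    rng = random.Random(seed)
    m_cases = m_cases or len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in range(m_cases)]  # sample cases with replacement
        feats = rng.sample(range(len(rows[0])), n_features)       # random feature subset
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict(forest, x):
    """Each tree votes; the most popular class wins."""
    votes = Counter(l_cls if x[f] <= t else r_cls for f, t, l_cls, r_cls in forest)
    return votes.most_common(1)[0][0]
```

Here `m_cases` and `n_features` play the roles of the predetermined parameters M and N.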


IBK

The IBK classifier is an instance-based machine learning algorithm based on K-Nearest Neighbor (KNN). Instead of constructing explicit abstractions such as a logistic regression model, a decision tree model, or an SVM model, IBK compares the similarity between the observations in the training data set and the hold-out observations in the test data set. The algorithm assumes that similar instances should have similar classifications. By computing the instance similarity (shown in Eq. 3.9), IBK assigns each new instance to the class of its nearest neighbors. More information is given in [59][53].

$$Similarity(x, y) = -\sqrt{\sum_{i=1}^{n} f(x_i, y_i)} = -\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$ (Equation 3.9)

where,

$x_i$: is the value in one dimension of one observation.

$y_i$: is the value in one dimension of another observation.

n: is the total number of features in the feature space.
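A nearest-neighbor classifier of this kind can be sketched in a few lines. The example assumes the similarity is the negative Euclidean distance between two instances and uses a simple majority vote among the k most similar training instances; the function names and the default k are illustrative assumptions.

```python
import math
from collections import Counter


def similarity(x, y):
    """Negative Euclidean distance, so larger values mean more similar instances."""
    return -math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_X, train_y, query, k=1):
    """Label the query with the majority class of its k most similar training instances."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: similarity(pair[0], query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

No model is built in advance; all work happens at prediction time, which is what distinguishes instance-based learning from the other classifiers in this section.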

3.1.4 Step 4: Model Performance Evaluation and Application

After the human gait model is built, the next step is to evaluate model performance using multiple evaluation measures. Before the performance metrics are computed, k-fold cross validation is employed. In k-fold cross validation, the whole data set is randomly partitioned into k subsets (folds) of approximately equal size. In each round, one fold serves as the test data set to validate performance while the remaining k-1 folds serve as the training data set to train the model. So that each fold is used for testing exactly once, this procedure is repeated k


times, and the performance is averaged and reported using multiple evaluation measures. Based on the literature review, k is set to 10 [52][60].
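The cross-validation loop can be sketched as follows. This is a minimal illustration that assumes the number of samples is at least k; `train_fn` and `predict_fn` are hypothetical callables supplied by the caller, and the fixed random seed is only there to make the shuffle repeatable.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(X, y, train_fn, predict_fn, k=10):
    """Average accuracy over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict_fn(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)
```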

The first evaluation is based on the confusion matrix (an example is shown in Table 3-1), which contains four cells: (1) true positive, (2) false positive, (3) false negative, and (4) true negative. These values are used to compute the Correctly Classified Instances (CCI), precision, recall, F-measure, and ROC curve [61].

Table 3-1. Confusion matrix example.

                    Actual Status
Predicted Status    True                   False
True                True Positive (TP)     False Positive (FP)
False               False Negative (FN)    True Negative (TN)

The second evaluation measure, Correctly Classified Instances (CCI), gives the overall accuracy of a model across the two categories. The calculation is shown in Eq. 3.10.

$$CCI = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$$ (Equation 3.10)

The third metric, the Kappa statistic (KS) [53][62], measures the agreement between the predicted and actual classes after correcting for the agreement expected by chance. Its value ranges from -1 to 1, and a model is considered reliable when its value lies between 0.8 and 1. Lower values are interpreted as follows: KS<=0.2 (poor); 0.2<KS<=0.6 (fair); 0.6<KS<=0.8 (substantial). The calculation is shown in Eq. 3.11.

$$KS = \frac{p_0 - p_c}{1 - p_c}$$ (Equation 3.11)

where,

$p_0$: is the probability of total agreement.


$p_c$: is the probability of agreement due to chance.

Other evaluation measures, namely precision, recall, and F-measure, can also be calculated from the confusion matrix. Precision and recall are related to Type I and Type II errors, respectively: precision falls as the number of false positives (Type I errors) grows, and recall falls as the number of false negatives (Type II errors) grows. The calculations are shown in Eq. 3.12 and Eq. 3.13. For example, a precision of 0.95 means that 95% of the observations that the model classifies as positive are truly positive. F-measure is another performance indicator, defined as the harmonic mean of precision and recall. It reaches its best value at 1 and its worst value at 0. The equation is shown in Eq. 3.14.

$$precision = \frac{TP}{TP + FP}$$ (Equation 3.12)

$$recall = \frac{TP}{TP + FN}$$ (Equation 3.13)

$$F = \frac{2 \times precision \times recall}{precision + recall}$$ (Equation 3.14)
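The measures in Eq. 3.10 to Eq. 3.14 all follow from the four confusion-matrix cells, as the sketch below shows. The chance agreement is computed in the usual way for the Kappa statistic, from the row and column totals; the function name is hypothetical and the sketch assumes no denominator is zero.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return CCI (%), kappa, precision, recall and F-measure from the four cells."""
    total = tp + fp + fn + tn
    cci = 100.0 * (tp + tn) / total                                    # Eq. 3.10
    p0 = (tp + tn) / total                                             # observed agreement
    pc = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2  # chance agreement
    kappa = (p0 - pc) / (1 - pc)                                       # Eq. 3.11
    precision = tp / (tp + fp)                                         # Eq. 3.12
    recall = tp / (tp + fn)                                            # Eq. 3.13
    f_measure = 2 * precision * recall / (precision + recall)          # Eq. 3.14
    return {"cci": cci, "kappa": kappa, "precision": precision,
            "recall": recall, "f_measure": f_measure}
```

For instance, with made-up counts TP = 40, FP = 10, FN = 20, and TN = 30, the accuracy is 70% while the Kappa statistic is only 0.4, which illustrates how the chance correction can temper a seemingly good accuracy.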

The last evaluation measure is the receiver operating characteristic (ROC) curve. Since a classification model is usually applied with particular values of thresholds or parameters, the ROC curve describes model performance at different threshold values so that the best operating point can be chosen. The best operating point might be chosen so that the classifier gives the best trade-off between the cost of failing to detect positives and the cost of raising false alarms. These costs need not be equal; however, equal costs are a common assumption. Note that, under this assumption, the best place to operate the classifier is usually the point on its ROC curve that lies on a 45-degree line closest to the north-west corner (0, 1) of the ROC plot.
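One simple way to pick an operating point, sketched below, is to compute the false positive rate and true positive rate at each candidate threshold and select the point with the smallest Euclidean distance to the corner (0, 1). This distance rule is an assumption that corresponds to treating both kinds of error as equally costly; the function names and the score convention (higher score means more likely positive) are also illustrative.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, FPR, TPR) for each threshold; labels are True for positives."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(s >= t and y for s, y in zip(scores, labels))
        fp = sum(s >= t and not y for s, y in zip(scores, labels))
        points.append((t, fp / neg, tp / pos))
    return points


def best_operating_point(points):
    """Pick the point closest to the north-west corner (FPR = 0, TPR = 1)."""
    return min(points, key=lambda p: p[1] ** 2 + (1 - p[2]) ** 2)
```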


Once the human gait modeling and performance evaluation are completed, the most suitable model can be applied to detect and recognize human gait patterns and to visualize results for decision support. The main benefit of decision support is the ability to quantify and visualize the human gait results. In addition, decision support helps measure and evaluate human gait patterns based on a small subset of relevant features. Finally, this decision support may serve as a reference system for detecting any gait pattern of interest in a particular application domain.

3.2 Geospatial Trajectory based Human Motion Modeling Methodology

The motivation for analyzing human geospatial trajectories is not only to model geospatial movement patterns relative to an indoor space but also to recognize common trajectory patterns among multiple people so as to achieve objectives in different application domains (e.g., averaging indoor space utilization, maintaining crowd control, etc.). Here, a trajectory pattern can be understood as a set of regions that are of interest to different individuals. The methodology consists of four steps: data collection (Step 1), data preprocessing (Step 2), data mining knowledge discovery (Step 3), and visualization (Step 4). The framework for human trajectory modeling is shown in Fig. 3-3.


Figure 3-3. Framework of geospatial trajectory modeling.

3.2.1 Step 1: Data Acquisition

Since Step 2 involves little novel contribution, Step 1 and Step 2 are discussed together. The first step of the human trajectory based methodology is data acquisition, in which data are captured through a wireless indoor tracking system that updates each individual's geospatial location (i.e., X and Y coordinates) in real time with corresponding time stamps. By utilizing the GPS-based tracking system, the geospatial locations of each individual can be updated and treated as an approximation of the individual's geospatial trajectory. Researchers are then able to extract individual trajectory patterns and establish common trajectory patterns among multiple people. In


this study, the BuzNet Real-Time Locating System (RTLS) was used to track the trajectories of

multiple individuals in an indoor space. Once the data collection is completed, the data are stored in a database in a suitable format for subsequent steps in the data mining process.

3.2.2 Step 2: Data Transfer

Human geospatial trajectory data transfer is based on a hardware and software platform

that consists of three primary components: (1) Routers, (2) tags, and (3) Base Station. Routers are

fixed-position devices that form the wireless network infrastructure of the hardware. Tags are

wireless, battery-powered mobile devices placed on individuals in an indoor environment. The Base Station is a PC (typically Microsoft Windows-based) that is loaded with the software. When

individuals are walking around in a given indoor space, this system provides an interactive

visualization interface for the tracking of individual locations approximately every 2 minutes. At

the same time, the Base Station stores every calculated location for every tag in a database

(locally-stored or cloud-based) that can be accessed and analyzed.

3.2.3 Step 3: Data Mining Knowledge Discovery

By comparing individual trajectories among multiple people in an indoor space,

researchers can extract common trajectory patterns in order to understand and recognize how the

indoor design space is utilized. Since the TRACLUS algorithm is independent of trajectory type (e.g., dual-direction trajectories), it can extract individual movement features from different trajectories, which provides more information for trajectory clustering [63]. The methodology

contains two steps: (1) partitioning and (2) clustering. The first step attempts to capture and


recover the real trajectory based on a subset of trajectory points. The second step attempts to

group different line segments generated in the previous step so as to recognize trajectory patterns

among different people based on clusters. In this section, the individual trajectory partitioning

methodology is explained first followed by the clustering methodology.

Trajectory partitioning

We assume that the original individual trajectory can be approximated from the data collected in the previous step. Some simple trajectories (e.g., one-directional trajectories) can be clustered directly; however, most trajectories are not of this type and cannot provide insight if they are clustered directly without partitioning. The partitioning algorithm provides one approach to approximating the original trajectory, without losing much information, based on an optimal subset of characteristic points. Given an individual trajectory $T = \{t_1, t_2, \ldots, t_n\}$, the optimal characteristic points $P = \{p_1, p_2, \ldots, p_n\}$ are generated [63]. Here $t_i$ is any position point collected during data collection, and $p_j$ is any characteristic point extracted. In the partitioning algorithm, the Minimum Description Length (MDL) function is applied to evaluate each point (the equations are shown in Eq. 3.15 to Eq. 3.18). Taking the first point in the original trajectory as the starting point, if the MDL_par cost of the next point is less than or equal to its MDL_nonpar cost, the search continues until the first point that violates this requirement is found. Assuming the first point that violates the requirement is $t_k$, the preceding point $t_{k-1}$ is considered a characteristic point and is taken as the new starting point to search for the next characteristic point, until all the points in the original trajectory have been checked. Finally, the characteristic point set P is established.


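$$MDL_{par} = L(H) + L(D|H)$$ (Equation 3.15)

$$L(H) = \log_2\left(len(\overline{p_j p_{j+1}})\right)$$ (Equation 3.16)

$$L(D|H) = \sum_{k=p_j}^{p_{j+1}-1} \left[ \log_2\left(d_{\perp}(\overline{p_j p_{j+1}}, \overline{t_k t_{k+1}})\right) + \log_2\left(d_{\theta}(\overline{p_j p_{j+1}}, \overline{t_k t_{k+1}})\right) \right]$$ (Equation 3.17)

$$MDL_{nonpar} = \sum_{j=startindex}^{currentindex} \log_2\left(len(\overline{p_j p_{j+1}})\right)$$ (Equation 3.18)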

where,

$MDL_{par}$: is the MDL cost when the point is treated as a characteristic point.

$MDL_{nonpar}$: is the MDL cost when the point is not treated as a characteristic point.

$L(H)$: is the length of the hypothesis when the next location is added.

$L(D|H)$: is the length of the data given the hypothesis, measured by the distances between line segments.

$len(\overline{p_j p_{j+1}})$: is the Euclidean distance between two points.

$d_{\perp}(\overline{p_j p_{j+1}}, \overline{t_k t_{k+1}})$: is the perpendicular distance between two line segments.

$d_{\theta}(\overline{p_j p_{j+1}}, \overline{t_k t_{k+1}})$: is the angle distance between two line segments.

$p_j$: is the jth characteristic point in one trajectory.

$p_{j+1}$: is the (j+1)th characteristic point in one trajectory.

$t_k$: is the kth location point in one trajectory.

$t_{k+1}$: is the (k+1)th location point in one trajectory.
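The search for characteristic points can be sketched as below for 2-D points. Several details are assumptions made so that the example is runnable: the perpendicular and angle distances follow the usual TRACLUS definitions from [63], the longer of the two segments is used as the reference segment, and any length or distance below 1 is treated as costing zero bits to avoid taking the logarithm of zero. Function names are hypothetical.

```python
import math


def _length(s, e):
    return math.hypot(e[0] - s[0], e[1] - s[1])


def _point_line_dist(p, s, e):
    """Distance from p to the infinite line through s and e."""
    dx, dy = e[0] - s[0], e[1] - s[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return _length(p, s)
    return abs(dx * (p[1] - s[1]) - dy * (p[0] - s[0])) / norm


def d_perp(si, ei, sj, ej):
    l1, l2 = _point_line_dist(sj, si, ei), _point_line_dist(ej, si, ei)
    return 0.0 if l1 + l2 == 0 else (l1 ** 2 + l2 ** 2) / (l1 + l2)


def d_theta(si, ei, sj, ej):
    li, lj = _length(si, ei), _length(sj, ej)
    if li == 0 or lj == 0:
        return 0.0
    cos_t = ((ei[0] - si[0]) * (ej[0] - sj[0]) + (ei[1] - si[1]) * (ej[1] - sj[1])) / (li * lj)
    cos_t = max(-1.0, min(1.0, cos_t))
    return lj if cos_t < 0 else lj * math.sqrt(1 - cos_t ** 2)


def _bits(v):
    return math.log2(max(v, 1.0))  # assumption: values below 1 cost zero bits


def mdl_par(pts, i, j):
    cost = _bits(_length(pts[i], pts[j]))            # L(H)
    for k in range(i, j):                            # L(D|H)
        a, b = (pts[i], pts[j]), (pts[k], pts[k + 1])
        if _length(*a) < _length(*b):
            a, b = b, a                              # the longer segment is the reference
        cost += _bits(d_perp(*a, *b)) + _bits(d_theta(*a, *b))
    return cost


def mdl_nonpar(pts, i, j):
    return sum(_bits(_length(pts[k], pts[k + 1])) for k in range(i, j))


def characteristic_points(pts):
    cps, start, length = [pts[0]], 0, 1
    while start + length < len(pts):
        curr = start + length
        if mdl_par(pts, start, curr) > mdl_nonpar(pts, start, curr):
            cps.append(pts[curr - 1])                # previous point becomes characteristic
            start, length = curr - 1, 1
        else:
            length += 1
    cps.append(pts[-1])
    return cps
```

For an L-shaped path sampled at regular intervals, this procedure keeps only the two end points and the corner, so the trajectory is represented by two line segments.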


Trajectory clustering

By grouping the movement features of different individuals into clusters, researchers can understand the density of all the trajectories in the indoor design space in order to improve user experience. Based on the characteristic points selected in the previous section, each original individual trajectory can be approximated and represented as a sequence of line segments. In this section, the objective is to group these line segments into clusters that capture common movement patterns.

The clustering algorithm is based on the DBSCAN algorithm, which is a density-based clustering algorithm [63]. Given a set of line segments $L = \{l_1, l_2, \ldots, l_j\}$, multiple clusters $C = \{c_1, c_2, \ldots, c_k\}$ are generated, where j and k are the total numbers of line segments and clusters, respectively [63]. The methodology has two parameters: (1) ε and (2) MinLn. ε is the distance threshold that determines whether two line segments are neighbors, and MinLn is the minimum number of line segments required inside a cluster.

The algorithm contains three steps [63]. First, a queue Q is constructed to hold the unlabeled line segments during the algorithm. Each time, the ε-neighborhood $N_{\varepsilon}(l_i)$ of one unclassified line segment $l_i$ in the queue is computed based on the distance function shown in Eq. 3.19 and Eq. 3.20. If $|N_{\varepsilon}(l_i)| \geq MinLn$ is satisfied, a density-based set is generated; otherwise, the line segment is considered noise. This is repeated until all the unclassified line segments are examined. Second, the algorithm attempts to expand clusters. Assuming there are M neighborhoods generated in the first step, a similar process is repeated for each neighborhood in terms of the line segments it contains. If there are other neighborhoods connected to it, then a cluster is generated. Finally, a trajectory cardinality check is conducted to ensure that all the line


segments inside the cluster come from different individual trajectories. More details can be found in [63].

$$N_{\varepsilon}(l_i) = \{ l_j \in Q \mid dist(l_i, l_j) \leq \varepsilon \}$$ (Equation 3.19)

$$dist(l_i, l_j) = d_{\perp}(l_i, l_j) + d_{\parallel}(l_i, l_j) + d_{\theta}(l_i, l_j)$$ (Equation 3.20)

where,

$N_{\varepsilon}(l_i)$: is the ε-neighborhood of one unclassified line segment $l_i$.

ε: is a threshold to determine the distance between any pair of line segments.

$dist(l_i, l_j)$: is the distance between two line segments.

$d_{\perp}(l_i, l_j)$: is the perpendicular distance between two line segments.

$d_{\theta}(l_i, l_j)$: is the angle distance between two line segments.

$d_{\parallel}(l_i, l_j)$: is the parallel distance between two line segments.
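The distance function and the neighborhood query can be sketched as follows for 2-D, non-degenerate line segments given as pairs of points. The component definitions follow the usual TRACLUS formulation in [63] with unit weights, which is an assumption about how the thesis applies Eq. 3.20; the function names are hypothetical.

```python
import math


def _project(p, s, e):
    """Return (distance from p to the line s-e, projection point of p on that line)."""
    dx, dy = e[0] - s[0], e[1] - s[1]
    u = ((p[0] - s[0]) * dx + (p[1] - s[1]) * dy) / (dx * dx + dy * dy)
    proj = (s[0] + u * dx, s[1] + u * dy)
    return math.dist(p, proj), proj


def segment_distance(a, b):
    """Sum of the perpendicular, parallel and angle distances (unit weights)."""
    # The longer segment acts as the reference segment (s_i, e_i).
    (si, ei), (sj, ej) = (a, b) if math.dist(*a) >= math.dist(*b) else (b, a)
    l1, ps = _project(sj, si, ei)
    l2, pe = _project(ej, si, ei)
    d_perp = 0.0 if l1 + l2 == 0 else (l1 ** 2 + l2 ** 2) / (l1 + l2)
    d_par = min(min(math.dist(ps, si), math.dist(ps, ei)),
                min(math.dist(pe, si), math.dist(pe, ei)))
    len_i, len_j = math.dist(si, ei), math.dist(sj, ej)
    cos_t = ((ei[0] - si[0]) * (ej[0] - sj[0])
             + (ei[1] - si[1]) * (ej[1] - sj[1])) / (len_i * len_j)
    cos_t = max(-1.0, min(1.0, cos_t))
    d_theta = len_j if cos_t < 0 else len_j * math.sqrt(1 - cos_t ** 2)
    return d_perp + d_par + d_theta


def neighborhood(segments, i, eps):
    """Indices of all segments within distance eps of segment i (segment i included)."""
    return [j for j, seg in enumerate(segments)
            if segment_distance(segments[i], seg) <= eps]
```

A segment whose neighborhood contains at least MinLn segments would then seed or extend a cluster, while the others are provisionally marked as noise.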

As mentioned earlier, the goal of trajectory modeling is to detect and recognize movement patterns not only for individuals but also for multiple people. By visualizing the trajectory clustering results based on the methodology discussed in Section 3.2.3, we may be able to understand the dynamics behind the geospatial movement patterns in these two respects. In addition, the clustering results may also be applied to help achieve some high-level objectives. For instance, incorporating the facility layout of a specific setting could help in understanding indoor space utilization or public space crowd control, so as to generate guidance for relocating various resources and improving user experience. In the next chapter, two case studies on human gait and geospatial trajectory are discussed to validate the methodologies.


3.2.4 Step 4: Model Visualization

The goal of human geospatial trajectory modeling is to discover all utilized regions in order to better understand human movement dynamics, to describe how space utilization patterns evolve over different time periods based on the clustering results, and to suggest possible improvements to indoor space design. From Step 1 to Step 3, researchers obtain the total number of clusters, the number of line segments included in each cluster, the number of different individuals included, and the locations of these line segments in each cluster. Clustering visualization helps researchers better understand how the indoor space is utilized and may lead to a better indoor space design.

In the previous sections, a utilized region is assumed to be a location in the design space that contains clusters of individuals. Based on the data mining trajectory clustering methodology, several common movement patterns can be detected from different individual trajectories; these correspond to the clusters of common movement patterns. Second, based on the clustering result, the total number of clusters generated and the number of individuals included in each one can be obtained. In addition, the evolution of indoor space utilization, based on the change of movement patterns over different time periods, is also addressed in order to capture changes in human movement behavior.


Chapter 4

Case Studies and Discussion

In order to validate the methodologies proposed in Chapter 3, a suitable case study is chosen for human gait modeling and for geospatial trajectory modeling, respectively. A Parkinson’s disease (PD) detection case study is introduced to illustrate the human gait modeling. The objective of this case study is to propose a non-invasive motion tracking methodology that will serve as a healthcare decision support system, capable of predicting the emergence of PD based on extracted PD gait patterns. An indoor design space utilization case study is presented to validate the human geospatial movement component of the proposed methodology. The objective is to understand how the indoor space is utilized based on the density of all the trajectories.

For the data acquisition in the two case studies, volunteer participants from the university were invited. Since both studies involve human volunteers whose personal information could be identifiable during the experiments and could raise privacy concerns, skeletal frames are used in the human gait study, while user IDs are tracked anonymously in the geospatial trajectory study. It is important to note that all the experiments were designed and carried out following all the guidelines and rules enforced by the Institutional Review Board (IRB) and Office for Research Protections (ORP) for research involving human participants.

The details about PD detection are discussed in Section 4.1 while the trajectory clustering

details are discussed in Section 4.2.


4.1 Parkinson’s disease detection based Case Study

Parkinson’s disease (PD) is a motor-related disease that affects more than one million people in North America and is the second most common neurological disorder after Alzheimer’s disease [7],[68-69]. PD results from the death of dopamine-generating cells in a region of the midbrain called the substantia nigra [66]. The symptoms of PD include shaking, muscle rigidity, slowness of movement, difficulty with walking, and some vocal problems; however, the most obvious symptoms are gait-related, especially during the early stages of PD [8]. Here the early PD stages are defined as stages I to III in the Hoehn and Yahr staging of PD [8].

PD is currently diagnosed based on published criteria such as the Unified Parkinson's Disease Rating Scale (UPDRS) [67]. Since the cause of neuron cell death is still unclear, it is sometimes difficult to diagnose PD accurately, and a misdiagnosis rate of approximately 20%-25% is expected in clinical PD diagnosis [68]. In addition, the current clinical PD diagnosis process places a high demand on human resources and facilities, which could increase the financial burden not only on PD patients but also on insurance providers and even the government. These limitations may cause PD patients to occupy more healthcare resources, experience more side effects, and be managed less efficiently. Some data mining based methodologies have also been applied to PD diagnosis. Even though they have proved effective in PD recognition, their fundamental limitation is that they rely on predetermined assumptions about the biomarkers, which may reduce the final PD modeling performance. For example, hands and feet are usually taken as the biomarkers to track and capture PD [10]. However, these may not be the best features for predicting PD in terms of accuracy and robustness of prediction.

Due to the disadvantages of current PD diagnosis, there is a demand for an integrated PD

detection system that is capable of identifying the emergence of PD motor symptoms in a cost-


efficient, objective, and non-invasive way. The proposed data mining driven methodology is designed to meet this demand.

4.1.1 PD Data Acquisition and Preprocessing

For these experiments, the Microsoft Kinect was mounted at a height of 3 feet 10 inches above the floor. Each subject’s body presence was verified, and the camera angle was adjusted by having the subject stand relaxed while facing the Kinect at a distance of 10 feet. Each volunteer then took part in the walking experiment, in which one frame of human gait was collected approximately every 30 ms. In this forward walking (FW) experiment, the subject was asked to first take 2-3 steps backward (4 feet) from the point of camera calibration, still remaining within the distance limit of the device. Subjects were then instructed to walk comfortably toward the Kinect and were not given any specific instructions regarding side of initiation. Finally, each individual's gait data set was labeled with a class variable (i.e., subject is PD or control), since the PD status was known before the experiment. An overhead view of the experiment is shown in Fig. 4-1.

In this forward walking experiment, the subject pool consists of a total of seven PD patients without medication and seven controls without PD symptoms. At a data sampling rate of approximately 33 Hz, more than one thousand frames were collected for each subject during the walking experiment.


Figure 4-1. PD forward walking experiment overhead view.

In the next step, data preprocessing is conducted to reduce noise in the original data set. For example, in the FW experiment, an arm may not be captured when it swings behind the body, in which case zeros are generated in multiple successive frames. This step is summarized as follows:

1. The velocity and acceleration of each node are generated in the X, Y, and Z directions, in the same way as the position data;

2. The ratios of position, velocity, and acceleration are generated between every two nodes in the X, Y, and Z coordinates to reduce human motion variation;

3. PD status is the response variable; PD is coded as TRUE and control is coded as FALSE;

4. Two data sets are finally generated. The first is the PD-OFF data set, which contains the data from seven PD patients without medication. The second is the Control data set, which

Page 42: A PROPOSED DATA MINING DRIVEN METHDOLOGY FOR …

35

contains the data from seven controls. There are 1891 features included in each of the two data

sets.
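The preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: it assumes forward finite differences over a fixed 30 ms frame interval for velocity and acceleration, and it returns `None` for a ratio whose denominator is zero. The function names, the sample values, and the 20-joint skeleton are assumptions. Under that joint count, the total is 20 × 3 × 3 = 180 per-joint values plus C(20, 2) × 3 × 3 = 1710 ratios, i.e., 1890 input features, which is consistent with the 1891 features reported in this section once the class variable is included.

```python
from itertools import combinations

FRAME_DT = 0.030  # seconds between frames (about 30 ms, from the text)


def finite_difference(series, dt=FRAME_DT):
    """Forward difference of a 1-D series; the result is one element shorter."""
    return [(b - a) / dt for a, b in zip(series, series[1:])]


def safe_ratio(a, b):
    """Ratio a/b, or None when the denominator is zero (e.g. a dropped joint)."""
    return a / b if b != 0 else None


def pairwise_ratios(values):
    """Ratios between every two nodes for one axis and one quantity in one frame.

    `values` maps a joint name to its value on that axis.
    """
    return {
        (i, j): safe_ratio(values[i], values[j])
        for i, j in combinations(sorted(values), 2)
    }


def feature_count(n_joints, n_axes=3, n_quantities=3):
    """Per-joint features plus pairwise-ratio features (class variable excluded)."""
    per_joint = n_joints * n_axes * n_quantities
    n_pairs = n_joints * (n_joints - 1) // 2
    return per_joint + n_pairs * n_axes * n_quantities


# Hypothetical X positions (metres) of one joint over four frames.
x_pos = [0.10, 0.13, 0.19, 0.28]
x_vel = finite_difference(x_pos)  # three velocity samples
x_acc = finite_difference(x_vel)  # two acceleration samples

print(feature_count(20))  # 1890
```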

4.1.2 PD-based Data Mining Knowledge Discovery and Evaluation

In the first step, feature selection based on the CFS algorithm described in Section 3.1 is conducted to generate the optimal feature subset for PD detection in the forward walking experiment. CFS selects 32 features from the 1890 original features (the remaining variable is the output variable). Among these 32 features, 18 are related to position, 9 are related to velocity, and 1 is related to acceleration; the remaining 4 are ratios.
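For reference, CFS scores a candidate subset by rewarding correlation with the class and penalizing redundancy among the features. The sketch below implements the merit heuristic from Hall's work [48] with a greedy forward search. The search strategy is a simplification chosen for brevity, and the feature names and correlation values are invented for illustration; none of them come from the study's data.

```python
import math


def cfs_merit(subset, r_cf, r_ff):
    """CFS merit: k*mean(r_cf) / sqrt(k + k*(k-1)*mean(r_ff))."""
    k = len(subset)
    if k == 0:
        return 0.0
    mean_cf = sum(abs(r_cf[f]) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    mean_ff = (
        sum(abs(r_ff[frozenset(p)]) for p in pairs) / len(pairs) if pairs else 0.0
    )
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)


def greedy_cfs(features, r_cf, r_ff):
    """Add the feature that most improves the merit until no feature helps."""
    selected, best = [], 0.0
    while True:
        candidates = [f for f in features if f not in selected]
        scored = [(cfs_merit(selected + [f], r_cf, r_ff), f) for f in candidates]
        if not scored:
            return selected
        merit, feature = max(scored)
        if merit <= best:
            return selected
        selected.append(feature)
        best = merit


# Invented feature-class correlations for three hypothetical gait features.
r_cf = {"hip_x": 0.60, "knee_x": 0.40, "hand_vx": 0.30}
# Invented feature-feature correlations.
r_ff = {
    frozenset(("hip_x", "knee_x")): 0.95,   # highly redundant pair
    frozenset(("hip_x", "hand_vx")): 0.10,
    frozenset(("knee_x", "hand_vx")): 0.10,
}
print(greedy_cfs(list(r_cf), r_cf, r_ff))  # ['hip_x', 'hand_vx']
```

In this toy example the redundant feature is left out even though its class correlation is higher than that of the last feature selected, which is the behaviour the merit function is designed to produce.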

In the second step, multiple machine learning algorithms are employed to discover novel knowledge pertaining to the data acquired. As discussed in Section 3.1, different models have different advantages and may lead to different classification accuracies. By evaluating their performance with the 10-fold cross validation technique, the most accurate and reliable classification model can be identified. As Table 4-1 shows, the IBK classifier is the best classifier, with an accuracy of almost 99%; that is, the model identifies almost 99% of the human gait frames from the seven PD patients and seven controls correctly. The accuracies of J48 (a classifier based on C4.5) and random forest both exceed 90%. The worst model is logistic regression, with an accuracy of only about 64.3%. The confusion matrices for forward walking are also given in Table 4-1; they confirm that logistic regression has lower accuracy than the SVM, J48, random forest, and IBK models. The values of the other performance measures are given in Table 4-2. For example, the IBK classifier achieves a TP rate of 0.986. Based on these two tables, IBK and random forest are the best classifiers for the forward walking experiment.
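The evaluation protocol can be illustrated with a small, dependency-free sketch. It runs k-fold cross validation with a 1-nearest-neighbour rule, the simplest instance-based learner of the kind that IBK implements [59]. The toy data, the assignment of samples to folds by index, and all function names are assumptions made for illustration; this is not the toolkit or data used in the study.

```python
import math


def nearest_neighbour_predict(train, query):
    """Return the label of the training sample closest to `query` (Euclidean)."""
    _, label = min(train, key=lambda item: math.dist(item[0], query))
    return label


def k_fold_accuracy(samples, k=10):
    """Hold out every k-th sample in turn and average the per-fold accuracy."""
    fold_scores = []
    for fold in range(k):
        test = [s for i, s in enumerate(samples) if i % k == fold]
        train = [s for i, s in enumerate(samples) if i % k != fold]
        if not test:
            continue
        correct = sum(nearest_neighbour_predict(train, x) == y for x, y in test)
        fold_scores.append(correct / len(test))
    return sum(fold_scores) / len(fold_scores)


# Toy frames: (feature vector, is_PD). Two well-separated groups.
samples = [((0.1 * i, 0.0), False) for i in range(10)]
samples += [((5.0 + 0.1 * i, 1.0), True) for i in range(10)]

print(k_fold_accuracy(samples, k=10))  # 1.0
```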

Table 4-1. Algorithm performance in the walking experiment (rows: actual class; columns: predicted class).

Algorithm                   Actual   PD    Control  Sum   Accuracy
IBK                         PD       1498  25       1523  98.6%
                            Control  16    1349     1365
                            Sum      1514  1374     2888
Binary Logistic Regression  PD       1079  444      1523  64.3%
                            Control  586   779      1365
                            Sum      1665  1223     2888
J48                         PD       1414  109      1523  92.1%
                            Control  120   1245     1365
                            Sum      1534  1354     2888
SVM                         PD       1055  468      1523  64.5%
                            Control  556   809      1365
                            Sum      1611  1277     2888
Random Forest               PD       1472  51       1523  95.4%
                            Control  82    1283     1365
                            Sum      1554  1334     2888

Table 4-2. Other evaluation measures for the algorithms in the walking experiment.

Algorithm                   TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
IBK                         0.986    0.014    0.986      0.986   0.986      0.986
Binary Logistic Regression  0.643    0.364    0.643      0.643   0.642      0.705
J48                         0.921    0.08     0.921      0.921   0.921      0.929
SVM                         0.645    0.36     0.645      0.645   0.645      0.643
Random Forest               0.954    0.048    0.954      0.954   0.954      0.991
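The entries of Table 4-2 can be recomputed from the confusion matrices in Table 4-1. The sketch below assumes that the reported rates are class-weighted averages, with weights proportional to the number of frames in each actual class; this is an assumption about the evaluation tool rather than something stated in the text. Under that assumption, the random forest matrix reproduces the reported accuracy, FP rate, and precision.

```python
def weighted_metrics(matrix):
    """Class-weighted accuracy, FP rate and precision for a confusion matrix.

    matrix[i][j] is the number of frames of actual class i predicted as class j.
    """
    n_classes = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n_classes)) for j in range(n_classes)]
    total = sum(row_totals)

    accuracy = sum(matrix[c][c] for c in range(n_classes)) / total
    fp_rate = precision = 0.0
    for c in range(n_classes):
        weight = row_totals[c] / total           # share of frames in class c
        false_pos = col_totals[c] - matrix[c][c]  # other classes predicted as c
        negatives = total - row_totals[c]         # frames not in class c
        fp_rate += weight * false_pos / negatives
        precision += weight * matrix[c][c] / col_totals[c]
    return accuracy, fp_rate, precision


# Random forest, rows = actual (PD, Control), columns = predicted (PD, Control).
random_forest = [[1472, 51], [82, 1283]]
acc, fpr, prec = weighted_metrics(random_forest)
print(round(acc, 3), round(fpr, 3), round(prec, 3))  # 0.954 0.048 0.954
```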

To summarize, the performance of the machine learning classifiers differs in the FW experiment. Among them, IBK, random forest, and J48 outperform the other two classifiers on multiple evaluation measures and could be applied in future PD detection applications. For example, by examining the features extracted by these three models, researchers can identify the features commonly relevant to PD detection. In addition, the average output of these three algorithms may be taken as a single quantitative PD detection result for PD comparison. However, this case study is a pilot study and has several limitations. The main one is that only seven PD patients and seven controls were involved in validating the human gait based modeling. One possible direction for future work is to involve more subjects of different ages while keeping the same proportion of males and females. Furthermore, quantifying the multi-stage scales currently used to track long-term PD evolution (e.g., UPDRS) may help improve long-term PD management.

4.2 Geospatial Trajectory Clustering

In this section, the geospatial trajectory based case study is discussed. The objective is to extract individual movement patterns and compare them in order to generate clusters of common movement patterns that could help explain the motivations behind these activities. To achieve this goal, the related data acquisition is discussed in Section 4.2.1, followed by the modeling and visualization in Section 4.2.2.

4.2.1 Geospatial Trajectory Data Acquisition and Preprocessing

The data collection is conducted in the Learning Factory at The Pennsylvania State University and covers the 3,500 sq. ft. of the facility's lab, work, and shop space (see Fig. 4-2) [74-75]. The facility is designed for students in the College of Engineering to conduct design and other related work. The BuzNet Real-Time Locating System (RTLS) was used to track the trajectories of teaching assistants (TAs) at the Learning Factory [71].

Figure 4-2. The learning factory layout.

In the experiment, twelve battery-powered BuzNet tags are provided to TAs while they are on duty. Each TA is assumed to wear a tag while guiding students’ experiments until the work is done and the tag is returned to the container. By collecting and analyzing the TAs’ trajectories over a semester, we are able to understand their trajectory patterns and dynamics. During each experiment, the 2-D (X-Y) position data are updated about every two minutes with the corresponding time stamp, tag ID, and sequence number. These data are also stored in a database automatically.

4.2.2 Geospatial Trajectory based Knowledge Discovery and Explanation

The results of the partitioning algorithm show that it is able to approximate each original individual trajectory with a minimum number of characteristic points. For example, the original trajectory of User 1 contains 13 position nodes (shown in Table 4-3); however, only Points 1, 4, 12, and 13 are selected as characteristic points to approximate the original trajectory (shown in Fig. 4-3). Similar results can be seen for User 2. For a clearer understanding, the trajectory visualization of User 1 is shown in Fig. 4-4. The original trajectory of User 1 is represented as black nodes connected by a green line. Based on the proposed partitioning algorithm, the trajectory is approximated by red dots connected by a black line.

Table 4-3. Original trajectory of User 1.

User number  Sequence number  X location  Y location  Date       Time
1            1                18.6        11.8        1/20/2012  18:15:16
1            2                21.4        14.9        1/20/2012  18:17:17
1            3                21.5        15.1        1/20/2012  18:19:16
1            4                20.8        15.2        1/20/2012  18:21:17
1            5                20.5        15.6        1/20/2012  18:23:18
1            6                20.7        15.1        1/20/2012  18:25:18
1            7                21.1        15          1/20/2012  18:27:18
1            8                21.5        15.2        1/20/2012  18:29:18
1            9                21.2        15.1        1/20/2012  18:31:18
1            10               20.9        15.4        1/20/2012  18:33:17
1            11               20.9        15.4        1/20/2012  18:35:18
1            12               21.1        15.2        1/20/2012  18:37:18
1            13               18.6        11.8        1/20/2012  18:39:18
2            1                18.6        11.8        1/20/2012  18:43:18
2            2                20.9        15.6        1/20/2012  18:45:18
2            3                21.7        15.3        1/20/2012  18:47:18
2            4                20.9        15.8        1/20/2012  18:49:16
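As a worked example of the approximation, the sketch below takes the User 1 coordinates from Table 4-3 and the characteristic points reported in the text (Points 1, 4, 12, and 13) and builds the three approximating line segments. It does not implement the partitioning algorithm itself: the characteristic-point indices are taken as given, and the helper name is hypothetical.

```python
import math

# (x, y) locations of User 1, sequence numbers 1..13, from Table 4-3.
user1 = [
    (18.6, 11.8), (21.4, 14.9), (21.5, 15.1), (20.8, 15.2), (20.5, 15.6),
    (20.7, 15.1), (21.1, 15.0), (21.5, 15.2), (21.2, 15.1), (20.9, 15.4),
    (20.9, 15.4), (21.1, 15.2), (18.6, 11.8),
]
characteristic = [1, 4, 12, 13]  # sequence numbers reported in the text


def segments_from(points, sequence_numbers):
    """Connect consecutive characteristic points into line segments."""
    chosen = [points[n - 1] for n in sequence_numbers]
    return list(zip(chosen, chosen[1:]))


approx = segments_from(user1, characteristic)
print(len(user1) - 1, "original segments ->", len(approx), "approximating segments")
for start, end in approx:
    print(start, "->", end, "length", round(math.dist(start, end), 2))
```

The twelve segments between consecutive position nodes are reduced to three, which is the kind of compression the partitioning step is meant to achieve before clustering.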

Figure 4-3. Extracted characteristic points of User 1

Figure 4-4. Visualization of trajectory partitioning for User 1.

The partitioning algorithm generated a total of 1287 line segments. With ε = 1 and MinLn = 8, the clustering algorithm was applied to these line segments to generate clusters. Each effective line segment in the queue was then assigned to a cluster, together with the original trajectory to which it belongs. Note that some line segments cannot be assigned to any cluster because they do not satisfy the parameters ε and MinLn; we label these line segments as noise. For example, in Table 4-4, Line 1 and Line 2 are grouped into Cluster 1 while Line 3 is grouped into Cluster 2, even though all three lines come from the same trajectory. Since an original trajectory can contain multiple line segments, a single trajectory can be grouped into different clusters, which helps provide more detail about trajectory patterns.
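The role of ε and MinLn can be made concrete with a short sketch. It follows the line-segment distance of the partition-and-group framework [63], which combines perpendicular, parallel, and angle components, and marks a segment as a core segment when at least MinLn segments, itself included, lie within ε. Equal weights for the three components, the function names, and the toy segments are assumptions; the cluster-expansion step is omitted.

```python
import math


def _project(point, start, end):
    """Orthogonal projection of `point` onto the line through start and end."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    t = ((point[0] - start[0]) * dx + (point[1] - start[1]) * dy) / (dx * dx + dy * dy)
    return (start[0] + t * dx, start[1] + t * dy)


def segment_distance(seg_a, seg_b):
    """Sum of perpendicular, parallel and angle distances between two segments."""
    # The longer segment serves as the reference line.
    longer, shorter = sorted((seg_a, seg_b), key=lambda s: math.dist(*s), reverse=True)
    (si, ei), (sj, ej) = longer, shorter
    len_i, len_j = math.dist(si, ei), math.dist(sj, ej)
    if len_i == 0:
        return math.dist(si, sj)  # both segments degenerate to points

    ps, pe = _project(sj, si, ei), _project(ej, si, ei)
    l1, l2 = math.dist(sj, ps), math.dist(ej, pe)
    d_perp = (l1 * l1 + l2 * l2) / (l1 + l2) if l1 + l2 > 0 else 0.0
    d_par = min(math.dist(ps, si), math.dist(ps, ei),
                math.dist(pe, si), math.dist(pe, ei))

    if len_j == 0:
        d_angle = 0.0
    else:
        cos_t = ((ei[0] - si[0]) * (ej[0] - sj[0]) +
                 (ei[1] - si[1]) * (ej[1] - sj[1])) / (len_i * len_j)
        sin_t = math.sqrt(max(0.0, 1.0 - cos_t * cos_t))
        # Angles of 90 degrees or more are charged the full segment length.
        d_angle = len_j * sin_t if cos_t > 0 else len_j
    return d_perp + d_par + d_angle


def core_segments(segments, eps, min_ln):
    """Indices of segments with at least `min_ln` segments within `eps`."""
    return [
        i for i, s in enumerate(segments)
        if sum(segment_distance(s, other) <= eps for other in segments) >= min_ln
    ]


# Toy data: three nearly coincident parallel segments and one distant outlier.
segments = [((0, 0), (4, 0)), ((0, 0.3), (4, 0.3)), ((0, 0.6), (4, 0.6)),
            ((10, 10), (10, 14))]
print(core_segments(segments, eps=1.0, min_ln=3))  # [0, 1, 2]
```

In the toy data the outlier has no neighbours other than itself, so it cannot seed a cluster; segments left unassigned in this way correspond to the noise label described above.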

Table 4-4. Example result based on clustering algorithm.

Line Segment No.  Cluster No.  Trajectory No.
Line 1            C1           Tra 1
Line 2            C1           Tra 1
Line 3            C2           Tra 1
Line 4            C2           Tra 2
Line 5            C2           Tra 2
Line 6            C1           Tra 3
Line 7            C1           Tra 3
Line 8            C1           Tra 3
Line 9            C1           Tra 3

Table 4-5. Result of clustering algorithm.

Cluster No.  Total number of line segments  Cluster cardinality
C 1          58                             20
C 2          41                             18
C 3          8                              3
C 4          42                             15
C 5          15                             8
C 6          59                             14
C 7          48                             14
C 8          224                            46
C 9          322                            44

The final clustering result is presented in Table 4-5. Nine clusters are generated from 817 of the 1287 line segments obtained in the first step; the remaining 470 line segments are not assigned to any cluster. That is to say, about 63.5% of the individual movement patterns are shared among multiple people and are represented by the nine clusters. Moreover, Cluster 8 and Cluster 9, shown in blue and green in Fig. 4-5, are the most common movement patterns, since together they include 546 line segments from 90 individual trajectories. Cluster 8 helps explain the movements of about 18.7% of the people in the case study, and most of these movements occur in the middle two spaces (work space and shop space). There are also “back and forth” movement patterns, since most of the line segments are parallel. Cluster 9 explains the movements shared by 17.8% of the sample included in the case study. Compared with Cluster 8, more trajectory patterns are represented and more spaces are used, such as the PC room, the presentation room, and the toilet. As in Cluster 8, “back and forth” patterns are still present. Note that some lines fall outside the Learning Factory because people leave the building before they return their tags.
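The figures quoted in this paragraph follow directly from Table 4-5. The short check below recomputes them; the dictionary layout is only an illustrative choice.

```python
# Cluster -> (number of line segments, cluster cardinality), from Table 4-5.
clusters = {
    "C1": (58, 20), "C2": (41, 18), "C3": (8, 3), "C4": (42, 15), "C5": (15, 8),
    "C6": (59, 14), "C7": (48, 14), "C8": (224, 46), "C9": (322, 44),
}
TOTAL_SEGMENTS = 1287  # line segments produced by the partitioning step

clustered = sum(n for n, _ in clusters.values())
share = 100 * clustered / TOTAL_SEGMENTS
top_two_segments = clusters["C8"][0] + clusters["C9"][0]
top_two_trajectories = clusters["C8"][1] + clusters["C9"][1]

print(clustered, round(share, 1), top_two_segments, top_two_trajectories)
# 817 63.5 546 90
```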

Compared with the layout in Fig. 4-2, the clustering results shown in Fig. 4-5 provide a clearer picture of the human trajectory movement patterns as well as the indoor space utilization patterns.

Figure 4-5. Clustering visualization.

Figure 4-6. Clustering visualizations in the first period.

Figure 4-7. Clustering visualizations in the second period.

Figure 4-8. Clustering visualizations in the third period.

In order to detect possible evolution of movement patterns, the original trajectory data set was separated into three periods: from January 20th 2012 to February 21st 2012 for the first period, from February 22nd 2012 to March 22nd 2012 for the second, and from March 23rd 2012 to April 23rd 2012 for the last. The visualizations are shown in Fig. 4-6, Fig. 4-7, and Fig. 4-8. Several points should be noted. First, the utilized space increases over time from the first picture (Fig. 4-6) to the last (Fig. 4-8). Second, the similarity among clusters increases over time. One possible explanation is that, early in the semester, students have no specific assignments or tasks and simply wander around to become familiar with each section of the Learning Factory. As the semester goes on, however, students may need to design a prototype and then go to the machining room for milling. Toward the end of the semester, PC room usage decreases while presentation room usage increases, since students may have already completed their projects and be giving final presentations. To summarize, the methodology proposed in this thesis provides more information about geospatial trajectory patterns than simply mapping location points. In addition, it provides one approach to recognizing the utilization relationships among multiple spaces in order to capture indoor space utilization patterns, which can be taken as evidence for indoor space utilization optimization.
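The split into three periods can be expressed as a simple date lookup. The sketch below uses the period boundaries given in the text and the timestamp format of Table 4-3; the function name and the handling of dates outside the study window are assumptions.

```python
from datetime import date, datetime

# Inclusive period boundaries as stated in the text.
PERIODS = [
    (1, date(2012, 1, 20), date(2012, 2, 21)),
    (2, date(2012, 2, 22), date(2012, 3, 22)),
    (3, date(2012, 3, 23), date(2012, 4, 23)),
]


def period_of(timestamp):
    """Return 1, 2 or 3 for a 'M/D/YYYY HH:MM:SS' timestamp, or None if outside."""
    day = datetime.strptime(timestamp, "%m/%d/%Y %H:%M:%S").date()
    for number, start, end in PERIODS:
        if start <= day <= end:
            return number
    return None


print(period_of("1/20/2012 18:15:16"))  # 1
print(period_of("3/22/2012 09:00:00"))  # 2
```

Each trajectory record can then be routed to the clustering run for its period before the visualizations are produced.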

Chapter 5

Conclusions and Future Work

This thesis proposes a human movement tracking methodology for both human gait and geospatial trajectory that preserves privacy, meaning that a person is unidentifiable from the movement data collected. The methodology is partitioned into two components. The first component is human gait modeling, where the objective is to model and predict neurological patterns that influence human gait. In addition, the human gait variation problem is addressed by introducing ratios of position, velocity, and acceleration. The second component is human geospatial trajectory modeling, which aims to predict common regions of interest (CRI) in indoor design spaces in order to capture and optimize indoor space design. The experimental results show that the proposed human gait modeling is able to detect significant gait differences between PD patients and controls, and that the proposed human geospatial trajectory modeling is able to detect common regions of interest from multiple people in the Learning Factory, which can serve as a tool for future indoor space design. These research findings demonstrate the feasibility of employing multimodal sensors and supervised machine learning algorithms to model and predict human movement kinematics.

It is time to consider how this work can be expanded and improved upon in the future. In terms of human gait modeling, one possible direction for future work is to identify the relevant features common to multiple machine learning algorithms in order to find the features most relevant to the human gait class variable. For example, by examining the features selected by the different machine learning algorithms, researchers can recognize the features most predictive of PD. In terms of human geospatial trajectory modeling, one possible future extension is to add indoor space layout information in order to optimize indoor space utilization efficiency. For example, by adding facility layout information for the Learning Factory, designers would be able to better design the space and improve its utilization efficiency.

References

[1] B. James, Body language: 7 easy lessons to master the silent language. Saddle River, New

Jersey, 07458: FT Press, 2009.

[2] J. K. Aggarwal and Q. Cai, “Human motion analysis: a review,” in Proceedings of the

1994 IEEE Workshop on. IEEE, 1994, pp. 90–102.

[3] L. Wang, W. Hu, and T. Tan, “Recent developments in human motion analysis,” in

Pattern Recognition, vol. 36, no. 3, pp. 585–601.

[4] H. Fujiyoshi, “Real-time Human Motion Analysis by Image Skeletonization,” in Fourth

IEEE Workshop on. IEEE, 1998, pp. 15–21.

[5] “Human Gait,” http://en.wikipedia.org/wiki/Gait_(human).

[6] A. Hakeem, R. Vezzani, M. Shah, R. Cucchiara, and R. Emilia, Estimating Geospatial

Trajectory of a Moving Camera. Hong Kong: ICPR 2006, 2006, pp. 82–87.

[7] D. Gil and D. J. Manuel, “Diagnosing parkinson by using artificial neural networks and

support vector machines,” Global Journal of Computer Science and Technology, vol. 9,

no. 4, 2009.

[8] S. J. G. Lewis, T. Foltynie, a D. Blackwell, T. W. Robbins, a M. Owen, and R. a Barker,

“Heterogeneity of Parkinson’s disease in the early clinical stages using a data driven

approach.,” Journal of neurology, neurosurgery, and psychiatry, vol. 76, no. 3, pp. 343–8,

Mar. 2005.

[9] D. B. Calne, B. J. Snow, and C. Lee, “Criteria for Diagnosing Parkinson’s disease,”

Annals of Neurology, vol. 32, no. Supplement S1, pp. 125–127, 1992.

[10] J. Barth, M. Sunkel, K. Bergner, G. Schickhuber, J. Winkler, J. Klucken, and B. Eskofier,

“Combined analysis of sensor data from hand and gait motor function improves automatic

recognition of Parkinson’s disease,” in Engineering in Medicine and Biology Society

(EMBC), 2012 Annual International Conference of the IEEE, 2012, pp. 5122–5125.

[11] “Swimming video analysis,” http://www.swimsmooth.com/certifiedcoaches.html.

[12] and J. M. H. Smith, David J., Stephen R. Norris, “Performance evaluation of swimmers,”

Sports Medicine, vol. 32, no. 9, pp. 539–554, 2002.

[13] G. L. Foresti, “A Real-Time System for Video Surveillance of Unattended Outdoor

Environments,” in IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO

TECHNOLOGY, 1998, vol. 8, no. 6, pp. 697–704.

[14] M. Xu, L. Duan, C. Xu, and Q. Tian, “A fusion scheme of visual and auditory modalities

for event detection in sports video,” in Acoustics, Speech, and Signal Processing, 2003,

vol. 3, pp. 111–189.

[15] A. F. Smeaton, P. Over, and W. Kraaij, “Multimedia Content Analysis,” in Signals and

Communication Technology, 2009, pp. 151–174.

[16] and A. E. E. Prassler1, J. Scholz, Tracking People in a Railway Station during Rush-Hour.

1999, pp. 162–179.

[17] and A. T. Regazzoni, Carlo S., “Distributed data fusion for real-time crowding estimation,”

in Signal Processing, 1996, vol. 53, pp. 47–63.

[18] A. Fod, A. Howard, and A. Overview, “A Laser-Based People Tracker,” in Robotics and

Automation, 2002. Proceedings, 2002, no. May, pp. 3024–3029.

[19] M. S. L. Scanners, H. Zhao, and R. Shibasaki, “A Novel System for Tracking Pedestrians

Using Multiple Single-Row Laser-Range Scanners,” Systems, Man and Cybernetics, Part

A: Systems and Humans, IEEE Transactions, vol. 35, no. 2, pp. 283–291, 2005.

[20] L. W. Campbell and A. F. Bobick, “Recognition of human body motion using phase space

constraints,” in Proceedings of IEEE International Conference on Computer Vision, 1995,

pp. 624–630.

[21] N. H. Goddard, “Incremental Model-Based Discrimination of Articulated Movement

from Motion Features,” in Proceedings of the 1994 IEEE Workshop on. IEEE, 1994, pp.

89–94.

[22] I. A. Kakadiaris, D. Metaxas, R. Bajcsy, and I. Science, “Active Part-Decomposition,

Shape and Motion Estimation of Articulated Objects: A Physics-based Approach,”

Computer Vision and Pattern Recognition, 1994. Proceedings CVPR’94, pp. 980–984,

1994.

[23] R. Rashid, “Towards a system for the interpretation of moving light displays,” in Pattern

Analysis and Machine Intelligence, IEEE …, 1980, no. 6, pp. 574–581.

[24] G. Johansson, “Visual perception of biological motion and a model for its analysis,”

Perception & psychophysics, vol. 14, no. 2, pp. 201–211, 1973.

[25] M. K. Leung, Y. Yang, and M. Senior, “First Sight : A Human Body Outline Labeling

System,” Pattern Analysis and Machine Intelligence, IEEE Transactions, vol. 17, no. 4,

1995.

[26] G. Johansson, “Visual motion perception,” Scientific American, vol. 232, no. 6, pp. 76–88,

1975.

[27] J. A. J. k. A. Webb, Visually Interpreting The Motion of Objects in Space. Computer

Science Department, University of Texas at Austin: , 1981.

[28] I.-C. C. H. Chang, “Ribbon-Based Motion Analysis of Human Body Movements,” in In

Pattern Recognition, Proceedings of the 13th International Conference, 1996, pp. 436–

440.

[29] C. S. Works, “Image sequence analysis of real world human motion,” Pattern Recognition,

vol. 17, no. 1, 1984.

[30] D. . Gavrila, “The Visual Analysis of Human Movement: A Survey,” Computer Vision

and Image Understanding, vol. 73, no. 1, pp. 82–98, Jan. 1999.

[31] D. M. Gavrila and L. S. Davis, “3-D model-based tracking of humans in action,” in

Computer Vision and Pattern Recognition,, 1996, pp. 73–80.

[32] L. Goncalves, E. Di Bernardo, E. Ursella, and P. Perona, “Monocular tracking of the

human arm in 3D,” in Computer Vision, 1995, pp. 764–770.

[33] I. A. Kakadiaris and D. Metaxas, “3D Human Body Model Acquisition from Multiple

Views,” in Computer Vision, 1995. Proceedings., Fifth International Conference on. IEEE,

1995, pp. 618–623.

[34] R. Szeliski, O. K. Square, and S. B. Kang, “Recovering 3D Shape and Motion from Image

Streams using Non-Linear Least Squares,” in Computer Vision and Pattern Recognition,

1993. Proceedings CVPR ’93., 1993 IEEE Computer Society Conference, pp. 752–753.

[35] T. Nomura, T. Tasaki, and T. Kanda, “Questionnaire – Based Research on Opinions of

Visitors for Communication Robots at an Exhibition in Japan,” in Human-Computer

Interaction-INTERACT, 2005, pp. 685–698.

[36] T. Shibata, K. Wada, and K. Tanie, “Tabulation and analysis of questionnaire results of

subjective evaluation of seal robot at Science Museum in London,” Proceedings. 11th

IEEE International Workshop on Robot and Human Interactive Communication, pp. 23–

28, 2002.

[37] F. Girardin, F. D. Fiore, C. Ratti, and J. Blat, “Leveraging explicitly disclosed location

information to understand tourist dynamics: a case study,” Journal of Location Based

Services, vol. 2, no. 1, pp. 41–56, Mar. 2008.

[38] C. R. Yuji Yoshimura, Fabien Girardin, Juan Pablo Carrascal, “New Tools for Studying

Visitor Behaviors in Museum: A Case Study at the Louvre,” in and Communication

Technologies in Tourism 2012. Proceedings of the International conference in

Helsingborg (ENTER 2012)., pp. 15–27.

[39] H. Cao, N. Mamoulis, D. W. Cheung, P. Road, and H. Kong, “Mining Frequent Spatio-

temporal Sequential Patterns,” in Data Mining, Fifth IEEE International Conference, pp.

27–30.

[40] G. Andrienko and S. Augustin, “Visual Analytics Tools for Analysis of Movement Data,”

ACM SIGKDD Explorations Newsletter, vol. 9, no. 2, pp. 38–46, 2007.

[41] C. Charayaphan, “Communications Image processing system for interpreting in American

Sign Language motion,” Journal of Biomedical Engineering, vol. 14, no. 5, pp. 419–425,

1992.

[42] S. Tamura and S. Kawasaki, “Recognition of sign language motion images,” Pattern

Recognition, vol. 21, no. 4, pp. 343–353, Jan. 1988.

[43] F. Polana, R. Nelson, and N. York, “Low Level Recognition of Human Motion,” in

Motion of Non-Rigid and Articulated Objects, 1994., Proceedings of the 1994 IEEE

Workshop on. IEEE,, 1994, pp. 77–82.

[44] U. Kreljel, W. Ritter, R. Dbag, and U. Daimlerbenz, “Tracking Non-Rigid, Moving

Objects Based on Color Cluster Flow,” in IEEE Computer Society Conference, 1997, pp.

257–260.

[45] N. Johnson and D. Hogg, “Learning the distribution of object trajectories for event

recognition,” Image and Vision Computing, vol. 14, no. 8, pp. 609–615, Aug. 1996.

[46] T. T. Zhouyu Fu , Weiming Hu, “Similarity based vehicle trajectory clustering and

anomaly detection,” in Image Processing, 2005. ICIP 2005. IEEE International

Conference on (Volume:2 ), pp. 11–14.

[47] I. K. Fodor, “A Survey of Dimension Reduction Techniques,” 2002.

[48] M. A. Hall, “Correlation-based feature selection for machine learning,” Doctoral

dissertation, The University of Waikato, 1999.

[49] B. Fritzke, “Growing Cell Structures: A Self-Organizing Network for Unsupervised and

Supervised Learning,” Neural networks, vol. 7, no. 9, pp. 1441–1460, 1994.

[50] R. G. Ramani, G. Sivagami, and and G. S. Ramani, R. Geetha, “Parkinson disease

classification using data mining algorithms,” International Journal of Computer

Applications, vol. 32, no. 9, pp. 17–22.

[51] N. Landwehr, M. Hall, and E. Frank, “Logistic Model Trees,” Machine Learning, vol. 59,

no. 1–2, pp. 161–205, May 2005.

[52] A. Tsanas, M. A. Little, P. E. McSharry, J. Spielman, and L. O. Ramig, “Novel speech

signal processing algorithms for high-accuracy classification of Parkinson’s disease,”

Biomedical Engineering, IEEE Transactions on, vol. 59, no. 5, pp. 1264–1271, 2012.

[53] A. Ozcift, “SVM feature selection based rotation forest ensemble classifiers to improve

computer-aided diagnosis of Parkinson disease,” Journal of medical systems, vol. 36, no. 4,

pp. 2141–2147, 2012.

[54] S. Wu, “A Data Mining Analysis of The Parkinson’s Disease,” in iBusiness, 2011, vol. 03,

no. 01, pp. 71–75.

[55] S. M. Gabrilovich, Evgeniy and E. Gabrilovich, “Text Categorization with Many

Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive

with C4.5,” in Proceedings of the twenty-first international conference on Machine

learning, pp. 41–48.

[56] S. Abe, Support vector machines for pattern classification. Springer London Dordrecht

Heidelberg New York, 2010.

[57] L. E. O. Breiman, “Random Forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.

[58] M. F. Amasyalı, B. Diri, and M. F. Amasyal\i, Automatic Turkish Text Categorization in

terms of Author, genre and gender. Springer Berlin Heidelberg, 2006, pp. 221–226.

[59] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning algorithms,” Machine

Learning, vol. 6, no. 1, pp. 37–66, Jan. 1991.

[60] T. D’heygere, P. L. M. M. Goethals, and N. De Pauw, “Use of genetic algorithms to select

input variables in decision tree models for the prediction of benthic macroinvertebrates,”

Ecological Modelling, vol. 160, no. 3, pp. 291–300, Feb. 2003.

[61] A. H. Fielding and J. F. Bell, “A review of methods for the assessment of prediction errors

in conservation presence/absence models,” Environmental conservation, vol. 24, no. 1, pp.

38–49, 1997.

[62] E. Dakou, T. D’heygere, A. P. Dedecker, P. L. M. Goethals, M. Lazaridou-Dimitriadou, N.

Pauw, and N. De Pauw, “Decision Tree Models for Prediction of Macroinvertebrate Taxa

in the River Axios (Northern Greece),” Aquatic Ecology, vol. 41, no. 3, pp. 399–411, Jul.

2006.

[63] J. Lee and J. Han, “Trajectory Clustering : A Partition-and-Group Framework,” in

Proceedings of the 2007 ACM SIGMOD international conference on Management of data,

2007, pp. 593–604.

[64] S. Patel, K. Lorincz, R. Hughes, N. Huggins, J. Growdon, D. Standaert, M. Akay, J. Dy,

M. Welsh, and P. Bonato, “Monitoring motor fluctuations in patients with Parkinson’s

disease using wearable sensors,” Information Technology in Biomedicine, IEEE

Transactions on, vol. 13, no. 6, pp. 864–873, 2009.

[65] X. Huang, H. Chen, W. C. Miller, R. B. Mailman, J. L. Woodard, P. C. Chen, D. Xiang, R.

W. Murrow, Y.-Z. Wang, and C. Poole, “Lower low-density lipoprotein cholesterol levels

are associated with Parkinson’s disease,” Movement disorders, vol. 22, no. 3, pp. 377–381,

2007.

[66] “Parkinson’s disease introduction.” [Online]. Available:

http://en.wikipedia.org/wiki/Parkinson’s_disease.

[67] P. Martinez-Martin, A. Gil-Nagel, L. M. Gracia, J. B. Gomez, J. Martínez-Sarriés, and F.

Bermejo, “Unified Parkinson’s disease rating scale characteristics and structure,”

Movement disorders, vol. 9, no. 1, pp. 76–83, 1994.

[68] a J. Hughes, S. E. Daniel, L. Kilford, and a J. Lees, “Accuracy of clinical diagnosis of

idiopathic Parkinson’s disease: a clinico-pathological study of 100 cases.,” Journal of

Neurology, Neurosurgery & Psychiatry, vol. 55, no. 3, pp. 181–184, Mar. 1992.

[69] T.W. Simpson and E. Kisenwether, “Driving entrepreneurial innovation through the

learning factory: The power of interdisciplinary capstone design projects,” in ASME

Design Engineering Technical Conferences-Design Education Conference., 2013.

[70] T. W. Lamancusa, John S and Simpson, “The Learning Factory–10 Years of Impact at

Penn State.,” in International Conference on Engineering Education, pp. 16–21.

[71] “Simple & Reliable Indoor Positioning Overview.” [Online]. Available:

http://www.buzbynetworks.com/buznet/buznet-overview.