HILBERT SPACE FILLING CURVE (HSFC) NEAREST NEIGHBOR CLASSIFIER · the use of space filing curves to index the training data. In Skubalska-Rafajlowicz, the data was indexed using Peano

HILBERT SPACE FILLING CURVE (HSFC) NEAREST NEIGHBOR CLASSIFIER

by

John David Reeder II

A thesis submitted in partial fulfillment of the requirements for the Honors in the Major Program in Computer Engineering

in the College of Engineering and Computer Science and in The Burnett Honors College at the University of Central Florida

Orlando, Florida

ABSTRACT

The Nearest Neighbor algorithm is one of the simplest and oldest classification

techniques. A given collection of historic data (Training Data) of known classification is

stored in memory. Then based on the stored knowledge the classification of an unknown

data (Test Data) is predicted by finding the classification of the nearest neighbor. For

example, if an instance from the test set is presented to the nearest neighbor classifier, its

nearest neighbor, in terms of some distance metric, in the training set is found. Then its

classification is predicted to be the classification of the nearest neighbor. This classifier is

known as the 1-NN (one-nearest-neighbor). An extension to this classifier is the k-NN

classifier. It follows the same principle as the 1-NN classifier with the addition of finding

k (k > 1) neighbors and taking the classification represented by the highest number of its

neighbors.

It is easy to see that the implementation of the nearest neighbor classifier is effort-

less, simply store the training data and their classifications. The drawback of this

classifier is found when a test instance is presented to be classified. The distance from the

test pattern to every point in the training set must be found. The required computations to

find these distances are proportional to the number of training points (N), which is

computationally complex, especially with N large.

The purpose of this thesis is to reduce the computational complexity of the testing

phase of the nearest neighbor by using the Hilbert Space Filling Curve (HSFC). The

HSFC NN classifier was implemented and its accuracy and computational complexity is

compared to the original NN classifier to test the validity of using the HSFC in

classification.

ACKNOWLEDGEMENTS

First and foremost I would like to thank God above for giving me the opportunity

to attend the University of Central Florida, and for helping me achieve all that I have set

out to accomplish

Second, I would like to thank my chair Dr. Georgiopoulos for guiding me through

this process and allowing me to work with his group in Machine Learning. I look forward

to starting graduate school with him as my advisor. Also, I would like to recognize my

committee Dr. Bauer, and Dr. Gonzalez, for there comments and guidance in finalizing

my paper

I would like to thank my family, for always believing in me and giving me to

confidence to move away from home to pursue my college education. They have always

supported me in every endeavor and without them none of this would be possible.

My friends Anton Kiriwas and James Ginn also deserve recognition for their help

in reviewing my paper, and giving advice from their lessons learned on their own

thesis’s.

Last but not least, I would like to thank Keisha Enfinger for her love and support

through out my college career. With out her I would not have been able to succeed away

from home.

iv

TABLE OF CONTENTS

Chapter 1: Introduction ....................................................................................................... 1

Chapter 2: Nearest Neighbor Algorithms ........................................................................... 5

2.1 1-NN Algorithm........................................................................................................ 5

2.1.1 1-NN Pseudo Code ............................................................................................ 6

2.2 k-NN Algorithm........................................................................................................ 7

2.2.1 k-NN Pseudo Code ............................................................................................ 8

2.3 Weighted k-NN algorithm ........................................................................................ 9

2.3.1 Weighted k-NN Pseudo Code.......................................................................... 11

Chapter 3: The Hilbert Space Filling Curve ..................................................................... 13

3.1 Butz’s Algorithm .................................................................................................... 15

2-D Example: Hilbert Index to 2-D Coordinate ....................................................... 16

2-D Example: 2-D coordinates to Hilbert Index....................................................... 18

3.2 HSFC NN Pseudo Code.......................................................................................... 21

Chapter 4: Experimental Results ...................................................................................... 22

4.1 Experimental Procedure.......................................................................................... 22

4.2 Test Databases ........................................................................................................ 24

4.2.1 Gaussian Databases.......................................................................................... 24

4.2.2 IRIS Plant Database ......................................................................................... 24

4.2.3 Abalone Database ............................................................................................ 25

4.3.4 Page blocks Database....................................................................................... 25

4.3 Results..................................................................................................................... 27

4.3.1 Large Database Experiments: Forest Cover Database..................................... 32

4.3.2 Multiple Curve HSFC NN ............................................................................... 35

Chapter 5: Conclusions on Results ................................................................................... 39

Chapter 6: Implementation Issues..................................................................................... 41

Chapter 7: Future Research............................................................................................... 45

References......................................................................................................................... 49

v

LIST OF FIGURES

Figure 1: sample distribution used in example by Cover and Hart..................................... 6

Figure 2: Example of 5-NN classification. ......................................................................... 8

Figure 3: Example of Weighted 5-NN.............................................................................. 10

Figure 4: 2-D Examples of Hilbert Space filling curves................................................... 13

Figure 5: Mapping 2-D coordinates to 1-D Hilbert Index ................................................ 14

Figure 6: Iris Plant Database mapped to the HSFC .......................................................... 20

Figure 7: Gaussian 2 class Accuracy vs. K....................................................................... 30

Figure 8: Run Time vs. Training set size.......................................................................... 31

Figure 9: Forest Cover Accuracy...................................................................................... 33

Figure 11: Multi-HSFC 1-NN vs. Single-HSFC 1-NN .................................................... 36

Figure 12: Multi-HSFC k-NN vs. Single-HSFC k-NN .................................................... 37

Figure 13: HSFC Neighbor Selection............................................................................... 43

Figure 14: Hilbert Decision Space Classifier Example .................................................... 47

vi

LIST OF TABLES

Table 1: Examples of HSFC Mapping.............................................................................. 20

Table 2: Results on small databases.................................................................................. 27

Table 3: Forest Cover Results........................................................................................... 33

Table 4: Multi vs Single HSFC accuracy on the Forest Cover Database ......................... 36

vii

LIST OF SYMBOLS/ABBREVATIONS

HSFC…………………………………………………………. Hilbert Space Filling Curve NN………………………………………………………………………..Nearest Neighbor PNN…………………………………………………………..Probabilistic Neural Network N………………………………………………………………..Number of Training Points Symbols used in Standard NN x …………. A vector representing the input attributes. This vector is of dimensionality n. k……………………………………………………….Number of Neighbors used in k-NN w………………………………………………………….. Weight used in Weighted k-NN Symbols used in HSFC NN n………………………………………………………….Number of dimensions of Data m …………………………………………………………………………. Order of HSFC M ………………………………………….. The number of bits in the HSFC index (n*m) r ………………………………………………………………………………. HSFC index Byte……………………………………………………………….a word containing n bits

ja ………………………Vector >< naaa ...21 representing a datum whose derived key is r .

Gaussian Database Naming Convention

gXc_YY……………………….X represents Number of Classes YY represents % overlap g2c_05……………………………………………………….. Gaussian 2-class 5% overlap

viii

Chapter 1: Introduction

Classification is the problem of correctly classifying (labeling) data based on the

features that they possess. For instance, an example of a classification problem is to

recognize the type of fruit that we are dealing with (e.g., banana, apple, pear, etc.) based

on certain features, such as color, smell, shape, and so on. In building such a classifier

system one relies on domain knowledge about the problem at hand (e.g., that the color of

the banana is yellow) and quite often on examples of data whose classification is known.

The task then becomes to efficiently use this knowledge to design a classifier that

performs this recognition (labeling), fast and accurately.

The Nearest Neighbor classifier is one of the oldest classifiers in use today. It has

remained popular because of its simple implementation and its guaranteed error rate of

less than twice the error rate of the Bayesian classifier, the best possible classifier when

statistical information about the data is known (Cover and Hart, 1967). The nearest

neighbor classifier can be described very simply. The classification (label) of a datum (of

certain features) and unknown classification is the same as the classification of any datum

of known classification whose features are closest in distance to the features of datum of

unknown classification. The assumption here is that we have available to us data of

known classification and the features of these data belong to a feature space for which

meaningful distances can be defined.

One of the problems of the nearest neighbor classifier is the time that it takes to

find the label of a datum of unknown classification. Assuming that we have stored the

1

information about the features and labels of N known points to determine the

classification of a new datum it requires that we calculate the distances of the new datum

from all the stored points. This is a calculation that is proportional to N and it is

prohibitively slow for large N. One of the ways that have been suggested in the literature

to deal with this problem is the design of a prototype nearest neighbor. A prototype

nearest neighbor is a nearest neighbor approach where compression of the stored data of

known classification is performed first, by clustering. Then, when a new datum of

unknown classification arrives its nearest prototype is first found and its classification is

predicted to be identical with the classification of the prototype. Since the number of

prototypes is, quite often, significantly smaller than the number of points that they

represent this approach significantly reduces the computational complexity of the nearest

neighbor approach. The disadvantage of the prototype nearest neighbor is that it results,

at times, in the reduction of accuracy attained by the original nearest neighbor approach.

Examples of fast approaches that have been introduced into the literature to speed

up the complexity of nearest neighbor include (Friedman, et al., 1975, Hart 1968). In

particular, in Hart you start with a single randomly chosen observation as the training set,

and then each additional data item is processed one at a time, adding it to the training set

only if it is misclassified by the nearest neighbor rule computed on the current training

set. While in Friedman the training data is stored in an optimized k-d tree, similar to a

binary tree, allowing the algorithm to search only the training data that is sufficiently

close to the test point for the nearest neighbor. The increase in speed comes from the

decreased number of training points that are searched for the nearest neighbor.

2

Another way of dealing with the high computational complexity of the nearest

neighbor classifier is by using the HSFC approach. Through this approach the N stored

points of dimensionality n are mapped into a 1-dimensional index, called the HSFC

index. The indexing happens in time. Then when a new datum arrives the

point with its closest index is found first (in time), and its classification is

designated to be the classification of the point of closest index. What makes this approach

successful, in addition to being computationally efficient, is that points whose Hilbert

indexes are close are close in the original n – dimensional space. However, there might

be points that are close in the original n – dimensional space whose Hilbert indices are

not close. Hence, although Hilbert indexing improves the speed of determining a nearest

neighbor, this advantage happens at the expense of not being able to find the nearest

neighbor at all times.

))(log( 2 NNO

))((log2 NO

This approach is similar to the method of (Skubalska-Rafajlowicz, et al, 1996) in

the use of space filing curves to index the training data. In Skubalska-Rafajlowicz, the

data was indexed using Peano and Sierspinski space filling curves. They show that these

curves were able to produce classification performance on par with the standard NN

approach. Their experiments were on very small data sets in the range of 80 to 200

Training points, and did not show the increase in classification speed. They dismissed the

Hilbert space-filling curve saying it provided lower performance, but personal

communication with Castro suggested that this was not the case. The experiments in this

study focus on the Hilbert space-filling curve and its performance on data sets in the

range of 2,000 to 500,000 training points, and dimensions ranging from 2 to 12.

3

The HSFC has been used in (Castro, 2004) to partition training data for the Fuzzy

ARTMAP classifier. The training data was partitioned into separate sets and trained on

individual ARTMAP classifiers. This approach decreased the training time for the large

training set without negatively affecting the classification accuracy. This also allowed the

problem to be implemented on a Beowulf cluster to further speed up the classification

time.

The HSFC has also been used in (J.K.Lawder, et al., 200) to index multi-

dimensional databases and to allow efficient querying on the indexed databases.

Lawder’s work provides step by step instructions for producing Hilbert indexes from

multi-dimensional data that were helpful during the implementation of the HSFC NN

classifier.

In the following sections several different nearest neighbor techniques are

introduced and pseudo code for each is provided. Also an explanation of the HSFC and

the procedure for indexing multi-dimensional data is provided with examples, followed

by experiments comparing the performance of the HSFC NN classifier with the Standard

NN classifier on several natural and generated databases. Experiments are also carried

out on the very large Forest Cover database. Lastly avenues for future exploration are

presented.

4

Chapter 2: Nearest Neighbor Algorithms

In order to validate the HSFC NN algorithm it is necessary to implement and test

the standard NN algorithm. For the purposes of this thesis the 1-NN, k-NN and Weighted

k-NN algorithms will be implemented, and will be subjected to the same testing process

as the HSFC NN algorithm. In this way the Standard NN algorithms will act as a

benchmark to gauge the success of the HSFC NN. In this section each of these algorithms

are described and their classification process is detailed and analyzed.

2.1 1-NN Algorithm

The single nearest neighbor classification implemented by Cover and Hart

is the most basic of all the NN implementations (Cover and Hart, 1967). Proposed in

1966, the paper showed that the single nearest neighbor implementation had a minimum

probability of error that was less the twice the Bayes probability of error, and because the

Bayes classifier is the optimum choice for a decision classifier, less than twice the

probability of error for any classification rule. The single nearest neighbor

implementation classifies the unknown observation by finding the nearest neighbor to the

observation by some relevant distance function, and assigning its class to the unknown

observation. Cover and Hart show that for a simple distribution the single nearest

neighbor algorithm will have a probability of error that is less than the k-NN probability

on the same distribution. This means that the 1-NN implementation is strictly better than

the k-NN implementation for distributions where each in-class distance is larger than any

5

of the between-class distances (Cover and Hart, 1967). The figure below is the set used in

their example.

Figure 1: sample distribution used in example by Cover and Hart

2.1.1 1-NN Pseudo Code Terminology: x : A vector representing the input attributes. This vector is of dimensionality n.

:jix The i-th vector of class j. The index i ranges from 1 to , where represents the

number of vectors that belong to class j. The index j ranges from 1 to J, where J

represents the number of classes that the measured vector x could belong.

jtN j

tN

x

Step 1: We store all the points in memory. j

ix Step 2: We present the test pattern to the 1-NN. x Step 3: We calculate the distance of from all the stored patterns ( ’s). The distance

of the pattern from pattern is defined as:

x jix

x jix

∑=

−=d

l

ji

ji lxlxxxdis

1

2))()((),(

Step 4: We find the minimum such distance. That is we find

),(min),( 1,1minj

iNiJj xxdisjxdis jt≤≤≤≤

=

Step 5: The predicted class for test pattern x is then class . minj

6

This process is repeated for every test pattern , whose classification we want 1-NN to

predict.

x

2.2 k-NN Algorithm

One of the short comings of the 1-NN approach is that it does not handle noise

data very well. Noise data are points in the training set that are misclassified. If the test

point’s nearest neighbor is misclassified, then the algorithm will misclassify the test

point. The k-NN algorithm overcomes this short-coming by considering the classification

of multiple neighbors. During the performance phase the algorithm will select k

neighbors, where k is a positive integer, and will give the test point the classification of

the class with the highest number of representatives in the set of neighbors. Using this

technique if one of the neighbors is misclassified the other neighbors will still give the

correct classification. This also increases performance on databases that are not

completely separable, when the data for different classes overlap each other. In these

cases the decision boundary is blurred. A test point that is near a class boundary could

have a neighbor that is correctly classified but still in the other class. Once again in this

situation the 1-NN approach could misclassify the point, but the k-NN, choosing multiple

neighbors, will classify the point correctly. In the following diagram an example of 5-NN

is shown. In this example if the 1-NN rule is used the algorithm would choose class 2, but

as you can see most of the points around the test point are class 1. The 5-NN rule shown

will classify the test point as class 2, because 4 out of 5 of the neighbors selected are

class 1.

7

Test Point Class 1 Class 2

Figure 2: Example of 5-NN classification.

2.2.1 k-NN Pseudo Code Terminology: x : A vector representing the input attributes. This vector is of dimensionality n.


number of vectors x that belong to class j. The index j ranges from 1 to J, where J

represents the number of classes that the measured vector could belong.

jtN j

tN

x

Step 1: We store all the points in memory. j

ix Step 2: We present the test pattern to the k-NN. x

8

Step 3: We calculate the distance of from all the stored patterns ( ’s). The distance


x jix

x jix

∑=

−=d

l

ji

ji lxlxxxdis

1

2))()((),(

Step 4: We find the k minimum such distances. We call the class labels corresponding to

these k smallest such distances as:

kjjj min

2min

1min ...,,,

Step 5: The predicted class for test pattern x is class . Class is the class that

appears more often in the discrete set .

minj minj

}...,,,{ min2min

1min

kjjj

This process is repeated for every test pattern , whose classification we want k-NN to

predict.

x

2.3 Weighted k-NN algorithm

The Weighted k-NN algorithm is almost identical to the standard k-NN algorithm

with the exception that each neighbor is given a weight depending on its distance from

the test point. The classifier decision is then based on the weights. The class with the

highest weight is the classification chosen for the test point. In most situations this

method will choose the same classification as the k-NN. The difference between the two

becomes useful when the test data is sparse. In a sparse dataset the training points are

spread thin throughout the training space, this means that during the k-NN classification

the algorithm might have to look far away from the test point to find k neighbors. It is

possible that in looking for neighbors the algorithm will cross the decision boundary and

9

select points of the wrong classification. The Weighted k-NN method is better suited for

this situation because it computes a weight for each neighbor based on the distance from

the test point. Neighbors that are very close to the test point receive a higher weight than

the neighbors that are further away. The weights of all of the neighbors with the same

classification are summed and the class with the highest weight is chosen for the test

point. The figure below shows an example of this situation. The k-NN classifier would

choose class 1 because the 3 out of 5 neighbors are class 1, but the weighted k-NN would

choose class 2 because the two neighbors of class 2 are much closer to the test point than

the 3 neighbors of class 1.


Figure 3: Example of Weighted 5-NN

10

2.3.1 Weighted k-NN Pseudo Code Terminology: x : A vector representing the input attributes. This vector is of dimensionality n.


number of vectors that belong to class j. The index j ranges from 1 to J, where J

represents the number of classes that the measured vector could belong.

jtN j

tN

x

x

:w The weight of a training point in relation to the test point. The closer the training

point is to the test point the higher the weight.

Step 1: We store all the points in memory. jix

Step 2: We present the test pattern to the k-NN. x Step 3: We calculate the distance of from all the stored patterns ( ’s). The distance


x jix

x jix

∑=

−=d

l

ji

ji lxlxxxdis

1

2))()((),(

Step 4: We find the k minimum such distances. We call the class labels corresponding to

these k smallest such distances as:

kjjj min

2min

1min ...,,,

11

Step 5: We find the weighted distance of the presented pattern from each of the k

closest points (found in Step 4). That is, we calculate (for every l, such

that )

x

kl ≤≤1

∑=

⋅

⋅⋅ = k

r

j

jj

nr

l

l

dis

disdis

1)(

)()(

),(

),(),(

min

min

min

xx

xxxx

∑=

−= k

i

ji

ji

xxdis

xxdisw

0

),(

),(1

Step 6: Weighted distances, calculated in Step 5, that are calculated from points

belonging to the same class are added together. The predicted class for test

pattern is class , which is the class from the group of classes

that produces the largest such sum of weighted distances.

x minj

}...,,,{ min2min

1min

kjjj

This process is repeated for every test pattern , whose classification we want k-NN to

predict.

x

12

Chapter 3: The Hilbert Space Filling Curve

A Space-filling curve is a continuous map of a one-dimensional interval) into a

two-dimensional area (a plane-filling function) or a three-dimensional volume.

The Hilbert curve is used to map an n-dimensional coordinate system to a 1-

dimensional index. The Hilbert curve is known to maintain some of the spatial

relationships of the n-dimensional space. This trait makes is useful in clustering multi-

dimensional data.

Figure 4: 2-D Examples of Hilbert Space filling curves

13

Figure 5: Mapping 2-D coordinates to 1-D Hilbert Index

The HSFC converts an n-dimensional space to a 1-dimensional space. In the

implementation of the HSFC NN algorithm this property is used to convert the Training

Data to a 1-dimensional array. This array is then sorted on the Hilbert index making it

possible to search the training space in time using a quick sort or a merge

sort. The Hilbert Index is found by using Butz’s Algorithm (Butz, 1971) to convert the n-

dimensional coordinates to a Hilbert Index.

))(log( 2 NNO

14

3.1 Butz’s Algorithm This section provides the pseudo code for Butz’s Algorithm, and gives simple

examples of the mapping process from 2-D coordinates to the 1-D Hilbert index, and the

reverse operation.

Algorithm Definitions n : number of dimensions

m : the order of approximation

M : the number of bits in a derived–key (n*m)

r : an N–Bit binary Hilbert derived–key expressed as a real number in the range [0, 1).

byte : a word containing n bits ijρ : where and },...,1{ ni∈ },...,1{ mj∈ A binary digit in r such that:

mnnnr ρρρρρρρρρ ..........0 3

231

222

21

112

11=

iρ : binary byte in thi r . 112

11

1 ... nρρρρ =

ja : A coordinate in dimension j of the point >< naaa ...21 whose derived key is r . It is

also expressed as a real number between [0, 1). ijα : A binary digit in the coordinate such that ja m

jjjja ααα ...21=

Principal position J: The first bit from the right that is different. If all of the bits are the same then it is the least significant bit (or the one farthest to the right). Position is counted from the left. Ex. 00110011 J = 6 01000000 J = 2 11111111 J = 8 Parity: Even or Odd depending on the number of 1’s in a byte. Ex. 01100110 = Even 01010001 = Odd

15

2-D Example: Hilbert Index to 2-D Coordinate Variables used in conversion: 1. : Principal position of iJ iρ

2. : Grey code of . where and

are the bits of and .

iσ iρ in

in

in

iiiii112211 ,..., −⊕=⊕== ρρσρρσρσ i

nσ

inρ

thn iσ iρ

3. : Obtained by complimenting in the position and if it is odd parity then

complimenting it in the Principal Position.

iτ iσ thn

4. :~ iσ Circular Shift to the right by iσ ∑−

=

1

1

i

kkJ

5. :~ iτ Calculated the same way as :~ iσ using iτ

6. : =iω 11 ~ −− ⊕ ii τω , where )0,...0,0,0(1 ≡ω

7. : iα ii σω ~⊕

2-D Example:

n = 2 m = 2 M = 4

1ρ = First 2 bits of r 2ρ = Second 2 bits of r

This algorithm is an iterative algorithm. There are m iterations. In the following steps

represents then iteration number. i

Let 0010.0=r001 =∴ρ , 102 =ρ

When converting from a Hilbert Index to an n-dimensional coordinate we want to find the value of from r . Also most of the calculations for multiple iterations can be done simultaneously.

iα

16

1. Find iJ

1J = 2 = 1 2J

2. Calculate iσThe first bit of matches the first bit of . The second bit of is the first and second bits of XORed with each other.

iσ iρ iσiρ

1σ = 00 = 11 2σ 3. Calculate iτ

Compliment in the position and if odd parity compliment the principal position.

iσ thn

1τ = 00 = 00 2τ4. Calculate iσ~

Never shift the first iteration. The second iteration and after shift right

∑−

=

1

1

i

kkJ

1~σ = 00 2~σ = 11 5. Calculate iτ~

Calculate the same way as iσ~ . 1~τ = 00 2~τ = 00

6. Calculate iω

1ω = 00 otherwise =iω 11 ~ −− ⊕ ii τω 1ω = 00 = 00 2ω

7. Calculate iαiα = ii σω ~⊕1α = 00 = 11 2α

8. Convert to iα ja

iα is a vector containing the bit of each coordinate. Therefore contains the first bits of each coordinate and contains the second

bit of each coordinate .

thi1α ja 2α

ja

1a = 0.01 = 0.01 2a

17

2-D Example: 2-D coordinates to Hilbert Index Variables used in conversion:

1. = iω 11 ~ −− ⊕ ii τω , where )0,...0,0,0(1 ≡ω

2. , where :~ iσ ii ωα ⊕ 11~ ασ ≡

3. : circular shift left times. First iteration does not shift. iσ iσ~ ∑−

=

1

1

i

kkJ

4. : iρ in

in

in

iiiii σρρσρρσρ ⊕=⊕== −121211 ,...,,

5. : principal position of iJ iρ

6. : Obtained by complimenting in the position and if it is odd parity then

complimenting it in the Principal Position.

iτ iσ thn

7. :~ iτ Calculated the same way as :~ iσ using iτ

When calculating the Hilbert Index from a coordinate each value must be calculated in the order above; one iteration at a time. The steps below have the values for each iteration side by side, but each iteration must be completed before continuing to the next.

2-D Example:

n = 2 m = 2 M = 4

1a = 0.01 = 0.01 2a1α = 00 = 11 2α

We want to find , we already know iρ iα

1. Calculate iω is always 00, each other iteration is 1ω 11 ~ −− ⊕ ii τω = 00 = 00 1ω 2ω 2. Calculate iσ~

each other iteration is 11~ ασ ≡ ii ωα ⊕ = 00 = 11 1~σ 2~σ 3. Calculate iσ Rotate circular left according to rule above. First iteration never rotates. iσ~

= 00 = 11 1σ 2σ

18

4. Calculate iρ i

nin

in

iiiii σρρσρρσρ ⊕=⊕== −121211 ,...,, = 00 = 10 1ρ 2ρ 5. Calculate iJ principal position of iρ = 2 = 1 1J 2J 6. Calculate iτ

Obtained by complimenting in the position and if it is odd parity then complimenting it in the Principal Position.

iσ thn

= 00 = 00 iτ iτ

7. Calculate iτ~

Calculated the same way as :~ iσ using iτ = 00 = 00 iτ~ iτ~

8. Find from r iρ After completing all iterations the Hilbert index mr ρρρ ....0 21=

0010.0=r

The following table and figure show examples of the Iris Plant Database mapped

to the Hilbert Space filling curve. As you can see from the Figure the HSFC does a good

job of clustering the data. The clustering properties of the Hilbert Space Filling curve are

what make it possible to implement the NN algorithm based on the Hilbert index. As

shown in the figure below the classes are easily separated, leading to very high accuracy

in the classification phase.

19

Table 1: Examples of HSFC Mapping Class Attr 1 Attr 2 Hilbert Index

Versicolor 0.450113 0.159633 0.194565

Virginica 0.653459 0.679114 0.534393

Figure 6: Iris Plant Database mapped to the HSFC

20

3.2 HSFC NN Pseudo Code Terminology (Hilbert NN) n : Number of dimensions m : Order of approximation M : The number of bits in a derived key )*( mnr : An binary Hilbert derived key where bitN − )1,0[∈r

ja : An n -dimensional vector representing the input attributes 1-NN Hilbert Step 1: Store Training Set in an Array Step 2: For each instance in the Training set use Butz’s Algorithm to convert to ja r Step 3: Sort the Training Set by r Step 4: Use Butz’s algorithm to convert to for each instance to be classified ja r Step 5: Use a Binary Search based on to find the Training Instances before and after

your test instance on the Hilbert Curve. r

Step 6: Test the difference between both Training Instances and the Test Instances r r Assign the Test Instance the class of the Training Instance that meets the

following criteria |)||,min(| 21 CTCT rrrr −−

Where and are the Hilbert Key’s of the Training Instances and is the

Hilbert Key of the Test Instance.

1Tr 2Tr Cr

21

Chapter 4: Experimental Results

This section details the experiments that were run to compare the HSFC NN and

the Standard NN. It gives a brief overview of the experimental procedure, as well as

presenting the results of the experiments.

4.1 Experimental Procedure

The experiments were carried out in MATLAB using the HSFC_NN_Alg.dll.

This DLL is capable of performing both the Standard NN and the HSFC NN. The

datasets are stored in MAT files, and M-file scripts control the flow of the experiments.

The RunNNExperiments script runs the experiment routine on each database, and the

TestHSFCNNAlg script controls the flow of each individual experiment. The

TestHSFCNNAlg script takes 3 arguments; DBName, NumClass, and Hilbert. DBName

is the name of the database to be tested; this matches the beginning of the file name that

contains the data, NumClass is the number of classes of the data, and Hilbert is a Boolean

variable, true if you want to use the HSFC, or false if you want to use the Standard NN.

The sequence of a single experiment is as follows:

1. Load the Training, Test, and XV sets from the mat files for the database.

2. Test Phase - Run the Algorithm in One NN mode, if using the HSFC the M

parameter is set to 7.

22

3. Cross Validation for k-NN Unweighted – The algorithm is run repeatedly,

each time changing the parameters m and k for HSFC and only k for Standard

NN. The m parameter is ranged from 7:12 and the k parameter is ranged from

2:20.

4. Test Phase - the algorithm is run once in k-NN Unweighted mode using the

parameters mmax and kmax – the parameters that maximized the performance

in the cross validation phase.

5. Cross Validation for k-NN Weighted – The algorithm is run repeatedly, each

time changing the parameters. M and K for HSFC and only K for Standard

NN. The m parameter is ranged from 7:12 and the k parameter is ranged from

2:20.

6. Test Phase – The algorithm is run once in k-NN Weighted mode using the

parameters mmax and kmax – The parameters that maximized the

performance in the cross validation phase.

7. Output Results – The HSFC_NN_Alg returns the Accuracy, an array of the

Result classification, and the running time of the algorithm. The

TestHSFCNNAlg script creates a mat file that contains the accuracy, running

times, and result class array for each of the test phase runs of the algorithm, it

also stores the running time for the entire experiment. The mat file is named

<DBName>_HSFC_Results.mat or <DBName>_NN_Results.mat depending

on which mode the experiment was run in.

23

The TestHSFCNNAlg script is run on each of the fifteen databases. After this the

OutputResults script is run to create a comma delimited text file that contains the

accuracy and runtimes for both the HSFC NN and the Standard NN for each database.

4.2 Test Databases This section will give a short description of each of the databases that were used

in the experiments. Some of these databases are artificially generated while others were

obtained from the UCI repository found here:

http://www.ics.uci.edu/~mlearn/MLRepository.html

4.2.1 Gaussian Databases

Twelve of the fifteen databases used in testing are artificially created Gaussian

databases. These databases are created by setting a mean value for the class and filling in

the points around the mean using a Gaussian distribution. All of these databases are

dimensionality n = 2. The differences in each of the databases are the number of classes

and the percent overlap of the classes. The number of classes available are 2, 4, and 6,

and the percent overlap values are 5 %, 15 %, 25 %, and 40 %. The percent overlap

determines the maximum accuracy that you can expect for each database. With a 5 % the

maximum accuracy that you can expect will be 95 %. This is because the 5 % of the data

that are overlapping have an equal probability to be any of the classes.

4.2.2 IRIS Plant Database

The Iris Plant Database was donated to the UCI Repository by Michael Marshall,

and was created by R.A. Fisher in 1936. It contains 3 classes with 50 instances each.

Each instance has four numeric attributes, and the purpose of the data is to classify the

24

http://www.ics.uci.edu/%7Emlearn/MLRepository.html

type of iris plant. The four attributes are the sepal length in cm, the sepal width in cm, the

petal length in cm, and the petal width in cm. The classes are iris setosa, iris versicolour,

and iris virginica. The underlying statistics for this data set are available, as well as any

other documentation needed. This data base is one of the most widely used because it has

been around for a long time, and it is very well documented. One of the drawbacks of this

database is that it has very few instances. The IRIS database used in this research has

been modified by adding points around the original data points increasing the total

number of observations. Also the data has been reduced to a dimensionality of n = 2 by

choosing two dimensions where the classes are linearly separable. And lastly one of the

classes has been dropped from the database. The final database is a 2 dimensional, 2 class

problem with around 5 % overlap in the classes.

4.2.3 Abalone Database

The abalone database was donated to the UCI repository in 1995 by Sam Waugh.

The classification task is to predict the age of an abalone from physical measurements.

The normal process for determining the age of an abalone is to cut the shell through the

cone, stain it, and then count the number of rings. The task is to classify the age using

easier to obtain measurements. This database has 29 classes and 8 attributes. This

database has also been altered by generating instances from the original data.

4.3.4 Page blocks Database

The abalone database was donated to the UCI repository in 1995 by Donato

Malerba. The problem consists of classifying all of the blocks of the page layout of a

document that have been identified by a segmentation process. This is an important step

25

in the analysis of a document. There are 5 classes; text (1), horizontal line (2), picture (3),

vertical line (4), and graphic (5). The database has ten numeric attributes taken from

measurements of each block, has very little noise, and has no missing data. This database

has also been altered by generating new instances.

26

4.3 Results

The results are presented in the table below. The run times in the table below are

measured in the HSFC_NN_Alg.cpp file. The number of clock cycles that pass while the

algorithm is running are measured and then divided by the number of clock cycles per

second. This time includes the Training Phase and the Performance phase of both

algorithms.

Table 2: Results on small databases Database Name Type One NN Run Time k-NN Run Time k-WNN Run Time g2c_05 HSFC 91.56 0.016 94.34 0.031 95.04 0.031

g2c_05 NN 92.04 1.281 95.14 1.375 95.12 1.359

g2c_15 HSFC 79.4 0.015 84.5 0.031 84.42 0.047

g2c_15 NN 78.52 1.281 84.22 1.375 84.22 1.375

g2c_25 HSFC 64.9 0.032 73.22 0.031 73.36 0.031

g2c_25 NN 67.2 1.281 73.84 1.375 73.84 1.375

g2c_40 HSFC 53.98 0.031 58.58 0.032 58.44 0.032

g2c_40 NN 52.68 1.281 57.98 1.391 57.98 1.375

g4c_05 HSFC 89.46 0.015 94.76 0.031 95.06 0.031

g4c_05 NN 92.2 1.265 94.74 1.407 94.56 1.39

g4c_15 HSFC 76.74 0.016 83.88 0.031 83.94 0.047

g4c_15 NN 78.38 1.281 83.98 1.407 83.68 1.406

g4c_25 HSFC 65.82 0.031 74.7 0.032 74.88 0.031

g4c_25 NN 65.86 1.297 74.74 1.359 74.5 1.344

g4c_40 HSFC 49.32 0.031 57.78 0.032 57.58 0.031

g4c_40 NN 48.02 1.266 58.44 1.375 58.08 1.375

g6c_05 HSFC 90.167866 0.031 91.08713 0.031 92.825739 0.031

g6c_05 NN 93.105516 1.281 94.664269 1.39 94.684253 1.375

g6c_15 HSFC 77.278177 0.016 80.695444 0.031 80.935252 0.032

g6c_15 NN 75.979217 1.281 84.572342 1.453 84.572342 1.438

g6c_25 HSFC 62.589928 0.032 69.744205 0.032 70.063949 0.032

g6c_25 NN 64.128697 1.343 72.941647 1.422 72.781775 1.484

g6c_40 HSFC 45.043965 0.031 54.236611 0.031 53.816946 0.031

g6c_40 NN 44.964029 1.312 56.714628 1.406 57.394085 1.39

Iris_DrG_Norm HSFC 90.979167 0.016 94.333333 0.031 94.416667 0.031

Iris_DrG_Norm NN 91.041667 1.218 93.833333 1.265 93.8125 1.328

new_abalone_500_Norm HSFC 52.176279 0.015 52.339499 0.016 51.904244 0.016

new_abalone_500_Norm NN 49.945593 1.297 52.774755 1.343 52.50272 1.344

pageblocks_Norm HSFC 71.531966 0.032 86.529956 0.032 86.489747 0.031

pageblocks_Norm NN 88.661037 2.484 90.349819 2.484 90.269401 2.5

27

In the table above the g2c_05 database the HSFC 1-NN has a running time of

0.016 seconds and the Standard NN has a running time of 1.281 seconds. This means that

the HSFC 1-NN performs the Training Phase and classifies the entire test set in 0.016

seconds. The training phase of the standard NN is simply passing the Training Data array,

so this means that it only has to classify the Test Set. It takes the Standard NN 1.281

seconds to classify the test set. This means that for the g2c_05 database the HSFC NN

gives a 98.75% decrease in the time needed to classify the test set while maintaining

nearly the same accuracy.

Table 2 shows that the HSFC NN algorithm maintains the accuracy of the NN

algorithm in almost all of the databases, and increases the accuracy in a few instances.

However, there are exceptions. In the Gaussian 6 class problems the HSFC algorithms

were outperformed by the NN algorithms by 2 to 4 %, and on the page blocks database it

was outperformed by 3 to 16%. These discrepancies are caused by the fact that the HSFC

does not save all of the spatial relationships between data; it only preserves some of them.

Because of this the HSFC will choose slightly different neighbors than the standard NN.

In most of the databases choosing different neighbors still leads to the correct

classification. However in the 6-class Gaussian database the decision boundaries cross

over the quadrant boundaries. One of the properties of the HSFC is that it will naturally

divide the data up into quadrants, the 2 and 4 class Gaussian databases are easily

separated into quadrants and because of this the performance of the HSFC NN does not

suffer. The 6-class problem however is not easily separated by quadrants and because of

this the HSFC NN has a higher possibility of choosing the wrong neighbors. This and the

higher number of dimensions are the reason for the discrepancies in the page blocks data.

28

One possible solution to this problem would be to use multiple curves, shifting each

curve slightly so that neighbors that are close in n space will be close on one of the

curves. After the three curves are created they would then vote on the classification of a

test point. This technique should decrease the error associated with missing actual

neighbors, and will be experimented with in the future.

During the cross validation phase of the experiments the algorithms were run on

validation sets to determine the optimal values of k and m. This is done by running the

algorithm multiple times each time changing the values of the parameters. The values that

produce the best accuracy are selected as the optimal values and the experiments are run

on the test set using these values. During the cross validation some observations were

made about the effect of the parameters on the accuracy of the algorithms. The figures

below show the values of k and m and there effect on the accuracy of the HSFC

algorithm. Figure 7 shows that as k increases the accuracy increases up until around

k = 9, after that the accuracy levels off and remains nearly the same for all greater values

of k. For the order of the curve m it was found that the accuracy fluctuated until an order

was reached that assigned each of the training points there own index. For lower orders of

the Hilbert curve multiple training points would be mapped to the same index. The

sorting algorithm used in the HSFC NN is a quick sort. This sorting algorithm is only

partially determinant. This means that if two values have the same index then each time

the algorithm is run their order in the HSFC could change. Once the order of the curve

was high enough to assign each training point a unique index the accuracy stopped

fluctuating. That order also produced the best accuracy, any increase in the order after

this point did not affect the accuracy of the algorithm. From these observations it was

29

determined that the best choice for the order of the curve, is the order that assigned each

training point a unique index. The selection of the quick sort as the sorting algorithm was

arbitrary, any sorting algorithm would suffice. For example a merge sort,

which is completely determinant, could also be used, and would resolve the fluctuating

accuracy at lower orders.

))(log( 2 NNO

Figure 7: Gaussian 2 class Accuracy vs. K A second set of experiments were run on the Gaussian 2-class 5 % database to

show the effects of training set size on classification time. The experiments were run

again using training set sizes of 500, 1000, 1500, and 2000. The points were obtained by

converting some of the points in the cross validation set to training points. The

30

experiments were run exactly the same as they were in the original experiments. The

results are shown in the table and figure below.

Figure 8: Run Time vs. Training set size

Figure 8 shows the run time vs. the Training Set size, as you can see the run time

increases linearly with training set size for the standard k-NN algorithm but the HSFC

NN run time is barely affected. This same experiment will be run again on a much larger

database so that the training set size can have a much larger domain, and the differences

in the run times of both algorithms will be more pronounced. The results from this second

experiment show that the HSFC NN is significantly more efficient than the standard NN.

31

4.3.1 Large Database Experiments: Forest Cover Database

To further test the speed increase of the HSFC NN experiments were run on the

Forest Cover Database. This database was obtained from the UCI KDD archive. It

consists of 581,012 instances, and 7 classes. The classification task of this database is to

determine the Forest Cover type from measurements acquired from a parcel of land. Of

the 54 attributes, 10 of them are quantitative, and the other 44 are binary representations

of 2 qualitative attributes. The HSFC has a hardware limitation, such that the number of

dimensions times the order of the curve must be less than 64 (n*m < 64) on a 32-bit

machine. In order to classify the Forest Cover Database the 44 binary attributes were

converted to two binary strings, and from there to two decimal attributes. This reduced

the number of dimensions to 12 allow the use of an order 5 curve. After this each

attribute was normalized, and the database was split into classes. To test the scale up of

both algorithms, the database was used to create data sets with training sizes ranging

from 2k – 512k in increments of , where i ranges from 1 to 9. The algorithm was then

run on each of these datasets. The results are presented in the following table.

ki2

32

Table 3: Forest Cover Results DBName Alg Type One NN Run Time k-NN Run Time k-WNN Run Time ForestCover_2k_ HSFC_NN 45.426829 0.141 48.970412 0.172 48.65054 0.188

ForestCover_2k_ NN 49.515194 94.859 45.766693 93.985 45.816673 93.797

ForestCover_4k_ HSFC_NN 50.809676 0.156 50.114954 0.203 49.930028 0.203

ForestCover_4k_ NN 58.146741 186.828 56.862255 187.797 57.477009 187.859


ForestCover_8k_ NN 64.614154 373.485 62.664934 374.5 63.834466 374.531


ForestCover_16k_ NN 65.19892 746.328 63.269692 747.5 64.414234 747.469


ForestCover_32k_ NN 68.472611 1487.843 64.32427 1489.047 65.353858 1489.093


ForestCover_64k_ NN 69.167333 2971.922 58.371651 2972.906 58.811475 2973.015


ForestCover_128k_ NN 69.537185 5952.219 56.332467 5955.594 56.467413 5953.406


ForestCover_256k_ NN 67.443023 11916.312 49.595162 11925.297 49.910036 11879.274


ForestCover_512k_ NN 62.240104 23616.265 54.328269 23577.36 54.523191 23575.984

Classification Accuracy

0

10

20

30

40

50

60

70

80

Fore

stCov

er_2

k_

Fore

stCov

er_4

k_

Fore

stCov

er_8

k_

Fore

stCov

er_1

6k_

Fore

stCov

er_3

2k_

Fore

stCov

er_6

4k_

Fore

stCov

er_1

28k_

Fore

stCov

er_2

56k_

Fore

stCov

er_5

12k_

Databases

Accu

racy HSFC NN

NN

Figure 9: Forest Cover Accuracy

33

The results from the Forest Cover experiments show that the HSFC NN gives a

dramatic speed increase, but at a cost. In most of the experiments there is a significant

reduction in the accuracy of the HSFC NN in relation to the Standard NN. For, example

the k-NN algorithm running on the 32k training set achieved an accuracy of 68.47%

compared to the HSFC k-NN with an accuracy of 60.54%. The run time for the HSFC k-

NN however is much faster than the Standard NN, at 0.266 seconds compared to 1488

seconds. The run times for each algorithm are shown in the figure below.

Figure 10: Forest Cover Database Run Times

In the case of the Forest Cover Database the speed increase from the HSFC NN

comes at a significant reduction in the accuracy. In the next section the experiments are

34

run again on both the small and large databases using multiple curves. Using multiple

curves should decrease the number of neighbors that are missed by the HSFC NN and

increase the accuracy of the HSFC NN. It is believed that this method will close the gap

between the HSFC NN and the Standard NN on difficult data sets.

4.3.2 Multiple Curve HSFC NN In this section the HSFC NN classifier is altered to use multiple curves in the

classification phase. It is believed that this will increase the accuracy, and bring the

performance of the HSFC NN closer to the performance of the Standard NN on difficult

databases. To create multiple curves, the data is shifted by one unit towards and away

from the origin. When these datasets are indexed using Butz’s algorithm, it will create

two extra curves that have different orders of the training points and that have saved

different n-space relationships.

When using multiple curves there are a number of ways to classify the points.

One method is to allow each of the three curves to classify the point independently, and

then let them vote on the classification of the point. A second method, the one

implemented in this section, is to find the closest neighbors from each of the curves, and

use the best neighbor or neighbors from each to classify the point.

The results of the altered algorithm on the Forest Cover Database, and on selected

small databases are presented in the following tables.

35

Table 4: Multi vs Single HSFC accuracy on the Forest Cover Database DBName Alg Type One NN Run

Time k-NN Run

Time k-WNN Run

Time ForestCover_2k_ Multi Curve 46.53639 0.344 43.487605 0.422 43.297681 0.438 ForestCover_2k_ Normal 47.30108 0.14 47.785886 0.234 43.932427 0.188 ForestCover_4k_ Multi Curve 52.68393 0.359 52.064174 0.406 51.064574 0.406 ForestCover_4k_ Normal 50.85966 0.156 49.445222 0.219 47.361056 0.219 ForestCover_8k_ Multi Curve 57.95682 0.422 56.597361 0.469 56.077569 0.453 ForestCover_8k_ Normal 55.69772 0.172 49.965014 0.25 49.92503 0.235 ForestCover_16k_ Multi Curve 58.30168 0.516 56.937225 0.562 56.437425 0.562 ForestCover_16k_ Normal 55.89764 0.219 56.297481 0.266 55.267893 0.266 ForestCover_32k_ Multi Curve 59.53119 0.719 58.276689 0.75 57.472011 0.735 ForestCover_32k_ Normal 57.02719 0.312 55.582767 0.328 57.027189 0.313 ForestCover_64k_ Multi Curve 58.27169 1.094 56.157537 1.125 55.437825 1.141 ForestCover_64k_ Normal 54.96801 0.453 53.253699 0.469 54.963015 0.469 ForestCover_128k_ Multi Curve 58.15174 1.89 56.902239 1.938 55.077969 1.937 ForestCover_128k_ Normal 54.95302 0.719 52.314074 0.75 54.938025 0.75 ForestCover_256k_ Multi Curve 55.20292 3.547 53.363655 3.547 51.07457 3.562 ForestCover_256k_ Normal 48.9954 1.313 47.501 1.437 48.985406 1.329 ForestCover_512k_ Multi Curve 56.72731 6.906 52.264094 6.906 49.830068 6.906 ForestCover_512k_ Normal 52.2591 2.531 50.87465 2.516 50.929628 2.579

1-NN Multi Curve vs Single Curve Accuracy

0

10

20

30

40

50

60

70

ForestC

over_2

k_

ForestC

over_4

k_

ForestC

over_8

k_

ForestC

over_1

6k_

ForestC

over_3

2k_

ForestC

over_6

4k_

ForestC

over_1

28k_

ForestC

over_2

56k_

ForestC

over_5

12k_

Databases

Accu

racy Multi-Curve

Single-Curve

Figure 11: Multi-HSFC 1-NN vs. Single-HSFC 1-NN

36

k-NN Multi Curve vs Single Curve Accuracy

0

10

20

30

40

50

60

70

ForestC

over_2

k_

ForestC

over_4

k_

ForestC

over_8

k_

ForestC

over_1

6k_

ForestC

over_3

2k_

ForestC

over_6

4k_

ForestC

over_1

28k_

ForestC

over_2

56k_

ForestC

over_5

12k_

Databases

Accu

racy Multi-Curve

Single-Curve

Figure 12: Multi-HSFC k-NN vs. Single-HSFC k-NN

The Figures above show that on the Forest Cover Database the Multi-HSFC

approach improves the accuracy for most of the databases tested. However the increases

do not bring the accuracy of the HSFC NN classifier up to the level of the Standard NN

classifier for these databases. The Standard NN classifier still outperforms the HSFC NN

Classifier even using the Multi-curve approach. The high dimensionality of the Forest

Cover database is mostly to blame for the large deficit in accuracy. As the number of

dimensions increases the number of possible neighbors within one unit from the test point

increases while the number of neighbors within one unit on the HSFC remains at two.

This means that as the number of dimensions increase it becomes less likely that the

HSFC will produce the same nearest neighbor as the Standard NN. While using the multi

curve approach did increase accuracy, in order to bring that accuracy up to the level of

37

the Standard NN it would require using many more curves. The best chance of achieving

that accuracy would be to use a curve that has been shifted towards each corner of the n-

space. The problem with this is that for each curve you create you have to store another

copy of the training data. On large databases like the Forest Cover that would not be

practical. It is possible that the other type of multi curve NN discussed earlier could

improve on the accuracy found here. The implementation of a voting curve system will

be experimented some time later.

The multi-curve experiments were also run on the small databases. It was found

that the accuracy of the 1-NN was only slightly affected changing very little for better

and worse on different databases. However it was noted that the performance of the k-NN

and Weighted k-NN suffered as a result of the multi-curve method. In general it was

found that the multi-curve method used is producing the best improvements when applied

to the 1-NN classifier.

38

Chapter 5: Conclusions on Results

The experiments conducted in the research were carried out to test the validity of

using the HSFC to decrease the runtime of the Nearest Neighbor classifier. In each and

every case the HSFC NN classifier outperformed the Standard Nearest Neighbor in terms

of speed. As the number of training points increases the difference in run time becomes

more and more pronounced. On the smaller databases, such as the Gaussian 2 class

database, the HSFC classifies the database 80.06 times faster than the Standard NN,

while on the larger datasets like the Forest Cover 512k database the HSFC classifies the

database 10642.75 times faster reducing the classification time from 6.56 hours to 2.219

seconds. These results show that the HSFC NN successfully improved on the

classification time of the Nearest Neighbor algorithm. This improvement however comes

at the cost of accuracy. On low dimensionality databases this cost in accuracy is very

minimal as the results on the small databases show; the HSFC maintains accuracy very

close to the Standard NN in most of the small databases. However, on higher

dimensionality problems, and on databases with complicated decision boundaries the

HSFC accuracy falls short of the Standard NN. The results show that the accuracy of the

HSFC NN can be improved by using multiple curves, but this technique does not

improve the accuracy enough to match the Standard NN. The accuracy could be

improved further by increasing the number of curves used to classify the data, but with

each additional curve another copy of the training data must be stored. At this point it

becomes a trade off between space and accuracy; higher accuracy can be achieved but at

39

the cost of space. Several other techniques for increasing the accuracy of the HSFC NN

will be discussed in the future work section.

Overall the experiments show that the HSFC is able to classify data reliably, in

most cases the accuracy is comparable to the Standard NN, and in each instance it gives a

significant improvement in classification time. The HSFC NN would be a good candidate

for applications where speed is the most important factor. Also in low dimensional

problems the HSFC NN is interchangeable with the Standard NN giving equivalent

results, and improved classification time.

40

Chapter 6: Implementation Issues

This chapter details the implementation of the HSFC NN and the NN classifiers,

and also discusses some of the issues that arose during the process. The code and scripts

used in the experiments can be found in Appendix B.

The HSFC code used to convert n-dimensional data to the Hilbert index was

implemented by Doug Moore of Rice University. This code will not be included in the

appendix as it is interchangeable with any function that implements Butz’s Algorithm.

This code was used for its simple interface and straight forward implementation.

The HSFC NN and the Standard NN are both implemented in a single C++ global

function. This function takes a number of parameters that are needed for each classifier to

run. Most of the parameters, such as the number of neighbors k, the training data, and the

test data are used by both classifiers. However parameters like the order of the Hilbert

curve are used only by the HSFC classifier. To control which of the two classifiers is

used there is a single Boolean argument passed as the first parameter. Both classifiers are

Nearest Neighbor classifiers, so the code used to classify each point remains the same for

both the HSFC NN and the Standard NN. The difference comes in the selection of the

neighbors. Once the algorithm is started the correct neighbor selection routine is run

according to the classifier that is selected. In the case of the HSFC NN the indexing and

sorting of the training data happens only once, at the beginning of the algorithm, and only

if the HSFC classifier is selected. After this phase is complete, the algorithm goes into the

41

main loop. This loop runs through each one of the points in the test set. Inside this loop

each point is classified, and the classification is checked against the known classification

to test if the algorithm was correct. The process of classifying a test point begins with an

if statement that checks to see which of the two classifiers is to be used. In the case of the

HSFC the test point is converted to a Hilbert index, and then a modified binary search is

run on the training set to find the test index. The binary search was modified to return the

index of either an exact match or the index of the point just after where the test index

would fall. This search would return an index on the training array. Starting with this

index and moving out away from the test point, neighbors are selected and stored in the

neighbor array. In the case of the Standard NN the distance from the test point to all the

other points in the training set are calculated and the information, pertaining to the

training points of the smallest k distances, is stored in the neighbor array. After this point

the algorithms are the same. The neighbor array is used to classify the test point

according to the type of classification being done; 1-NN, k-NN, or Weighted k-NN.

The implementation of the Standard NN is straight forward and the neighbor

selection process is simply finding the smallest distances among the training points. As

such, once this section was coded there was no need for modifications. The HSFC NN

neighbor selection however is not as straight forward, and as such a few modifications

were made to improve the performance of the classifier. When the test point’s location on

the Hilbert curve has been found, there are two possible locations for its nearest neighbor

either before or after it on the curve. In the original implementation the first neighbor was

selected as the index that was returned by the search, the next neighbor was selected as

the index that came before the test point, the rest of the neighbors were selected

42

alternating before and after the test index until the number of neighbors was reached.

This method did not take into account the distance of these neighbors from the test point

on the curve. To clarify, if the number of neighbors k was 6, then 3 neighbors were

selected before the test point on the curve and 3 neighbors were selected from after the

test point on the curve. This method is not the best choice because we are not guaranteed

that the HSFC is full. There could be any number of missing indexes between the test

point and the training points just before and after the test point in the training array. The

neighbor selection method was altered to take into account the distances on the curve. In

the new method the 1st neighbor is selected as the closer of the test point’s first neighbors

on either side. According to which one is chosen a counter is incremented and the loosing

point is tested against the next point further out from the test point on the winning side.

This method increased the accuracy of the classifier slightly by ensuring that only the

closest points on the Hilbert curve were selected as neighbors.


Neighbor

Initial HSFC Neighbor Selection

New HSFC Neighbor Selection

Figure 13: HSFC Neighbor Selection

43

The experiments were run in MATLAB; this was accomplished by compiling the

algorithm into a MATLAB DLL. This was the most difficult process of the

implementation. In order to create the MATLAB DLL requires an interface. This

interface file is used to talk between MATLAB and the NN_Alg function. It defines the

function that is called from MATLAB, and also does all of the conversions between the

MATLAB Array types to C arrays, and it takes care of error checking. When the function

is called from MATLAB, the parameters are passed to the interface file, where they are

converted to the correct types and checked to make sure that they are within the accepted

ranges. Then the interface calls the NN_Alg function with the converted parameters.

After the algorithm has finished the interface file returns the accuracy and the results

back to MATLAB. The implementation of the MATLAB DLL, while adding some

complexity to the process made it possible to quickly run experiments and generate data

using MATLAB scripts. It also made handling the databases easier by removing the need

for C++ file access routines.

The largest issue that arose from the MATLAB implementation was the fact that

once the algorithm had run MATLAB did not free up the memory used inside the

function. This meant that after many experiments had been run the computer would start

to slow down because MATLAB was taking up more and more memory. This was solved

by changing the C++ algorithm to take care of freeing up the memory manually.

In general the implementation of these algorithms is fairly simple. The actual

code is straight forward and simple to understand. This makes these classifiers much

easier to implement and run than other more complicated classifiers such as neural

networks.

44

Chapter 7: Future Research

During the course of this research many ideas for future work were brought to

light. There are a few techniques to improve the accuracy of the HSFC NN classifier that

have yet to be implemented, and an idea for a new type of classifier based on the HSFC

similar to the patterns of the ARTMAP Neural Network is being formulated

(Carpenter,1992). Also there are other classification techniques that could benefit from

reducing the dimensionality of the problem, such as the Probabilistic Neural Network

(Specht, 1990).

In this research one technique using multiple HSFC’s was implemented and

tested. It was shown that this technique improved the accuracy for the Forest Cover

database, but had mixed results with the smaller databases. Other multi-curve methods

are also available and could be implemented in the future. One such method discussed

earlier is to allow each individual curve to classify the test point independently, and then

allow them to vote on the final classification of the test point. This method is desirable

because this means that the original results obtained would still be intact within the

original curve. This technique is therefore less likely to perform worse than the single

curve implementation. It is hoped that this technique would improve the accuracy in the

k-NN and Weighted k-NN as well. Another multi-curve technique designed to improve

accuracy on high dimensionality databases, would be to split the dimensions of the

database into groups and use a different HSFC for each group of attributes. For example,

on the Forest Cover Database, the 12 attributes would be split into groups of 4, and each

45

of these three groups would be indexed onto separate curves. The curves would then vote

on the classification of each test point. This method would reduce the amount of spatial

information that is lost from the original data.

The Hilbert Decision Space Classifier is based on the clustering properties of the

HSFC. The idea is to create ranges on the HSFC that are mapped to a certain class. These

ranges would be similar to the patterns in ARTMAP Neural Networks. The training data

would be used to create these ranges, and then once the training is complete the training

data can be discarded. This classifier will be very fast because most of the work will be

done during the training phase. Once the training is complete the test points can be

classified quickly by finding the correct range. The ranges will all be in one dimension so

the correct range can be found using a binary search which again is very computationally

efficient. Also the number of ranges should be less than the number of training points so

the binary search will be over a smaller array. The major benefit of this technique,

however, would be the ability to discard the training data, and the memory savings that

would bring, especially on large databases like the Forest Cover. This new classifier

would be much more complicated to implement than the HSFC NN. The biggest

challenge in designing this technique would be to define the training phase. There are

many issues that must be taken into consideration when creating the ranges, such as, what

to do about points that fall into a range but do not agree on the classification. Should a

new range be created in the middle of the current range, or should the point be thrown out

as noise. In the end it is likely that the training phase will require multiple passes, with

the first pass taking in all of the training data, and the subsequent passes weeding out un-

needed ranges. The Hilbert Decision Space Classifier (HDSC) will likely be the focus of

46

my future research; if this technique maintains the accuracy of the HSFC NN along with

the added benefit of discarding the training data, it could turn out to be a very robust

classifier. The following figure shows a possible selection of ranges on the IRIS database.

The ranges colored blue would be class 1, and the ranges colored red would be class 2.

This range selection would be using very loose parameters as can be seen by the number

of points that are in the wrong ranges.

Figure 14: Hilbert Decision Space Classifier Example

Other classifiers that would benefit from a reduction in dimensionality could also

be implemented using the HSFC. One such classifier is the Probabilistic Neural Network.

Implementing the PNN using the HSFC would be much more complicated than the

47

Nearest Neighbor, and as such would require more time and energy. The PNN works by

estimating the probability distribution of the training data, and makes is classification

based on the Bayesian classifier. Further research is required to determine how to

implement the PNN using the HSFC.

One last avenue that might be pursued would be to develop a technique for

estimating the expected accuracy of a given database. When testing on generated

databases the optimal accuracy is known, however, on natural databases, unless all the

statistical information about the data is known, it is difficult to know what the optimal

accuracy would be. When multi-dimensional data is mapped onto the Hilbert curve it is

possible to see relationships between the classes that are impossible to graph in multiple

dimensions. We hope to develop a technique to estimate the expected accuracy of data by

mapping the data to the Hilbert curve and determining the overlap of the classes. This

would be useful to many researchers working on Natural databases.

The Hilbert Space Filling Curve has many qualities that make is useful in

classification algorithms. It reduces the dimensionality of the data while still maintaining

some of the spatial relationships, as well as naturally clustering the data. These properties

will undoubtedly lead to many new applications for the HSFC.

48

References

Arthur R. Butz., “Alternative Algorithm for Hilbert's Space-Filling Curve”. IEEE

Transactions on Computers, 20:424-426, April 1971.

Carpenter, G. A., Grossberg, S., Markuzon, N., Reynolds, J. H., "Fuzzy ARTMAP: A

neural network architecture for incremental supervised learning of analog multi-

dimensional maps," IEEE Transactions on Neural Networks, Vol. 3, No. 5, pp.

698-713, 1992.

Castro, Jose; Georgiopoulos, Michael; Demara, Ronald; Gonzalez, Avelino; “Data-

Partitioning using the Hilbert Space Filling Curves: Effect on the Speed of

Convergence of Fuzzy ARTMAP for Large Database Problems,” Neural

Networks; accepted for publication.,2004

Friedman, J.; Baskett, F.; Shustek, L. “An Algorithm for finding nearest neighbors,”

IEEE Transactions on Computers, Vol. 24, 1000-1006, 1975

T. Cover, P. Hart, “Nearest Neighbor Pattern Classification,” IEEE Trans. on Information

Theory, IT – 13:21-27, 1967

J.K.Lawder; P.J.H.King. “Using Space-filling Curves for Multi-dimensional Indexing.”

Advances in Databases: proceedings of the 17th British National Conference on

Databases (BNCOD 17), volume 1832 of Lecture Notes in Computer Science,

pages 20 - 35. Springer Verlag, July 2000

J.K.Lawder; P.J.H.King. Using State Diagrams for Hilbert Curve Mappings.

International Journal of Computer Mathematics 78(3): 327-342 (2001) (formerly

Research Report BBKCS-00-02 or JL2/00, August 2000)

49

J.K.Lawder. "Calculation of Mappings Between One and n-dimensional Values Using the

Hilbert Space-filling Curve." Research Report BBKCS-00-01

(formerly JL1/00), August 2000 http://www.dcs.bbk.ac.uk/~jkl/publications.html

Skubalska-Rafajlowicz, E.; Krzyzak, A., “Fast k-NN classification rule using metric on

space-filling curves,” Pattern Recognition, 1996, Proceedings of the 13th

International Conference on, Volume: 2, Pages: 121 – 125, 25-29 Aug. 1996

Specht, D. F., "Probabilistic Neural Networks", Neural Networks, Vol. 3, pp. 109-118,

1990.

50

http://www.dcs.bbk.ac.uk/%7Ejkl/publications.html

Documents

HILBERT SPACE FILLING CURVE (HSFC) NEAREST NEIGHBOR CLASSIFIER · the use of space filing curves to index the training data. In Skubalska-Rafajlowicz, the data was indexed using Peano