
VERTICAL TOTAL VARIATION FOR DEVELOPING A

SCALABLE NEAREST NEIGHBOR CLASSIFIER

A Dissertation Submitted to the Graduate Faculty

of the North Dakota State University

of Agricultural and Applied Science

By

Taufik Fuadi Abidin

In Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

Major Department: Computer Science

May 2006

Fargo, North Dakota


This page is intentionally left blank for approval sheet


ABSTRACT

Abidin, Taufik Fuadi, Ph.D., Department of Computer Science, College of Science and Mathematics, North Dakota State University, May 2006. Vertical Total Variation for Developing a Scalable Nearest Neighbor Classifier. Major Professor: Dr. William Perrizo.

Recent advances in computer power, network, information storage, and multimedia

have led to a proliferation of stored data in various domains, such as bioinformatics, image

analysis, the World Wide Web, networking, banking, and retailing. This explosive growth

of data has opened the need for developing efficient and scalable data-mining techniques

that are capable of processing and analyzing large datasets. In data mining, classification is

one of the important functionalities. Classification involves predicting the class label of

newly encountered objects using feature attributes of a set of pre-classified objects. The

classification result can be used to understand the existing objects in the dataset and to

understand how new objects are grouped. In this dissertation, we focus our work on

classification, more precisely on a scalable classification algorithm. We propose an

efficient and scalable nearest neighbor classification algorithm that efficiently filters the

candidate neighbors by creating a total variation contour around the unclassified object.

The objects within the contour are considered as the superset of nearest neighbors. These

neighbors are identified efficiently using P-tree range query algorithm without having to

scan the total variation values of the training objects one by one. The proposed algorithm

further prunes the neighbor set by means of dimensional projections. After pruning, the

k-nearest neighbors are searched from the pruned neighbor set. The proposed algorithm

uses P-tree vertical data structure, one choice of vertical representation that has been

experimentally proven to address the curse of scalability and to facilitate efficient data

mining over large datasets. An efficient and scalable Vertical Set Squared Distance (VSSD)

is used to compute total variation of a set of objects about a given object. The efficiency


and scalability of the proposed algorithm are demonstrated empirically through

experimentation using both real-world and synthetic datasets. The application of the

proposed algorithm in image categorization is also discussed. Finally, the step-by-step

integration of the proposed algorithm into DataMIMETM as a prototype of a new nearest

neighbor classification algorithm that uses P-tree technology is also reported.


ACKNOWLEDGMENTS

First of all, I would like to express my sincere gratitude and appreciation to my

adviser and research supervisor, Dr. William Perrizo, for his strong support, constructive

comments, suggestions, and encouragement, which have brought me to a high level of research accomplishment and enabled me to successfully complete this degree.

I would also like to gratefully acknowledge my supervisory committee, Dr. D.

Bruce Erickson, Dr. Akram Salah, and Dr. Xiwen Cai, for their valuable advice and

comments. Thanks also to Amal Perera and Masum Serazi for their friendship and much help in assisting me to understand the P-tree API. I am also grateful to Ranapratap Syamala for his willingness to check the language. Last but not least, many thanks also go to all

DataSURG members for numerous stimulating research discussions.


DEDICATION

To my wonderful children, Alif, Zafir, and Jaza, and my lovely wife, Ridha, whose

love and patience have instilled in me the spirit to complete my doctoral degree.

To my father, Abidin, and my mother, Salmah, who have waited for so long for this

time to come, and my sisters, Rina, Nurul, and their families, who have always been the true supporters in my life.


TABLE OF CONTENTS

ABSTRACT .......................................................................................................................iii

ACKNOWLEDGMENTS......................................................................................................v

DEDICATION.......................................................................................................................vi

LIST OF TABLES..................................................................................................................x

LIST OF FIGURES...............................................................................................................xi

CHAPTER 1. INTRODUCTION..........................................................................................1

CHAPTER 2. CLASSIFICATION........................................................................................7

2.1. Overview.............................................................................................................7

2.2. Classification Algorithms...................................................................................8

2.2.1. Support Vector Machine.............................................................................8

2.2.2. Naïve Bayesian Classifiers.........................................................................9

2.2.3. Decision Tree Classifiers..........................................................................10

2.2.4. K-Nearest Neighbor Classifiers................................................................11

CHAPTER 3. P-TREE VERTICAL DATA STRUCTURE................................................14

3.1. Introduction.......................................................................................................14

3.2. The Construction of P-Trees.............................................................................15

3.3. P-Tree Operations.............................................................................................15

CHAPTER 4. VERTICAL APPROACH FOR COMPUTING TOTAL VARIATION.....18

4.1. Introduction.......................................................................................................18

4.2. The Proposed Approach....................................................................................19

4.2.1. Vertical Set Squared Distance..................................................................19

4.2.2. Retaining Count Values............................................................................22

4.2.3. Complexity Analysis.................................................................................25

4.3. Performance Analysis.......................................................................................28

4.3.1. Datasets.....................................................................................................28

4.3.2. Run Time and Scalability Comparison.....................................................30

4.4. Conclusion........................................................................................................33

CHAPTER 5. SMART-TV: AN EFFICIENT AND SCALABLE NEAREST NEIGHBOR BASED CLASSIFIER....................................................................................34

5.1. Introduction.......................................................................................................34


5.2. Hyper Parabolic Graph of Total Variations......................................................35

5.3. The Proposed Algorithm...................................................................................39

5.3.1. Preprocessing Phase..................................................................................40

5.3.2. Classifying Phase......................................................................................41

5.3.3. Detailed Description of the Proposed Algorithm......................................42

5.4. Illustrative Examples of the Pruning Technique...............................................47

5.5. Weighting Function..........................................................................................50

5.6. Performance Analysis.......................................................................................52

5.6.1. Datasets.....................................................................................................53

5.6.2. Parameterization.......................................................................................56

5.6.3. Classification Accuracy Comparison........................................................57

5.6.4. Classification Time Comparison...............................................................64

5.7. Conclusion........................................................................................................68

CHAPTER 6. THE APPLICATION OF THE PROPOSED ALGORITHM IN IMAGE CLASSIFICATION..................................................................................................................69

6.1. Introduction.......................................................................................................69

6.2. Image Preprocessing.........................................................................................70

6.3. Experimental Results........................................................................................72

6.3.1. An Example on Corel Dataset...................................................................73

6.3.2. Classification Accuracy............................................................................75

6.3.3. Classification Time Comparison...............................................................77

6.4. Conclusion........................................................................................................78

CHAPTER 7. INTEGRATING THE PROPOSED METHOD INTO DATAMIMETM......80

7.1. Introduction.......................................................................................................80

7.2. Server-Side Components..................................................................................81

7.3. Client-Side Components...................................................................................82

7.4. Graphical User Interface...................................................................................82

CHAPTER 8. CONCLUSION AND FUTURE WORK.....................................................87

8.1. Conclusion........................................................................................................87

8.2. Future Work......................................................................................................89

REFERENCES.....................................................................................................................91

APPENDIX ......................................................................................................................97

A.1. SmartTVApp Class...........................................................................................97

A.2. SmartTV Header Class...................................................................................101


A.3. SmartTV Class................................................................................................103

A.4. Makefile..........................................................................................................116


LIST OF TABLES

Table Page

1. Example dataset................................................................................................................22

2. The count values of each class....................................................................................24

3. The specification of the machines..............................................................................28

4. Time for VSSD to compute all count values..............................................................31

5. The average time to compute the total variations under different machines..............32

6. Loading time comparison...........................................................................................33

7. Class distribution of KDDCUP dataset......................................................................55

8. Classification accuracy on the KDDCUP dataset for k = 3........................................58

9. Classification accuracy on the KDDCUP dataset for k = 5........................................58

10. Classification accuracy on the KDDCUP dataset for k = 7........................................59

11. Classification accuracy on the WDBC dataset for k = 3............................................59

12. Classification accuracy on the WDBC dataset for k = 5............................................59

13. Classification accuracy on the WDBC dataset for k = 7............................................59

14. Classification accuracy comparison on the OPTICS dataset for k = 3.......................60

15. Classification accuracy comparison on the OPTICS dataset for k = 5.......................61

16. Classification accuracy comparison on the OPTICS dataset for k = 7.......................61

17. Classification accuracy on the Iris dataset for k = 3...................................................62

18. Classification accuracy on the Iris dataset for k = 5...................................................62

19. Classification accuracy on the Iris dataset for k = 7...................................................62

20. Average classification accuracy for k = 3..................................................................63

21. Average classification accuracy for k = 5..................................................................63

22. Average classification accuracy for k = 7..................................................................63

23. Run time and scalability comparison on the RSI dataset............................................66

24. Preprocessing time of SMART-TV algorithm on the RSI dataset.............................66

25. Average classification time.........................................................................................67

26. Classification accuracy comparison using k = 3, k = 5, and k = 7.............................76

27. Preprocessing time of SMART-TV algorithm on the Corel dataset...........................78


LIST OF FIGURES

Figure Page

1. Example of maximized and non-maximized margins..................................................9

2. A decision tree............................................................................................................11

3. The 1-dimensional P-trees from attribute A1..............................................................17

4. Algorithm to get the count values...............................................................................25

5. Algorithm to compute TV(X, a)..................................................................................27

6. The original image of the RSI dataset........................................................................29

7. Time trend for computing the total variations............................................................32

8. Graph of .........................................................................................37

9. Graph of .......................................................................38

10. Graph of ......................................................................................40

11. The pre-image of the contour of interval [g(b), g(c)] creates a Nbrhd(a, e)...............42

12. P-tree range query algorithm......................................................................................43

13. Algorithm to create a contour mask............................................................................43

14. An illustration of the dimensional projection contour................................................44

15. Pruning algorithm.......................................................................................................46

16. Pruning the neighbor set using dimensional projections............................................47

17. Pruning example 1......................................................................................................48

18. Pruning example 2......................................................................................................49

19. Weighting function ....................................................................................51

20. Weighting function .......................................................................................51

21. Weighting function ....................................................................................52

22. Run time and scalability comparison on the RSI dataset............................................65

23. Classification time on the KDDCUP dataset..............................................................67

24. The classes of the images............................................................................................73

25. Example using Corel dataset with pruning.................................................................74

26. Classification time, k = 5, e = 0.01, and MS = 1000...................................................78

27. Code segments of SMART-TV in the Predictor class................................................82

28. Graphical user interface for mining with SMART-TV algorithm..............................83


29. Graphical user interface showing the classification results........................................84

30. Graphical user interface showing the vote histogram.................................................85

31. Graphical user interface showing the performance of a validation............................86


CHAPTER 1. INTRODUCTION

Recent advances in computer power, network, information storage, and multimedia

have led to a proliferation of stored data in various domains like bioinformatics, image

analysis, the World Wide Web, networking, banking, and retailing. This fast growth of

stored data has opened the need for developing efficient and scalable data-mining

techniques that can extract valuable and interesting information from a large volume of

data.

Data mining, or knowledge discovery in databases, has emerged due to the explosive

growth of data. It is a non-trivial process of extracting interesting and potentially valuable

patterns in a large volume of data [1]. Data-mining functionalities can be divided into three

broad categories: association rules mining (ARM), clustering, and classification. ARM

discovers interesting association or correlation relationships among objects in the databases

that match the support and confidence thresholds. A common application of ARM is

market basket analysis, which analyzes the correlation between customers’ purchasing

habits and the data items that the customers purchased. The results of the analysis can help

the decision makers in designing catalogs, arranging shelves, and deciding appropriate

marketing processes.


Clustering can be defined as the process of grouping a set of data objects to

discover meaningful clusters such that the objects within the same cluster are more similar

to one another but dissimilar to the objects across clusters. Clustering is also known as

unsupervised learning because the class label of each data instance in the dataset does not

exist. Cluster analysis is useful in understanding the distribution of the data and is often

used as the preprocessing step for other data-mining algorithms operating on the detected

clusters [1].

Classification, in contrast, is a process of assigning a class label to unclassified

objects based on some notion of similarity between the data objects in the training set and

the unclassified object. Because the training set of pre-classified objects is used to

supervise the classification process, classification is also known as supervised learning. The

first step in classification is to build a model or a classifier, and the second step is to predict

the class membership of new data instances using the classifier. Often, before the classifier

is used to classify the real samples, the accuracy of the classifier is estimated. To do this

estimation, a set of testing samples with known class labels that is independent from the

training set is created. The accuracy of the classifier is measured based on the percentage of

the correct assignments of the testing samples.


This dissertation is based on the research projects and papers published in

[2, 3, 4, 5, 6]. The work is focused on classification, one of the data-mining functionalities.

Classification is commonly used in various domains such as bioinformatics, image

analysis, spatial databases, and banking. In bioinformatics, a classification model is used to

predict the functions of the newly discovered genes based on the functions of a collection

of well annotated genes. In the Sorcerer II oceanographic expedition headed by J.C. Venter

[7] for example, the group of researchers discovered at least 1,800 new species of bacteria

and more than 1.2 million new genes from about 200 liter samples of ocean water collected

in the Sargasso Sea near Bermuda. These new genes need to be classified in order to

understand their behavior and group. To classify such a large number of new genes, efficient

and scalable classification techniques are needed.

Many excellent studies in classification have been conducted. Vapnik [8]

introduced the Support Vector Machine (SVM) classification algorithm that transforms the

input space into a higher dimensional feature space with a nonlinear mapping (kernel

function). With an appropriate mapping, SVM creates a hyperplane (decision boundary)

such that the distance between the hyperplane and the closest samples (support vectors)

from the two classes is maximized. Once the maximum hyperplane is determined, the class

label of the new sample is a matter of deciding on which side of the hyperplane the new sample lies. The ability to determine the maximum-margin hyperplane has made SVM often

achieve good classification accuracy. However, SVM is very slow in training because it has

to find the hyperplane with the largest margin, it always treats classification problems as binary classification problems, and it does not scale to very large training sets [9, 10].


Cover et al. [11] introduced what is called the nearest neighbor classifier. Nearest

neighbor classifiers are fascinating because of their simplicity and generality in modeling a

wide range of problems. The classifiers search the nearest neighbors in the training set in a

brute-force fashion and assign a class label to the unclassified sample based on the plurality

of categories of the nearest neighbors. The search is repeated for every new instance. In the

case of large training sets, brute-force search for nearest neighbors is very expensive and

tedious.

The work proposed in [12] introduced an algorithm to accelerate the k-nearest

neighbor search for image retrieval problems. In retrieval problems, the goal is to

determine the nearest neighbors of the query object, while in the classification problems,

the goal is to find the nearest neighbors that will determine the class label of the

unclassified object. In the algorithm, the distance between the query object and data objects

in the training set is accumulated by scanning the dimensional projections one by one. The

assumption is that, after scanning a few of them, the partial distance to the query object is

known and the lower and upper bounds of the complete distance can be estimated.

Subsequently, the data objects that are outside the estimated bounds are pruned. The

process is repeated until the candidate set contains exactly k objects or all dimensions have

been scanned. Good acceleration time was reported to search for the nearest neighbors.

However, when the database is very large, the time to scan the dimensions to estimate the

partial distance will be significant. In addition, the algorithm was designed specifically for

accelerating nearest neighbor search in content-based image retrieval problems.

Another strategy commonly used to accelerate the nearest neighbor search for

classification and retrieval problems is to use additional data structure such as k-d tree. K-d

tree hierarchically decomposes the space into relatively small cells such that the cells


contain a small number of objects. The objects are then accessed quickly by traversing down

the tree. Studies show that k-d tree can reduce the searching complexity from O(n) to

O(log n) since the object is not exhaustively compared with every object in the space.

However, k-d tree is efficient only for small datasets in the range of thousands to hundreds

of thousands, and its performance degrades in high dimensions [13].

Most of the state-of-the-art classification algorithms use the traditional horizontal

record based structuring to represent the data. The use of traditional horizontal record and

sequential scan based approach are known to scale poorly to very large data repositories

[14]. Jim Gray of Microsoft, in his talk at ACM SIGMOD 2004 [15], emphasized that vertical, column-based structuring, as opposed to horizontal, row-based structuring, can speed up query processing and ensure scalability. Jin et al. [16] proposed a solution for the scalability

of data mining algorithms through a parallelization approach, which distributes the

processing across several high-performance clusters. However, according to [17], a parallelization approach alone is inadequate to solve the problem of scalability because the data volume grows much faster than CPU processing speed.

In this dissertation, we propose a new and scalable nearest neighbor based

classification algorithm that efficiently filters the candidate neighbors by means of

vertical total variation. The vertical total variation of a set about a given object is computed

using a new, efficient, and scalable Vertical Set Square Distance (VSSD) algorithm, which

will be discussed in detail in Chapter 4. The proposed algorithm employs P-tree vertical

data structure, one choice of vertical representation that has been experimentally proven to

address the curse of scalability and to facilitate efficient data mining over large datasets.

This vertical data structure was introduced by Perrizo [18].


The proposed algorithm is a nearest neighbor based classification algorithm, which

will be discussed in detail in Chapter 5. Unlike the k-nearest neighbor classification

algorithm where the k-nearest neighbors are searched from the entire training set, in the

proposed algorithm, the nearest neighbors are searched from a pruned neighbor set. This

neighbor set is obtained by forming a total variation contour around the unclassified object.

The objects within this contour are considered as the superset of nearest neighbors, which

can be identified efficiently using P-tree range query algorithm without the need to scan the

total variation values of the training objects one by one. An efficient pruning technique

using dimensional projections is also introduced to prune the superfluous neighbors in the

neighbor set so that after pruning, the k-nearest neighbors are searched from a small set of

neighbors. The efficiency, scalability, and effectiveness of the proposed algorithm are

demonstrated empirically using both real-world and synthetic datasets. In particular, datasets of sizes up to ninety-six million are used to evaluate the run time and scalability of

the algorithm.

We extend the work by applying the proposed algorithm to the image classification

problem. Chapter 6 is an attempt to demonstrate that the proposed algorithm can be used

for image classification task, which in general uses large image repositories and represents

image feature vectors in a high dimensional space. The feature vectors are constructed from

color distribution, image texture, image structure (shape) or their combination. In this

work, we extract both color distribution and image texture features to represent the images

and observe the performance of the proposed algorithm in image classification task.

We integrate the proposed algorithm into DataMIMETM, the P-tree based data

mining system, as a prototype of a new vertical nearest neighbor classification algorithm.

The discussion about the integration process and snapshots of the graphical user interface


will be presented in Chapter 7. The code of the algorithm is enclosed in the appendix and

can be downloaded from http://www.cs.ndsu.nodak.edu/~datasurg/codes.php3.

Finally, we conclude this dissertation in Chapter 8 by summarizing the main

contributions of the work and presenting some directions for future work.


CHAPTER 2. CLASSIFICATION

2.1. Overview

Classification is one of the data-mining functionalities. Classification involves

predicting the class label of newly encountered objects using feature attributes of a set of

pre-classified objects. The predictive pattern from the classification result can be used to

understand the existing objects in the databases and to understand how new objects are

grouped [19]. The goal of classification is to determine the class membership of the

unclassified objects.

Classification in data mining has much in common with the classification done by

machine learning and statistics communities. The main difference is in the cardinality of

the datasets. In data mining, the cardinality of the data is assumed to be very large, generated from many sources such as satellite images, sales, and microarray data, which now reach terabytes in size [20]. This has made scalability a major issue and a precondition for the success of any algorithm nowadays.

Many classification techniques have been introduced in the literature. We will

review some of them in the next section. The classification algorithms can be compared

based on several criteria such as scalability, speed, accuracy, and robustness. In this

dissertation, we compare the proposed algorithm in terms of scalability, speed, and

accuracy. The scalability refers to the ability of the algorithm to run, given large amount of

data, the speed is the reasonable amount of time needed to finish the classification task, and

the accuracy refers to the ability of the algorithm to correctly predict the class membership

of the new instance [1].


2.2. Classification Algorithms

In general, classification algorithms can be divided into two subgroups. The first

group is the classification algorithms that construct a model from the set of labeled objects

in the training set. These algorithms are known as eager classifiers. Much of the time in

these algorithms is invested in the learning phase to generate a general model from the

training set, and the classification is just a matter of using the generated model. Examples

of these classification methods are Support Vector Machine, Bayesian classifiers, Neural

Net, and decision tree classifiers. The second group is the lazy classification algorithms that

invest no effort in the learning phase but put all effort into the classification phase. An example of this type of algorithm is the k-nearest neighbor classifier. The k-nearest neighbor

classifiers find the most similar objects to the new unclassified object in the training set,

and classify the new object into the most common class among these most similar objects.

These most similar objects are usually called the nearest neighbors. In the following

sections, we will briefly summarize some of the classification algorithms.

2.2.1. Support Vector Machine

Support Vector Machine (SVM) is a well-known classification technique [8]. SVM

transforms the input space into a higher dimensional feature space with kernel function and

creates a hyperplane that separates the binary classes such that the distance between the hyperplane and the support vectors in each class is maximized (Figure 1). SVM has been

validated experimentally to often achieve good accuracy. However, SVM does not scale to

very large training sets, and the overall performance of the SVM algorithm largely depends

on the choice of the kernel function. One example case of expensive training of SVM can

be found in [10]. In this work, the SVM algorithm takes about 2.85 hours to learn from a


training set of 1,000 points in a 2-dimensional space (the checkerboard dataset), running on a 400 MHz Pentium II Xeon machine with 2 gigabytes of memory.

Figure 1. Example of maximized and non-maximized margins.

2.2.2. Naïve Bayesian Classifiers

Naïve Bayesian classifiers are statistical classifiers based on the Bayesian theorem

[21]. The class label of the new instance is predicted based on the probability to which

class the new instance should belong. Let X be the new instance whose class label is

unknown, H be the hypothesis such that the instance X belongs to a specific class C, and

P(H) be the prior probability of H for any instance; the objective is to determine P(H|X),

the posterior probability of H given X using the probabilities P(H), P(X), and P(X|H). The

posterior probability can be estimated using the Bayesian theorem:

$P(H|X) = \dfrac{P(X|H)\, P(H)}{P(X)}$

The naïve Bayesian classifiers take the above relation to estimate a class label of a

new instance, X. Let C1, C2, ..., Cm be the class labels in the given training set. The class


label of X can be estimated using the highest conditional probability, P(Ci|X), such that P(Ci|X) > P(Cj|X) for all j ≠ i, where P(Ci|X) is obtained from the Bayesian theorem as follows:

$P(C_i|X) = \dfrac{P(X|C_i)\, P(C_i)}{P(X)}$

P(X) can be considered as a constant for all classes, while P(Ci) can be estimated from the training set, i.e., the number of samples in class Ci divided by the total number of training samples. The naïve Bayesian classifiers take an assumption of conditional independence between the attributes and estimate P(X|Ci) as the product of the probabilities of each attribute value in the new instance, X, given class Ci.

Although the naïve Bayesian classifiers theoretically have the minimum error rate compared to the other classifiers, in practice this is not always the case because of inaccuracies in the assumption of class-conditional independence between the attributes [21]. The computational cost

of the naïve Bayesian classifiers lies in the complexity of computing the probability values, which can be very expensive for large training sets and high dimensionality.
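For illustration only (this sketch is mine, not one of the classifiers evaluated in this dissertation), the following C++ fragment estimates P(Ci) and P(xj|Ci) from integer-coded categorical training data by simple counting, applies a basic Laplace correction, and predicts the class with the largest product P(Ci) * prod_j P(xj|Ci); P(X) is omitted since it is constant across classes.

#include <cstddef>
#include <map>
#include <vector>

// Minimal categorical naive Bayesian classifier (illustrative sketch).
struct NaiveBayes {
    std::map<int, double> classCount;                               // samples per class
    std::map<int, std::vector<std::map<int, double>>> valueCount;   // class -> attribute -> value -> count
    std::size_t numAttributes = 0;
    double total = 0;

    void train(const std::vector<std::vector<int>>& X, const std::vector<int>& y) {
        numAttributes = X.empty() ? 0 : X[0].size();
        for (std::size_t s = 0; s < X.size(); ++s) {
            classCount[y[s]] += 1;
            total += 1;
            auto& perAttr = valueCount[y[s]];
            perAttr.resize(numAttributes);
            for (std::size_t a = 0; a < numAttributes; ++a)
                perAttr[a][X[s][a]] += 1;                           // count of value X[s][a] in class y[s]
        }
    }

    // Return the class Ci maximizing P(Ci) * prod_a P(x_a | Ci).
    int classify(const std::vector<int>& x) const {
        int best = -1;
        double bestScore = -1.0;
        for (const auto& c : classCount) {
            double score = c.second / total;                        // P(Ci)
            const auto& perAttr = valueCount.at(c.first);
            for (std::size_t a = 0; a < numAttributes; ++a) {
                auto it = perAttr[a].find(x[a]);
                double cnt = (it == perAttr[a].end()) ? 0.0 : it->second;
                score *= (cnt + 1.0) / (c.second + perAttr[a].size() + 1.0);   // Laplace-smoothed P(x_a | Ci)
            }
            if (score > bestScore) { bestScore = score; best = c.first; }
        }
        return best;
    }
};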

2.2.3. Decision Tree Classifiers

The concept of decision trees was initially introduced by Quinlan [22]. In a decision

tree classifier, the training set is split into smaller subsets based on attribute values using a

split rule. The internal nodes of the tree represent the decision rules while the leaf nodes of

the tree represent the predicted class labels. The unclassified sample is classified by

traversing the tree starting at the root. An evaluation about the attribute of the unclassified

sample is made at each internal node to determine the next branch. The class label of the

unclassified sample is the leaf node where the tree traversal ends. A simple example of a

decision tree adopted from [23] is shown in Figure 2.


Figure 2. A decision tree.

Most decision tree algorithms, such as ID3, C4.5, and CART, work well for

relatively small datasets. Their efficiency becomes questionable when applied to real-world

large datasets since the training set will not fit into memory. A more recent decision tree algorithm, CLOUD [24], was proposed for large datasets and introduced a new mechanism for splitting the attributes. However, the proposed splitting

method requires at least one pass over the training set, which also can be very expensive as

the training set size grows.

2.2.4. K-Nearest Neighbor Classifiers

Assigning classes to unclassified samples based on the nearest neighboring samples has been investigated since 1967 [11]. The classification scheme can be summarized as follows: Given a training set of data objects in d-dimensional space, the k-nearest neighbor

(KNN) classifiers assign a category to an unclassified object based on the plurality of

categories of the k-nearest neighbors.

KNN classifiers do not build the model in advance. Instead, they search for the

k-nearest neighbors directly from the training set. The closeness is defined in terms of a

distance function, e.g., Euclidean distance. KNN classifiers often produce good classification accuracy in some cases. However, a potential drawback of KNN classifiers is


that the classification time will be significant when the size of the training set is very large.

Searching through the training set to find the k-nearest neighbors can be a time-consuming

process. The complexity of finding the k-nearest neighbors in a brute-force manner is

O(nm) for each unclassified sample since the classifiers have to visit each of the n training

objects and perform m operations to calculate the distance [25]. This high complexity

makes the approach impractical for applications involving very large datasets.
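To make that cost concrete, the sketch below (added for illustration; it is not the implementation compared against in later chapters) is a plain brute-force KNN classifier: it visits all n training objects and does O(m) work per Euclidean distance, which is exactly the O(nm) cost per unclassified sample noted above.

#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Brute-force KNN: scan every training object, keep the k closest, vote by plurality.
int knnClassify(const std::vector<std::vector<double>>& train,
                const std::vector<int>& labels,
                const std::vector<double>& query, std::size_t k) {
    std::vector<std::pair<double, int>> dist;           // (squared distance, label)
    dist.reserve(train.size());
    for (std::size_t i = 0; i < train.size(); ++i) {    // n training objects
        double d2 = 0;
        for (std::size_t j = 0; j < query.size(); ++j)  // m attributes per distance
            d2 += (train[i][j] - query[j]) * (train[i][j] - query[j]);
        dist.emplace_back(d2, labels[i]);
    }
    std::size_t kk = std::min(k, dist.size());
    std::partial_sort(dist.begin(), dist.begin() + kk, dist.end());
    std::map<int, int> votes;                           // plurality vote among the k nearest
    for (std::size_t i = 0; i < kk; ++i) ++votes[dist[i].second];
    int best = -1, bestVotes = -1;
    for (const auto& v : votes)
        if (v.second > bestVotes) { bestVotes = v.second; best = v.first; }
    return best;
}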

Many excellent studies have been done to make KNN classifiers scalable, such as

those reported in [26, 27]. Khan et al. [26] introduced a P-tree based k-nearest neighbor

classifier (P-KNN), which uses P-tree vertical structure to accelerate the classification time

and uses Higher Order Bit Similarity (HOBBIT) as the similarity metric. As the name

implies, the similarity is measured based on the number of consecutive bits in the higher-order positions that are in common. Formally, the HOBBIT similarity between integers A and B is

defined as follows:

HOBBIT(A, B) = max{s | 0 ≤ i ≤ s ⇒ ai = bi}

where ai and bi are the bits at the ith position of integers A and B. The distance between A and B is

then defined as:

where n is the number of dimensions, and m is the number of bits used to represent the

integer values A and B. In order to use the HOBBIT metric, all dimensions must be

represented using the same number of bits.
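The following small sketch (my illustration, written under the stated assumption that both operands use the same m bits) computes this similarity for a single dimension by counting the consecutive most-significant bits that two integers share.

#include <cstdint>

// HOBBIT-style similarity: number of consecutive most-significant bits that
// two m-bit integers have in common (both are assumed to use the same m bits).
int hobbitSimilarity(std::uint32_t a, std::uint32_t b, int m) {
    int s = 0;
    for (int bit = m - 1; bit >= 0; --bit) {            // scan from the most significant bit
        if (((a >> bit) & 1u) != ((b >> bit) & 1u)) break;
        ++s;                                             // bits match so far
    }
    return s;                                            // s == m means A equals B in all m bits
}

// Example: with m = 3, hobbitSimilarity(6 /*110*/, 7 /*111*/, 3) == 2,
// while hobbitSimilarity(6 /*110*/, 2 /*010*/, 3) == 0.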

The closed k-nearest neighbor set (closed-KNN set) was also introduced in the P-KNN algorithm. The closed-KNN set is a superset of the k-nearest neighbor set that takes in all neighbors equidistant to the kth nearest neighbor. According to Khan, the closed-KNN set can


improve classification accuracy [26]. P-tree technology requires no additional computation

to determine the closed-KNN set. On the other hand, the classical KNN classifiers require

an additional scan over the training set to find the closed-KNN set. This additional scan can

be very expensive when the training set is very large.

P-KNN works as follows: It builds a neighborhood ring around the unclassified

sample and successively expands the ring until at least k-nearest neighbors are found. The

ring expansion grows by ignoring one least significant bit at a time. Experiments show that

P-KNN is fast and accurate on spatial data. However, the neighborhood ring produced by

HOBBIT similarity cannot evenly expand from the unclassified object when a bit is

ignored, which can consequently move the center of the ring away from the unclassified

object.

An alternative approach to alleviate the uneven expansion of the neighborhood is to

use an Equal Interval Neighborhood ring (EIN-ring) approach, which builds the ring

around the unclassified object equally. However, this approach has additional

computational overheads, i.e., it requires many logical AND and OR operations to form an

equal neighborhood ring.

Podium Incremental Neighbor Evaluator (PINE) [27] is another P-tree based

k-nearest neighbor classifier. PINE allows all data objects in the training set to vote, but

each of them is weighted in a podium fashion based on their distance to the unclassified

object. In PINE, the HOBBIT metric is also used as the distance metric, and the Gaussian

function is used as the podium function.


CHAPTER 3. P-TREE VERTICAL DATA STRUCTURE

3.1. Introduction

Vertical data representation represents and processes the data differently from

horizontal data representation. In vertical data representation, the data is structured column

by column and processed horizontally through logical AND or OR operations, while in

horizontal data representation, the data is structured row by row and processed vertically

through scanning or using some notion of index. P-tree vertical data structure [18] is one

choice of vertical data representation. This vertical data structure is used for data

presentation and processing in this dissertation. P-tree vertical data structure was invented

in 2001 and was primarily used for representing spatial data vertically [26, 27]. However,

since then, the P-tree has been intensively exploited in various domains and data mining

algorithms, ranging from classification, clustering, association rule mining to outlier

analysis [3, 4, 28, 29, 30, 31]. In September 2005, P-tree technology was patented in the

United States by North Dakota State University, patent number 6,941,303.

P-tree vertical data structure is a lossless, compressed, and data-mining ready data

structure. P-tree is lossless because the vertical bit-wise partitioning guarantees that the

original data values can be retained completely. P-tree is compressed because when the

segments of bit sequences are either pure-1 or pure-0, they can be represented in a single

bit. P-tree is data-mining ready because it addresses the curse of cardinality or the curse of

scalability, one of the major issues in data mining. P-tree vertical data structure has been

used in various data mining algorithms and has been experimentally proven to have great

potential to address the curse of scalability.


3.2. The Construction of P-Trees

P-tree can be formed directly from binary data as well as from categorical and

numerical data. The categorical and numerical data are typically organized in a relational

table containing several attributes. The construction of P-tree vertical data structure is

started by converting the dataset, normally arranged in a relation R(A1, A2,…, Ad) of

horizontal records, into binary. Each attribute in the relation is vertically partitioned

(projected), and for each bit position in the attribute, vertical bit sequences (containing 0s

and 1s) are subsequently created. During partitioning, the relative order of the data is

retained to ensure convertibility. In 0-dimensional P-trees, the vertical bit sequences are left

uncompressed and are not constructed into predicate trees. The size of 0-dimensional

P-trees is equal to the cardinality of the dataset. In 1-dimensional compressed P-trees, the

vertical bit sequences are constructed into predicate trees. In this compressed form, AND

operations can be accelerated. The 1-dimensional P-trees are constructed by recursively

halving the vertical bit sequences and recording the truth of “purely 1-bits” predicate in

each half. A predicate 1 indicates that the bits in that half are all 1s, and a predicate 0

indicates otherwise. To indicate the P-tree, two subscripts are used. The first subscript

indicates the attribute to which the P-tree belongs, and the second subscript indicates the bit

position of that attribute. Consider Figure 3 to get some insights on how 1-dimensional

P-trees of a single attribute A1 are constructed.
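To make the vertical partitioning concrete, the following sketch (an illustration written for this text, not the P-tree capture routine itself) slices one unsigned integer attribute into its b uncompressed vertical bit sequences, i.e., the 0-dimensional P-trees described above; the compressed 1-dimensional P-trees would then be built by recursively halving each sequence and recording the pure-1 predicate.

#include <cstddef>
#include <cstdint>
#include <vector>

// Slice an integer attribute column into b vertical bit sequences.
// bits[j][t] is bit j (0 = least significant) of the t-th value;
// the relative record order is preserved so the data remain convertible.
std::vector<std::vector<std::uint8_t>>
verticalBitSlices(const std::vector<std::uint32_t>& column, int b) {
    std::vector<std::vector<std::uint8_t>> bits(b, std::vector<std::uint8_t>(column.size(), 0));
    for (std::size_t t = 0; t < column.size(); ++t)
        for (int j = 0; j < b; ++j)
            bits[j][t] = (column[t] >> j) & 1u;
    return bits;
}

// Example: for A1 = {4, 2, 2, 7, 5, 1, 6, 3} and b = 3, the most significant
// slice is 1,0,0,1,1,0,1,0, whose count of 1 bits is 4, consistent with Figure 3.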

3.3. P-Tree Operations

As opposed to horizontal record structure in which data are processed vertically

through scanning, in P-tree vertical data structure, data are processed horizontally through

logical operations such as AND and OR. These logical operations are extremely fast, and


thus, any data mining functionality that facilitates these operations can be performed

extremely fast. COMPLEMENT, a unary operator that flips every bit into its negation can

also be applied on P-trees vertical structure. The range queries, values, or any other patterns

can be obtained using a combination of these Boolean algebraic operators.

Besides AND, OR, and COMPLEMENT, another powerful aggregate operation is a

COUNT. The COUNT operation is very important in P-tree vertical structure, which

counts the number of 1s in the basic or complement P-tree. For example, when using

P-trees P11, P12, and P13 from Figure 3, the count values resulted from COUNT(P11),

COUNT(P12), and COUNT(P13) are 4, 5, and 4 respectively. The COUNT operation has

been implemented in the P-tree API [32]. In fact, it is the main operation exploited in the

Vertical Set Squared Distance algorithm, which will be discussed in Chapter 4. Detailed

information about the P-tree data structure and its operations can also be found in [33].
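As a minimal illustration of these operations on uncompressed bit sequences (not the compressed P-tree API of [32]), the sketch below performs a bit-wise AND of two sequences and counts the 1 bits; applied to the slices in Figure 3, ptreeCount would return 4, 5, and 4 for P11, P12, and P13, as noted above.

#include <cstddef>
#include <cstdint>
#include <vector>

using BitSeq = std::vector<std::uint8_t>;   // one uncompressed vertical bit sequence

// Bit-wise AND of two sequences of equal length.
BitSeq ptreeAnd(const BitSeq& p, const BitSeq& q) {
    BitSeq r(p.size());
    for (std::size_t t = 0; t < p.size(); ++t) r[t] = p[t] & q[t];
    return r;
}

// COUNT: number of 1 bits in a sequence.
std::size_t ptreeCount(const BitSeq& p) {
    std::size_t c = 0;
    for (std::uint8_t b : p) c += b;
    return c;
}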


Figure 3. The 1-dimensional P-trees from attribute A1.

[Figure 3 shows attribute A1 with values 4, 2, 2, 7, 5, 1, 6, 3, its three uncompressed vertical bit slices (labeled P13 = 10011010, P12 = 01110011, and P11 = 00011101), and the corresponding compressed 1-dimensional P-trees obtained by recursively halving each bit slice.]


CHAPTER 4. VERTICAL APPROACH FOR COMPUTING

TOTAL VARIATION1

4.1. Introduction

The rapid growth of data poses great challenges and generates an urgent need for

efficient and scalable algorithms that can deal with massive datasets. In this chapter, we

propose a vertical approach for computing set squared distance that measures the total

variation of a set of objects about a given object in vector space. The total variation is very

useful in classification, which will be demonstrated in Chapter 5, and clustering to identify

the cluster boundary and determine the cluster membership [28]. Set squared distance,

defined as $\sum_{x \in X} (x - a) \circ (x - a)$, measures the total variation or the cumulative squared

separation of a set of vectors in X about a given vector a, denoted as TV(X, a).

Since scalability is becoming a major issue nowadays due to the availability of large

volume of datasets, any new algorithms should be able to handle large datasets. In this

chapter, we focus on the scalability of the proposed approach. The proposed approach

employs P-tree vertical data structure that organizes the data vertically and processes it

horizontally through fast and efficient logical AND, OR, or NOT operations. Using P-tree

vertical data structure, the need for repeatedly scan the dataset, as commonly done in

horizontal record-based approach, can be avoided. We will demonstrate the scalability of

the proposed approach empirically through several experiments.

Throughout the chapter, we use the term “VSSD” to refer to Vertical Set Squared

Distance, a vertical approach for computing total variation, and use the term “HSSD” to

1 This chapter is a modified version of a paper which appears in the International Journal of Computers and their Applications (IJCA), vol. 13, no. 2, pp. 94-102, June 6, 2006.


refer to Horizontal Set Squared Distance, a horizontal approach for computing total

variation. The horizontal approach uses a scanning approach to compute the total variation.

4.2. The Proposed Approach

4.2.1. Vertical Set Squared Distance

Let R(A1, A2, …, Ad) be a relation in d-dimensional space. A numerical value, v, of

attribute Ai can be written in b bits binary representation as follows:

$v = \sum_{j=0}^{b-1} v_{i,j}\, 2^{j}$, where $v_{i,j}$ can either be 0 or 1 (1)

The first subscript corresponds to the attribute to which v belongs, and the second

subscript corresponds to the bit order. The summation in the right-hand side of the equation

is equal to the numerical value of v in base 10.
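For example (an added illustration), with b = 3, the value v = 6 of an attribute is represented by the bits $v_{i,2} = 1$, $v_{i,1} = 1$, and $v_{i,0} = 0$, since

$6 = 1 \cdot 2^{2} + 1 \cdot 2^{1} + 0 \cdot 2^{0},$

which corresponds to the bit string 110 shown later in Table 1.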

Now let x be a vector in d-dimensional space. The binary representation of x in b

bits can be written as follows:

$x = (x_1, x_2, \ldots, x_d)$, where $x_i = \sum_{j=0}^{b-1} x_{i,j}\, 2^{j}$ (2)

Let X be a set of vectors in a relation R, x ∈ X, and a be the vector being examined.

The total variation of X about a, denoted as TV(X, a), can be measured quickly and scalably

using vertical set squared distance as follows:

$TV(X, a) = \sum_{x \in X} (x - a) \circ (x - a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2$ (3)


$TV(X, a) = \sum_{x \in X} \sum_{i=1}^{d} x_i^2 \;-\; 2 \sum_{x \in X} \sum_{i=1}^{d} x_i a_i \;+\; \sum_{x \in X} \sum_{i=1}^{d} a_i^2 \;=\; T_1 + T_2 + T_3$ (4)

Commuting the summation of x ∈ X further inside to first process all vectors that belong to

X vertically, and then, process each attribute horizontally, we get

$T_1 = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k} \sum_{x \in X} x_{i,j}\, x_{i,k}$ (5)

Let PX be a P-tree mask of set X that can quickly identify data objects in X. PX is a

bit pattern containing 1s and 0s, where bit 1 indicates that an object at that bit position

belongs to X, while 0 indicates otherwise. Using the mask, equation (5) can be simplified

by substituting $\sum_{x \in X} x_{i,j}\, x_{i,k}$ with $\mathrm{COUNT}(P_X \wedge P_{i,j} \wedge P_{i,k})$. Recall the aggregate COUNT

operation will count the number of bit 1 in the pattern. Hence, the simplified form of

equation (5) can be written as follows:

$T_1 = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k}\; \mathrm{COUNT}(P_X \wedge P_{i,j} \wedge P_{i,k})$ (6)

Similarly, for terms T2 and T3, we derive the solutions for the terms as shown in equations (7) and (8), respectively.

$T_2 = -2 \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j}\; \mathrm{COUNT}(P_X \wedge P_{i,j})$ (7)

$T_3 = \mathrm{COUNT}(P_X) \sum_{i=1}^{d} a_i^2$ (8)


Hence, VSSD is defined to be

$TV(X, a) = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k}\; \mathrm{COUNT}(P_X \wedge P_{i,j} \wedge P_{i,k}) \;-\; 2 \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j}\; \mathrm{COUNT}(P_X \wedge P_{i,j}) \;+\; \mathrm{COUNT}(P_X) \sum_{i=1}^{d} a_i^2$ (9)

Now, let us consider X as the relation R itself. In this case, the mask PX can be

removed since all objects now belong to R. Then equation (9) can be written as

$TV(R, a) = \sum_{i=1}^{d} \sum_{j=0}^{b-1} \sum_{k=0}^{b-1} 2^{j+k}\; \mathrm{COUNT}(P_{i,j} \wedge P_{i,k}) \;-\; 2 \sum_{i=1}^{d} a_i \sum_{j=0}^{b-1} 2^{j}\; \mathrm{COUNT}(P_{i,j}) \;+\; |X| \sum_{i=1}^{d} a_i^2$ (10)

where |X| is the cardinality of R.

Furthermore, note that the aggregate COUNT functions in both equations (9) and (10) are independent of the input vector a. This independence of the COUNT operations is really an advantage because, once the count values are obtained, they can be reused every

time the total variation is computed. This reusability will expedite the computation of total

variation significantly.
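To make this reuse concrete, the following C++ sketch (illustrative only; the array layout, names, and types are assumptions rather than the dissertation's implementation) caches the three count collections once and then evaluates equation (9) for any query vector a without touching the data again:

  #include <cmath>
  #include <cstddef>
  #include <vector>

  // Precomputed counts for one set X (see Section 4.2.2):
  //   c1                   = COUNT(PX)
  //   c2[i*b + j]          = COUNT(PX & P(i,j))
  //   c3[(i*b + j)*b + k]  = COUNT(PX & P(i,j) & P(i,k))
  struct VssdCounts {
      std::size_t d;              // number of feature attributes
      std::size_t b;              // bit width per attribute
      long long c1;
      std::vector<long long> c2;  // size d*b
      std::vector<long long> c3;  // size d*b*b
  };

  // Evaluate TV(X, a) from the cached counts only (equation (9));
  // no scan of the training data is needed per query point.
  double totalVariation(const VssdCounts& cnt, const std::vector<double>& a) {
      double t1 = 0.0, t2 = 0.0, t3 = 0.0;
      for (std::size_t i = 0; i < cnt.d; ++i) {
          double sum2 = 0.0;      // sum_j 2^j * COUNT(PX & P(i,j))
          for (std::size_t j = 0; j < cnt.b; ++j) {
              for (std::size_t k = 0; k < cnt.b; ++k)
                  t1 += std::ldexp(1.0, int(j + k)) *
                        double(cnt.c3[(i * cnt.b + j) * cnt.b + k]);
              sum2 += std::ldexp(1.0, int(j)) * double(cnt.c2[i * cnt.b + j]);
          }
          t2 += a[i] * sum2;      // contributes to T2
          t3 += a[i] * a[i];      // contributes to T3
      }
      return t1 - 2.0 * t2 + double(cnt.c1) * t3;
  }

The cost per query is therefore a fixed number of multiplications and additions, independent of the number of training objects.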

4.2.2. Retaining Count Values

We will discuss the strategy to retain the count values in this section. Let us

consider an example dataset containing 10 data points as shown in Table 1. The dataset has

two feature attributes: X and Y, and a class attribute containing two values: C1 and C2. The

binary values of each data point are included in the table for clarity. The last two columns

on the right-side of the table are the masks of each class, denoted as PX1 and PX2. In this

example, we want to measure the total variations of each class about a given point, and


thus, the dataset is subdivided into two sets. The first set is a collection of data points in

class C1, and the second set is a collection of data points in class C2.

Table 1. Example dataset.

X Y CLASS X in Binary Y in Binary PX1 PX2

7 6 C1 111 110 1 0

2 6 C2 010 110 0 1

6 3 C1 110 011 1 0

3 3 C2 011 011 0 1

3 4 C2 011 100 0 1

7 5 C1 111 101 1 0

7 2 C1 111 010 1 0

4 5 C2 100 101 0 1

1 4 C2 001 100 0 1

6 5 C2 110 101 0 1

The count values are stored separately in three files. Assume that the dataset is subdivided into several sets; then the first file contains the count values of COUNT(PX), the second file contains the count values of COUNT(PX \wedge P_{i,j}), and the third file contains the count values of the COUNT(PX \wedge P_{i,j} \wedge P_{i,k}) operations. Conversely, if the dataset is considered as a single set, the first file contains the cardinality of the dataset, the second file contains the counts of the basic P-trees, COUNT(P_{i,j}), and the third file contains the count values of COUNT(P_{i,j} \wedge P_{i,k}). The count

values in each file are organized accordingly in appropriate order. For example, for


COUNT(PX) operation, the count value of the first set is stored first, followed by the count

value of the second set, and so forth. Similarly for COUNT(PX \wedge P_{i,j}), the count values of the first set are stored first, followed by the next set, and so forth. For each set, the total number of count values in this file is equal to the sum of the bit widths of the dimensions (d \cdot b when every dimension uses b bits). This number is needed to correctly retrieve the values of each set when a total variation is computed. The same strategy is also used for storing the count values of COUNT(PX \wedge P_{i,j} \wedge P_{i,k}); for each set, the total number of count values in that file is equal to the sum of the squared bit widths of the dimensions (d \cdot b^2). Table 2 summarizes the

count values of class C1 and C2, and Figure 4 shows the algorithm to get the count values.

Table 2. The count values of each class.

CLASS   i   j   COUNT(PX^Pij^Pik) for k = 2, 1, 0   COUNT(PX^Pij)   COUNT(PX)
C1      1   2              4, 4, 3                         4              4
C1      1   1              4, 4, 3                         4              4
C1      1   0              3, 3, 3                         3              4
C1      2   2              2, 1, 1                         2              4
C1      2   1              1, 3, 1                         3              4
C1      2   0              1, 1, 2                         2              4
C2      1   2              2, 1, 0                         2              6
C2      1   1              1, 4, 2                         4              6
C2      1   0              0, 2, 3                         3              6
C2      2   2              5, 1, 2                         5              6
C2      2   1              1, 2, 1                         2              6
C2      2   0              2, 1, 3                         3              6

Subscript i represents the index of the attributes, while subscripts j and k represent the bit positions.
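As a quick sanity check on Table 2 (this worked example uses only the values already shown in Table 1), consider class C1 and attribute X (i = 1). The C1 objects have X values 7, 6, 7, and 7, i.e., 111, 110, 111, and 111 in binary. All four objects have bit 2 set, so COUNT(PX1 \wedge P_{1,2} \wedge P_{1,2}) = COUNT(PX1 \wedge P_{1,2}) = 4, and only the object with X = 6 has bit 0 cleared, so COUNT(PX1 \wedge P_{1,2} \wedge P_{1,0}) = 3, matching the first row of the C1 block.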

ALGORITHM: GetCounts()
INPUT: P-tree set P(i,b-1), ..., P(i,1), P(i,0) for each attribute i, and the mask px
OUTPUT: Count values stored in c3, c2, c1
// n is the number of attributes
// b is the bit width
// px is the P-tree mask of set X
for i = 0 to n-1
  for j = 0 to b-1
    for k = 0 to b-1
      rc = COUNT(p(i,j) & p(i,k) & px)   // counts used in equation (6)
      c3.insert(rc)
    endfor
    rc = COUNT(p(i,j) & px)              // counts used in equation (7)
    c2.insert(rc)
  endfor
endfor
c1.insert(COUNT(px))                     // count used in equation (8)

Figure 4. Algorithm to get the count values.


4.2.3. Complexity Analysis

The cost of VSSD lies in the computation of count values. When datasets are

subdivided into several subsets, the complexity is O(kdb^2), where k is the number of subsets, d is the number of feature dimensions, and b is the average bit length. However, when the entire dataset is considered as a single set, the complexity is reduced to O(db^2).

The choice whether to consider the entire dataset as a single set or subdivide it into several

subsets depends on the situation. In classification tasks, the set will be the entire training

set, whereas in clustering tasks, the sets are the clusters [28].

Moreover, the complexity to compute the total variation using VSSD is constant, O(1), with respect to the number of objects in the dataset. This constant complexity is obtained because the same count values can be reused for any given input values. It is just a matter of taking the right count values and solving equation (9) or (10), without the aggregate function COUNT anymore. For example, let a = (2, 3) be the vector being examined, and let the count values of classes C1 and C2 be as listed in Table 2. The total variation of class C1 about a, denoted as TV(C1, a), can be computed as

follows:

TV(C1, a) = 2^4 \cdot 4 + 2^3 \cdot 4 + 2^2 \cdot 3 + 2^3 \cdot 4 + 2^2 \cdot 4 + 2^1 \cdot 3 + 2^2 \cdot 3 + 2^1 \cdot 3 + 2^0 \cdot 3 +
            2^4 \cdot 2 + 2^3 \cdot 1 + 2^2 \cdot 1 + 2^3 \cdot 1 + 2^2 \cdot 3 + 2^1 \cdot 1 + 2^2 \cdot 1 + 2^1 \cdot 1 + 2^0 \cdot 2 -
            2 \cdot (2 \cdot (2^2 \cdot 4 + 2^1 \cdot 4 + 2^0 \cdot 3) + 3 \cdot (2^2 \cdot 2 + 2^1 \cdot 3 + 2^0 \cdot 2)) +
            4 \cdot (2^2 + 3^2)
          = 105

Similarly, the total variation of class C2 about a can be computed as follows:


TV(C2, a) = 2^4 \cdot 2 + 2^3 \cdot 1 + 2^2 \cdot 0 + 2^3 \cdot 1 + 2^2 \cdot 4 + 2^1 \cdot 2 + 2^2 \cdot 0 + 2^1 \cdot 2 + 2^0 \cdot 3 +
            2^4 \cdot 5 + 2^3 \cdot 1 + 2^2 \cdot 2 + 2^3 \cdot 1 + 2^2 \cdot 2 + 2^1 \cdot 1 + 2^2 \cdot 2 + 2^1 \cdot 1 + 2^0 \cdot 3 -
            2 \cdot (2 \cdot (2^2 \cdot 2 + 2^1 \cdot 4 + 2^0 \cdot 3) + 3 \cdot (2^2 \cdot 5 + 2^1 \cdot 2 + 2^0 \cdot 3)) +
            6 \cdot (2^2 + 3^2)
          = 42

Figure 5 shows the algorithm to compute the total variation using VSSD. The inputs of the algorithm are the retained count values, the set identifier X, and the vector a, while the output is the total variation TV(X, a).


ALGORITHM: TV(X, a)
INPUT: The retained count arrays C1, C2, C3, the set identifier X, and the vector a
OUTPUT: The total variation of set X about a
// n is the number of attributes
// b is the bit width
// interval2 = n x b
// interval3 = n x b^2
// X=0 means the entire set, X=1 means 1st class, etc.
// C1, C2, and C3 are the arrays recording COUNT values
// a(i,j) denotes bit j of attribute i of the vector a
T1 = 0
T2 = 0
T3 = 0
indexC2 = 0
indexC3 = 0
for i = 0 to n-1
  SumA = 0
  Sum1 = 0
  Sum2 = 0
  for j = 0 to b-1
    for k = 0 to b-1
      Sum1 = Sum1 + 2^(j+k) * C3.at(X * interval3 + indexC3)
      indexC3 = indexC3 + 1
    endfor
    SumA = SumA + 2^j * a(i,j)
    Sum2 = Sum2 + 2^j * C2.at(X * interval2 + indexC2)
    indexC2 = indexC2 + 1
  endfor
  T1 = T1 + Sum1
  T2 = T2 + Sum2 * SumA
  T3 = T3 + SumA * SumA
endfor
T2 = T2 * (-2)
T3 = T3 * C1.at(X)
RETURN (T1 + T2 + T3)

Figure 5. Algorithm to compute TV(X, a).

4.3. Performance Analysis

In this section, we report the performance analysis. The objective is to compare the

efficiency and scalability between VSSD employing a vertical approach (vertical data


structure with horizontal bitwise operations) and HSSD utilizing a horizontal approach

(horizontal data structure with vertical scans). HSSD is defined as shown in equation (3).

Both VSSD and HSSD were implemented using the C++ programming language. The

programming application interface for P-tree vertical technology, P-Tree API [32], was

incorporated in the implementation of VSSD. The performance of both approaches was

observed under several different machine specifications, including an SGI Altix

CC-NUMA machine, as listed in Table 3.

Table 3. The specification of the machines.

Machine Specification Memory

AMD AMD Athlon K7 1.4GHz processor 1.0 GB

P4 Intel P4 2.4GHz processor 3.8 GB

SGI Altix SGI Altix CC-NUMA 12 processors Shared Memory (12 x 4 GB)

4.3.1. Datasets

The datasets were taken from a set of aerial photographs from the Best Management

Plot (BMP) of Oakes Irrigation Test Area (OITA) near Oakes, North Dakota, located near longitude 97° 42' 18" W; the photographs were taken in 1998. The images contain three bands: red, green, and

blue reflectance values. The values are between 0 and 255, which in binary numbers can be

represented using 8 bits. The original image is of size 1024x1024 pixels (having cardinality

of 1,048,576) and is depicted in Figure 6. Corresponding synchronized data for soil moisture,

soil nitrate, and crop yield were also used. Crop yield was selected to be a class attribute.

Combining all bands and synchronized data, we obtained a dataset with 6 dimensions

(5 feature attributes and 1 class attribute).


Additional datasets with different cardinality were synthetically generated from the

original dataset to study the speed and scalability of the methods. Both speed and

scalability were evaluated with respect to dataset size. Due to the relatively small cardinality of the original dataset (1,048,576 records), we super-sampled the dataset using a simple image processing tool to produce five other

larger datasets, each having cardinality of 2,097,152, 4,194,304 (2048x2048 pixels),

8,388,608, 16,777,216 (4096x4096 pixels), and 25,160,256 (5016x5016 pixels). We

categorized the crop yield attribute into four different categories to simulate various subsets

in the datasets. The categories are: low yield having intensity between 0 and 63, medium

low yield having intensity between 64 and 127, medium high yield having intensity between

128 and 191, and high yield having intensity between 192 and 255.

Figure 6. The original image of the RSI dataset.


4.3.2. Run Time and Scalability Comparison

Our first observation was to evaluate the performance of VSSD and HSSD when

running on different machines. We discovered that VSSD is significantly faster than HSSD

on all machines. VSSD takes only 0.0004 seconds on an average to compute the total

variations of each set (low yield, medium low yield, medium high yield, and high yield)

about 5 tested points. As discussed before in Section 4.2.3, the cost for VSSD lies in the

computation of count values. However, this computation is extremely fast because the

COUNT operations are simply counting the number of 1s in the patterns. We discovered

that for each dataset, the aggregate function COUNT was executed 1,280 times, derived from 4 x 5 x 8^2, which matches the complexity of computing the count values, O(kdb^2).

Table 4 summarizes the amount of time needed for VSSD to run all COUNT

operations on different datasets and machines. Notice that when running on AMD machine,

VSSD only needs 0.4 seconds on an average to finish a single COUNT operation for the

dataset of size 25,160,256, while on P4 machine, VSSD only needs 0.15 seconds on an

average to finish a single COUNT operation for the same dataset. The COUNT operation

was even faster when VSSD was running on SGI Altix. It takes 183.81 seconds to complete

all 1,280 COUNT operations, or on an average, it takes only 0.14 seconds to complete a

single COUNT operation.

The computation of total variation is really fast for VSSD once the count values are

obtained. It is a matter of taking the appropriate count values and completing the

summation as shown in equation (9) but without the COUNT operations. We only report

the time to compute the total variations when VSSD was running on AMD machine

because the same time was also found on the other machines.


Table 4. Time for VSSD to compute all count values.

Dataset Sizes        Time (Seconds)
                     AMD        P4         SGI Altix

1,048,576 14.57 5.05 4.39

2,097,152 36.32 11.05 9.19

4,194,304 75.89 24.03 21.73

8,388,608 147.79 49.69 50.25

16,777,216 305.22 97.59 121.73

25,160,256 513.98 192.07 183.81

In contrast, HSSD takes more time to compute the total variations. The time is linear in the dataset size and differs on every machine. For example, on AMD

machine, HSSD takes 79.86 and 132.17 seconds on average to compute the total variations

for the datasets of size 16,777,216 and 25,160,256 respectively. On P4 machine, HSSD

takes 98.62 and 155.06 seconds on average to compute the total variations for the same

datasets. The same phenomenon was also found when HSSD was running on SGI Altix

machine. The average time to compute the total variations for the dataset of size

16,777,216 is twice the time to compute the total variations for the dataset of size

8,388,608. Table 5 shows the average time to compute the total variations on different

machines, and Figure 7 illustrates the time trend. The time in the table shows a clear

advantage in using the proposed approach.

It is important to note that the significant disparity of time to compute the total

variations between VSSD and HSSD is due to the capability of VSSD to reuse the same

count values once the count values are computed. As a result, VSSD tends to have a

constant time when computing total variations even when the dataset sizes are varied. On


the other hand, HSSD must scan the datasets each time the total variations are computed.

Thus, the time to compute a total variation is linear to the cardinality of the datasets.

Table 5. The average time to compute the total variations under different machines.

Dataset Sizes        Average Time to Compute the Total Variations (Seconds)
                              HSSD                                    VSSD
                     AMD        P4         SGI Altix                  AMD

1,048,576 5.30 6.14 6.79 0.0004

2,097,152 10.58 12.27 13.84 0.0004

4,194,304 18.40 24.73 27.64 0.0004

8,388,608 36.85 50.15 55.10 0.0004

16,777,216 79.86 98.62 109.76 0.0004

25,160,256 132.17 155.06 164.95 0.0004

[Figure: computation time (seconds) versus number of tuples (x 1024^2) for VSSD on AMD and for HSSD on AMD, P4, and SGI Altix.]

Figure 7. Time trend for computing the total variations.

Our second observation is to compare the time to load the datasets into memory.

We discover that when the datasets are organized in P-tree vertical structure, the time to


load the datasets is shorter than when the datasets are organized in horizontal structure (see Table 6). The reason is that when the datasets are organized in P-tree vertical

structure, they are stored in binary. Hence, they can be loaded efficiently. On the other

hand, when the datasets are organized in horizontal structure, which is not in binary format,

it takes more time to load them to memory.

Table 6. Loading time comparison.

Dataset Sizes        Average Loading Time (Seconds)
                     Loading P-trees and Count Values        Loading Horizontal Datasets
                     AMD       P4       SGI Altix            AMD        P4        SGI Altix

1,048,576 0.11 0.04 0.04 31.65 24.12 25.92

2,097,152 0.25 0.09 0.06 63.22 48.21 51.96

4,194,304 0.47 0.16 0.12 118.61 97.87 103.98

8,388,608 0.95 0.35 0.26 243.84 202.69 208.61

16,777,216 1.87 0.67 0.55 489.59 389.96 415.43

25,160,256 3.33 0.95 0.82 784.57 588.27 625.33

4.4. Conclusion

In this chapter, we have introduced a vertical approach for computing total variation

and evaluated its performance. The results show that VSSD is fast and scalable to compute

total variation on very large datasets. The independence of the COUNT operations from the input value makes the computation of total variation using the vertical approach extremely fast. The

proposed approach is scalable due to the use of P-tree vertical structure, which structures

the data vertically and processes it horizontally through logical AND or OR operations.


CHAPTER 5. SMART-TV: AN EFFICIENT AND SCALABLE

NEAREST NEIGHBOR BASED CLASSIFIER2

2 This chapter is a modified version of a published paper which appeared in the Proceedings of the 21st ACM Symposium on Applied Computing (Data Mining Track) (SAC-06), pp. 536-540, Dijon, France, April 23-27, 2006, with a slightly different title.

5.1. Introduction

Classification on large datasets has become one of the most important research

priorities in data mining due to the large volume of data currently available. Classification

involves predicting the class label of newly encountered objects using feature attributes of a

set of pre-classified objects.

K-nearest neighbor (KNN) classifier is the most commonly used neighborhood-based classifier due to its simplicity, robustness, and good performance. Given a training set, the KNN classifier does not build a model in advance like decision tree induction [24], Neural Network [34], and Support Vector Machine [8, 9, 10]; instead, it defers all of the classification effort until a new instance arrives. The classification decision is then made locally based on the features of the new instance. The KNN classifier searches for the most similar

objects in the training set and assigns a class label to the new instance based on the

plurality class of the k-nearest neighbors. The similarity or closeness between the training objects and the new instance is determined using a distance measure, e.g., Euclidean distance. Studies have shown that the KNN classifier performs well on various datasets. However, when the training set is very large, i.e., millions of objects, the classification time increases linearly with the size of the training set.

In this chapter, we propose an efficient and scalable nearest neighbor classification

algorithm, called SMART-TV. The proposed algorithm finds the candidates of neighbors

by forming a total variation contour around the unclassified object. The objects within the

contour are then considered as the superset of nearest neighbors. This set of neighbors is identified efficiently using the P-tree range query algorithm without having to scan the total variation values of the training objects. The proposed algorithm further prunes the neighbor set using a novel pruning technique based on dimensional projections. After pruning, the k-nearest neighbors are searched from the pruned set, and they then vote to determine the class label of the unclassified object.

In the preprocessing phase, the total variation function is applied to each training

object, and derived P-trees of these functional values are created. The derived P-trees are

used to efficiently determine the superset of neighbors in the contour. We empirically show

that the proposed algorithm is efficient and scalable to large datasets. In particular, a

dataset with up to ninety-six million objects is used to evaluate the run time and scalability of the

proposed algorithm.

The remainder of the chapter is organized as follows: in Section 5.2, we discuss the

graph of the total variations. In Section 5.3, we delineate the proposed algorithm in detail,

followed by two illustrative examples of the pruning technique in Section 5.4. We briefly

discuss the weighting functions used for voting in Section 5.5. We report the performance

analysis in Section 5.6, and finally, we summarize the concluding remarks in Section 5.7.

5.2. Hyper Parabolic Graph of Total Variations

Let R(A1,…,Ad, C) be a training space, and X(A1,…,Ad) = R[A1,…,Ad] be the features

subspace, and TV(X, a) be the total variation of X about a. The graph of the total variation is a hyper-parabolic graph that is always minimized at the mean (μ). The following proof shows that the total variation graph is always minimized at the mean.


Let

f(a) = TV(X, a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2 = \sum_{i=1}^{d} \left( \sum_{x \in X} x_i^2 - 2 a_i \sum_{x \in X} x_i + |X| \, a_i^2 \right)

The above equation clearly shows that f(a) is parabolic in each dimension, a_i. Now, we examine the first partial derivative of f(a), \partial f(a) / \partial a_i, to determine the minimum value of f(a) by fixing the dimension:

\frac{\partial f(a)}{\partial a_i} = -2 \sum_{x \in X} x_i + 2 |X| \, a_i

Let |X| be the total number of objects in X and μ_i be the mean of X in dimension i. The summation \sum_{x \in X} x_i can be simplified as |X| μ_i, thus

\frac{\partial f(a)}{\partial a_i} = -2 |X| \mu_i + 2 |X| a_i = 2 |X| (a_i - \mu_i)

From the above observation, it is clear that \partial f(a) / \partial a_i = 0 when a_i = μ_i. Therefore, f(a) is always minimized over the mean in all dimensions.

Figure 8 illustrates the total variation graph of equally distributed data objects in a 2-

dimensional space. From the graph, it is clear that the minimum value is at the mean.


[Figure: 3-D surface plot of the total variation over a 2-dimensional domain; the surface is parabolic and attains its minimum at the mean.]

Figure 8. Graph of TV(X, a).

Now let f(a) = TV(X, a) - TV(X, μ). Recall that the total variation is defined to be

TV(X, a) = \sum_{x \in X} \sum_{i=1}^{d} (x_i - a_i)^2 = \sum_{i=1}^{d} \sum_{x \in X} x_i^2 - 2 \sum_{i=1}^{d} a_i \sum_{x \in X} x_i + |X| \sum_{i=1}^{d} a_i^2

Since \sum_{x \in X} x_i = |X| \mu_i, we obtain the following equation:

f(a) = TV(X, a) - TV(X, \mu) = |X| \sum_{i=1}^{d} (a_i - \mu_i)^2 = |X| \, \|a - \mu\|^2


Figure 9 shows the graph of f(a) = TV(X, a) - TV(X, μ). The shape of the graph is exactly similar to the shape of TV(X, a) as illustrated in Figure 8. The only difference is that when a = μ, the value of the function is f(a) = 0.

[Figure: 3-D surface plot over the same 2-dimensional domain as Figure 8; the surface has the same parabolic shape, but its minimum value at the mean is 0.]

Figure 9. Graph of f(a) = TV(X, a) - TV(X, μ).

5.3. The Proposed Algorithm

The proposed algorithm finds the candidates of neighbors by creating a total

variation contour around the unclassified object. The objects within the contour are

considered as the superset of nearest neighbors (candidate set). These neighbors are then


pruned before the k-nearest neighbors are searched from the set. After pruning, the

k-nearest neighbors vote to determine the class label of the unclassified object.

Let R(A1,…,Ad, C) be a training space, X(A1,…,Ad) = R[A1,…,Ad] be the feature subspace, TV(X, a) be the total variation of the feature subspace X about a, and f(a) be the function defined as f(a) = TV(X, a) - TV(X, μ), where μ is the vector mean of X. In the preprocessing phase, the function is applied to each training object, and derived P-trees of those functional values are created to incorporate a fast and efficient way to determine the candidates of neighbors. Since in large training sets the values of f(a) can be very large, and representing these large values in binary would require an unnecessarily large number of bits, we define g(a) = ln(f(a)) to reduce the bit width.

We examine the gradient of g at a by fixing each dimension. We find that the gradient is zero if and only if a = μ, and the gradient length depends only on the length of the vector a - μ. This indicates that the isobars of g are hyper-circles centered at the mean. Note that to avoid the singularity when a = μ, we add a constant 1 to the function f(a), such that g(a) = ln(f(a) + 1). The graph of g(a) is shown in Figure 10.

The proposed algorithm consists of two phases: preprocessing and classifying. In

the preprocessing phase, the process is conducted only once while in the classifying phase,

the processes are repeated for every unclassified object.


[Figure: 3-D surface plot of g(a) over the same 2-dimensional domain; the surface is bowl shaped with its minimum value, 0, at the mean.]

Figure 10. Graph of g(a) = ln(f(a) + 1).

5.3.1. Preprocessing Phase

In the preprocessing phase, we compute g(x), x ∈ X, and create derived P-trees of

the functional values g(x). The derived P-trees are stored together with the P-trees of the

dataset. The complexity of this computation is O(n) since the computation is applied to all

objects in X. Furthermore, because the vector mean μ is used in the function, the vector μ must be computed first.

We compute the vector mean efficiently using P-tree vertical structure. An

aggregate function COUNT is used to count the number of 1s in the bit pattern such that

the sum of 1s in each vertical bit pattern is acquired first. The following formula shows how to compute the element of the vector mean at dimension i vertically:

\mu_i = \frac{1}{|X|} \sum_{j=0}^{b-1} 2^j \, COUNT(P_{i,j})

The complexity to compute the vector mean is O(db), where d is the number of

dimensions and b is the bit width.
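For illustration only (the identifiers are hypothetical, and the closed form f(x) = |X| \|x - μ\|^2 comes from the derivation in Section 5.2 rather than from the dissertation's code), the preprocessing computations might be sketched in C++ as follows:

  #include <cmath>
  #include <cstddef>
  #include <vector>

  // Mean of dimension i computed vertically from bit-slice counts:
  //   mu[i] = (1/|X|) * sum_j 2^j * COUNT(P(i,j))
  std::vector<double> verticalMean(const std::vector<std::vector<long long>>& bitCounts,
                                   std::size_t numObjects) {
      std::vector<double> mu(bitCounts.size(), 0.0);
      for (std::size_t i = 0; i < bitCounts.size(); ++i) {
          for (std::size_t j = 0; j < bitCounts[i].size(); ++j)
              mu[i] += std::ldexp(1.0, int(j)) * double(bitCounts[i][j]);
          mu[i] /= double(numObjects);
      }
      return mu;
  }

  // g(x) = ln(f(x) + 1), with f(x) = TV(X, x) - TV(X, mu) = |X| * ||x - mu||^2.
  double gValue(const std::vector<double>& x, const std::vector<double>& mu,
                std::size_t numObjects) {
      double sq = 0.0;
      for (std::size_t i = 0; i < x.size(); ++i)
          sq += (x[i] - mu[i]) * (x[i] - mu[i]);
      return std::log(double(numObjects) * sq + 1.0);
  }

In the preprocessing phase, gValue would be evaluated once for every training object and the resulting values encoded as derived P-trees.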


5.3.2. Classifying Phase

In the classifying phase, the steps are repeated for each unclassified object. We

summarize the steps in the classifying phase as follows:

1. Determine the vector a - μ, where a is the new object and μ is the vector mean of the feature space X.

2. Given an epsilon of the contour (e > 0), determine two vectors located in the lower and

upper side of a by moving e units inward toward μ along the vector a - μ and moving e units outward from a in the opposite direction. Let b and c be the two vectors on the lower and upper side of a, respectively; then the vectors b and c can be determined using the following equations (a sketch of this step is given after the list):

b = a - e \cdot \frac{a - \mu}{\|a - \mu\|}, \qquad c = a + e \cdot \frac{a - \mu}{\|a - \mu\|}

3. Calculate g(b) and g(c) such that g(b) ≤ g(a) ≤ g(c), and determine the interval [g(b),

g(c)] that creates a contour over the functional line. The contour mask of the interval is

created efficiently using the P-tree range query algorithm without having to scan the

functional values one by one. The mask is a bit pattern containing 1s and 0s, where bit

1 indicates that the object is in the contour while 0 indicates otherwise. The objects

within the pre-image of the contour in the original feature space are considered as the

superset of neighbors (the e-neighborhood of a, or Nbrhd(a, e)).

4. Prune the neighborhood using the dimensional projections.


5. Find the k-nearest neighbors from the pruned set by measuring the Euclidean distance d(x, a) = \|x - a\| for every x in the pruned Nbrhd(a, e).

6. Let the k-nearest neighbors vote using a weighted vote to determine the class label of

the unclassified object.
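The following C++ fragment sketches steps 1-3 under the same assumptions as the earlier sketches (the formulas for b and c are the reconstruction given in step 2, and all identifiers are hypothetical):

  #include <cmath>
  #include <cstddef>
  #include <utility>
  #include <vector>

  // Given the query a, the mean mu, the set size |X|, and the contour width e,
  // compute the endpoints [g(b), g(c)] of the total variation contour.
  // Assumes a != mu, so that the direction (a - mu) is well defined.
  std::pair<double, double> contourInterval(const std::vector<double>& a,
                                            const std::vector<double>& mu,
                                            double e, std::size_t numObjects) {
      auto g = [&](const std::vector<double>& p) {     // g(p) = ln(|X|*||p - mu||^2 + 1)
          double sq = 0.0;
          for (std::size_t i = 0; i < p.size(); ++i)
              sq += (p[i] - mu[i]) * (p[i] - mu[i]);
          return std::log(double(numObjects) * sq + 1.0);
      };

      double len = 0.0;                                // ||a - mu||
      for (std::size_t i = 0; i < a.size(); ++i)
          len += (a[i] - mu[i]) * (a[i] - mu[i]);
      len = std::sqrt(len);

      std::vector<double> b(a.size()), c(a.size());
      for (std::size_t i = 0; i < a.size(); ++i) {
          double u = (a[i] - mu[i]) / len;             // unit vector along a - mu
          b[i] = a[i] - e * u;                         // e units inward, toward mu
          c[i] = a[i] + e * u;                         // e units outward, away from mu
      }
      return { g(b), g(c) };
  }

The interval [g(b), g(c)] is then turned into a contour mask over the derived P-trees with the range query algorithm of Figures 12 and 13.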

5.3.3. Detailed Description of the Proposed Algorithm

Consider objects in 2-dimensional space. Initially, the algorithm determines the

vector a - μ. Subsequently, the two vectors b and c located on the lower and upper sides of a are determined. The interval [g(b), g(c)] on the functional line will

form a contour, and the objects within the pre-image of the contour in the original feature

space are considered as the superset of neighbors (Figure 11).

Figure 11. The pre-image of the contour of interval [g(b), g(c)] creates a Nbrhd(a, e).

The mask of the superset of neighbors (the candidate set) is created efficiently using

the P-tree range query algorithm without the need to scan the functional values one by one.


We summarize the P-tree range query algorithm in Figure 12, and the algorithm to create a

contour mask in Figure 13.

ALGORITHM: RangeQuery(v)
INPUT: Derived P-trees Pb-1, ..., P1, P0
OUTPUT: Ptree(PT > v)

LET v = vb-1 ... v1 v0
k = 0
while(vk)
  k = k + 1
endwhile
if(k < b)
  PT = Pk
for i = k+1 to b-1
  if(vi)
    PT = PT & Pi
  else
    PT = PT | Pi
endfor
RETURN PT

Figure 12. P-tree range query algorithm.

ALGORITHM: ContourMask(lower, upper)
INPUT: Derived P-trees Pb-1, ..., P1, P0
OUTPUT: P-tree mask of the contour

PU = RangeQuery(upper)   // objects with functional value > upper
PL = RangeQuery(lower)   // objects with functional value > lower
RETURN PL & PU'          // PU' denotes the complement of PU

Figure 13. Algorithm to create a contour mask.

Since the total variation contour is annular around the mean, the candidate set may

contain neighbors that are actually far from the unclassified object a, e.g. located within the


contour but on the opposite side of a. Therefore, a pruning technique is needed to eliminate the superfluous neighbors that may be present in the candidate set.

In the proposed algorithm, the pruning technique that uses dimensional projections

is introduced. For each dimension, a dimensional projection contour is created around the

vector element of the unclassified object in that dimension. The size of the contour is

specified by moving e units away from the element of vector a in that dimension on both sides, as illustrated in Figure 14. The same epsilon previously used to determine the vectors b and c is used again in this case.

Figure 14. An illustration of the dimensional projection contour.

The dimensional projection requires no additional derived P-trees since the training

set is already represented in P-trees vertical structure. The P-trees in each dimension can be

used directly at no extra cost, and the objects within the contour can be identified

efficiently using the same contour mask algorithm summarized in Figure 13.

A parameter, called MS (manageable size of the candidate set), is required for

pruning. This parameter specifies the upper bound of neighbors in the candidate set so that,


when the manageable size of neighbors is reached, the pruning process will be terminated,

and the number of neighbors in the candidate set is considered small enough to be scanned

to search for the k-nearest neighbors.

The pruning technique consists of two major steps. First, it obtains the count of

neighbors in the pre-image of the total variation contour (candidate set) relative to a

particular dimension. The rationale is to maintain the neighbors that are predominant (close

to the unclassified object) in most dimensions so that, when the Euclidean distance is

measured (step 5 of the classifying phase), they are the true closest neighbors. The process

of obtaining the count starts from the first dimension. The dimensional projection contour

around the unclassified object, a, is formed, and the contour mask is created. Again, the

mask is simply a bit pattern containing 1s and 0s, where bit 1 indicates that the object

belongs to the candidate set when projected on that dimension while 0 indicates otherwise.

The contour mask is then AND-ed with the mask of the pre-image of the total variation

contour, and the total number of 1s is counted. Note that no neighbors are pruned at this

point; only the count of 1s is obtained. The process continues for all dimensions, and at the

end of the process, the counts are sorted in descending order.

The second step of the pruning is to intersect each dimensional projection contour

with the candidate set. The intersection starts based on the order of the count. The

dimension with the highest count is intersected first, followed by the dimension with the

second most count, and so forth. In each intersection, the number of neighbors in the

candidate set is updated. From the implementation perspective, this intersection is simply a

logical AND operation between the mask of the total variation contour and the mask of the

dimensional projection contour. The second step of the pruning technique continues until a

manageable size of neighbors is reached or all dimensional projection contours have been


intersected with the candidate set. Figure 15 summarizes the pseudo code of the pruning

algorithm, and Figure 16 illustrates the intersection between the pre-image of the total

variation contour and the dimensional projection contours.

ALGORITHM: pruning()
INPUT: Pb-1, ..., P1, P0, PC, a, e, MS
OUTPUT: PC - the mask of the pruned candidate set
// a is the unclassified object
// MS is the manageable size
// PC is the mask of the candidate set
// PX is the mask of the dimensional projection
i = 0
while(i < TOTAL_DIMENSION) do
  PX = ContourMask(ai - e, ai + e)
  tc = COUNT(PC & PX)
  if(tc != 0)
    countArray.add(tc, PX)
  endif
  i = i + 1
endwhile
sort.countArray(descending on the count)
i = 0
while(i < LENGTH(countArray)) do
  PX = countArray.second()   // get the dimensional projection mask
  tc = COUNT(PC & PX)
  if(tc != 0)
    PC = PC & PX
    if(tc < MS)
      break
    endif
  endif
  i = i + 1
endwhile
RETURN PC

Figure 15. Pruning algorithm.

[Figure: the pre-image of the total variation contour around a, intersected with the dimensional projection contours on the x and y dimensions.]

Figure 16. Pruning the neighbor set using dimensional projections.

5.4. Illustrative Examples of the Pruning Technique

The following example will illustrate how the neighbors in the candidate set are

pruned. Assume that there are two classes P and Q, each of which contains 15 and 10

points, respectively. The distribution of the points is depicted in Figure 17. The unclassified

object is denoted by the character a, and the two vectors b and c along the vector (a - μ) that will

form the total variation contour are also shown in the figure.

Assume that the points in the pre-image of the total variation contour (candidate set)

are {p2, p4, p6, p7, p8, p9, p11, p14, q2, q4, q7, q9}. In the first step of the pruning, the count of

neighbors in the total variation contour relative to each dimension is obtained. In this

example, the count of neighbors in the candidate set when projected on X dimension is 5,

i.e. {p2, p6, p7, p8, p14}, and the count of neighbors in the candidate set when projected on Y

dimension is 4, i.e. {p8, p14, q4, q9}. In the second step of the pruning, the candidate set is

intersected with the dimensional projection contour of X dimension because the number of

neighbors on this dimension is predominant in the candidates set. The pruned neighbor set

will be as follows:


Candidate set = {p2, p4, p6, p7, p8, p9, p11, p14, q2, q4, q7, q9} ∩ {p2, p6, p7, p8, p14}

= {p2, p6, p7, p8, p14}

Figure 17. Pruning example 1.

If in this example the manageable size of neighbors was set to 5, the pruning will

terminate because the manageable size of neighbors is reached, and the final pruned

neighbor set will be {p2, p6, p7, p8, p14}. However, if the manageable size of neighbors was

set to 3, the pruning continues. The candidate set is further intersected with the dimensional

projection contour of Y dimension. If this is the case, the pruned neighbor set is as follows:

Candidate set = {p2, p6, p7, p8, p14} ∩ {p8, p14, q4, q9}

= {p8, p14}


Let us consider another example as illustrated in Figure 18. In this example, the

unclassified object, a, is located close to the points in class Q. If the same e-contour is used,

then the initial points in the candidate set will be the same as in the previous example.

Figure 18. Pruning example 2.

In the first step of the pruning, the count of neighbors in the total variation contour

relative to each dimension is obtained. In this example, the count of neighbors in the

candidate set when projected on X dimension is 4, i.e., {q2, q4, q7, q9}, and the count of

neighbors in the candidate set when projected on Y dimension is 5, i.e., {q2, q4, q7, q9, p14}.


Since dimension Y has predominant neighbors in the candidate set, in the second

step of the pruning, the neighbors in this dimension are intersected with the neighbors in

the candidate set. After intersection, the pruned neighbor set is as follows:

Candidate set = {p2, p4, p6, p7, p8, p9, p11, p14, q2, q4, q7, q9} ∩ {p14, q2, q4, q7, q9}

= {p14, q2, q4, q7, q9}

Again, if the manageable size of neighbors was set to 5, the pruning will terminate

because the manageable size of neighbors is reached, and the final pruned neighbor set is

{p14, q2, q4, q7, q9}. However, if the manageable size of neighbors was set to 3, the pruning

continues. The candidate set is further intersected with the dimensional projection contour

of X dimension. The final pruned neighbor set is as follows:

Candidate set = {p14, q2, q4, q7, q9} ∩ {q2, q4, q7, q9} = {q2, q4, q7, q9}

In this case, although the manageable size of neighbors is not reached, the pruning will also

terminate because all dimensions have been projected.

5.5. Weighting Function

In nearest neighbor classification algorithms, the neighbors that are closer to the

unclassified object should vote more than the far neighbors. Each neighbor should cast a vote

with a certain weight depending on the distance of the neighbor to the new sample.

Different weighting functions have been introduced in the literature. According to

Atkenson [35], the requirements on a weighting function are that the maximum value of the

weighting function should be at zero distance and the weight should decrease gradually as

the distance increases. Figures 19-21 illustrate some of the weighting functions adapted

from [35].


[Figure: plot of weight versus distance d for the weighting function w = 1/(1 + d).]

Figure 19. Weighting function w = 1/(1 + d).

[Figure: plot of weight versus distance d for the weighting function w = 1 - d.]

Figure 20. Weighting function w = 1 - d.


[Figure: plot of weight versus distance d for the weighting function w = exp(-d^2).]

Figure 21. Weighting function w = exp(-d^2).

In the proposed algorithm, we use w = exp(-d^2) as the weighting function to influence the vote of the k-nearest neighbors. The weighting function is a Gaussian function that gives a smooth drop-off based on the distance of the neighbors to the unclassified object. The closer the neighbor, the higher the weight, and vice versa.
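A minimal weighted-vote sketch in C++, assuming the Gaussian weight above (the class labels and the precomputed distances are placeholders):

  #include <cmath>
  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  // Each neighbor casts a vote of weight exp(-d^2), where d is its
  // Euclidean distance to the unclassified object.
  std::string weightedVote(const std::vector<std::pair<std::string, double>>& neighbors) {
      std::map<std::string, double> tally;          // class label -> accumulated weight
      for (const auto& nb : neighbors)
          tally[nb.first] += std::exp(-nb.second * nb.second);

      std::string best;
      double bestWeight = -1.0;
      for (const auto& entry : tally)
          if (entry.second > bestWeight) { best = entry.first; bestWeight = entry.second; }
      return best;
  }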

5.6. Performance Analysis

We report the performance analysis in this section. The analysis was performed on

an Intel Pentium 4 CPU 2.6 GHz machine with 3.8GB RAM, running Red Hat Linux

version 2.4.20-8smp. We compared the proposed algorithm with the classical KNN

algorithm, the P-tree based KNN (P-KNN) using HOBBIT similarity, and the P-KNN

using EIN-ring neighborhood. The KNN algorithm was locally implemented, and it

sequentially scans the training space to find the k-nearest neighbors. All algorithms were


implemented in the C++ programming language. For the proposed algorithm, P-Tree API

was also incorporated in the implementation.

In this performance analysis, two main aspects were analyzed: 1) The running time

(scalability) of the algorithms, and 2) the classification accuracy. While some algorithms

sacrifice the accuracy for speed, or vice versa, in this performance evaluation, we will also

demonstrate that the proposed algorithm is not only fast and scalable, but also has good

classification accuracy. The accuracy is measured using F score, a common score used to

measure the classification accuracy [1]. The F score is defined as follows:

F = \frac{2 P R}{P + R}

where P is the precision and R is the recall. The precision measures the ratio of correct

assignment of a class and the total number of objects assigned to that class, whereas recall

measures the ratio of correct assignment of a class and the actual number of objects in that

class. The F score further takes the ratio of these two measurements and has a score in the

range of 0 to 1. The higher the score, the better the classification accuracy is.
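For example (the numbers here are purely illustrative), a class predicted with precision P = 0.90 and recall R = 0.80 has F = 2(0.90)(0.80) / (0.90 + 0.80) ≈ 0.85.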

5.6.1. Datasets

We evaluated the algorithms using several datasets. Some of the datasets were taken

from the Repository of Machine Learning Databases at the University of California, Irvine

(UCI) [36]. The datasets in this repository are regarded as the benchmark datasets to

evaluate machine learning and data-mining algorithms. We also incorporated other real-life


datasets, such as Remotely Sensed Imagery (RSI), OPTICS, and Iris datasets. The

description of each dataset is as follows:

RSI dataset: This dataset is a set of aerial photographs from the Best Management

Plot (BMP) of Oakes Irrigation Test Area (OITA) near Oakes, North Dakota, taken in

1998. The images contain three bands: red, green, and blue. Each band has values in the

range of 0 and 255 which, in binary, can be represented using 8 bits. The corresponding

synchronized data for soil moisture, soil nitrate, and crop yield were also used, and the crop

yield was selected as the class attribute. Combining all the bands and synchronized data, a

dataset with 6 dimensions (5 feature attributes and 1 class attribute) was obtained.

To simulate different classes, the crop yield was divided into four different

categories: low yield having intensity between 0 and 63, medium low yield having intensity

between 64 and 127, medium high yield having intensity between 128 and 191, and high

yield having intensity between 192 and 255. Three synthetic datasets were generated to

study the scalability and running time of the proposed algorithm. The cardinality of these

datasets varies from 32 to 96 million.

KDDCUP 1999 dataset: This dataset is the network intrusion dataset used in

KDDCUP 1999 [37]. The dataset contains more than 4.8 million samples from the TCP

dump. Each sample identifies a type of network intrusion. We selected six types of

intrusion, Normal, IP Sweep, Neptune, Port Sweep, Satan, and Smurf, each of which

contains at least 10,000 samples. The distribution of data in each class is tabulated in Table

7. A total of 32 numerical attributes were found after discarding the categorical attributes.

We randomly selected 120 samples, 20 samples for each class, for the testing sets.

Wisconsin Diagnostic Breast Cancer (WDBC) [36]: This dataset contains 569

diagnosed breast cancer patients with 30 real-valued features. The dataset was donated by


Nick Street in November 1995. The task is to predict two types of diagnoses as either

Benign (B) or Malignant (M). The distribution of data is 357 Benign and 212 Malignant.

Table 7. Class distribution of KDDCUP dataset.

Class Number of Objects

Normal 972,780

IP Sweep 12,481

Neptune 1,072,017

Port Sweep 10,413

Satan 15,892

Smurf 2,807,886

OPTICS dataset [38]: OPTICS dataset was originally used for clustering problems.

It has eight different clusters, and two of them are embedded clusters. The dataset contains

8,000 points in 2-dimensional space. We carefully added a class label to each data point

based on the original clusters and labeled as CL-1, CL-2, CL-3, CL-4, CL-5, CL-6, CL-7,

and CL-8. We randomly selected 80 points, 10 points for each class, for the testing sets.

Iris dataset [39]: The Iris plants dataset was created by R.A. Fisher. The dataset is

very popular in the machine learning community. The task is to classify Iris plants into one

of three Iris plants varieties: Iris setosa, Iris versicolor, and Iris virginica. The dataset

contains 150 instances (50 instances in each class) and is represented in a 4-dimensional space

(sepal length, sepal width, petal length, and petal width). Iris setosa is linearly separable to

the other classes. We randomly selected 30 samples for the testing sets.

We normalized all datasets to prevent attributes with initially large ranges from

outweighing attributes with smaller ranges [40]. According to Han [1], data normalization

is often used for methods involving distance measurements. Normalization scales the


values in the attribute to a small range and makes each attribute have equal emphasis and

same range. In this work, we used min-max normalization technique to normalize attribute

values, which after normalization the values will be in the range of 0 and 1. The min-max

normalization is defined as follows:
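For instance, a band value of 128 in an attribute whose range is [0, 255] is normalized to (128 - 0) / (255 - 0) ≈ 0.50.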

5.6.2. Parameterization

The proposed algorithm requires three parameters. We will discuss each parameter

in this section.

1. Epsilon (e) is a positive value (e > 0) that specifies the width of the total variation and

dimensional projection contours. The epsilon should be specified accordingly because

if it is too big, then the number of neighbors included in the total variation contour can

be very large. A large epsilon also means a wide dimensional projection contour. A

very wide contour can cause the objects that are far away from the unclassified object

being included in the contour. On the other hand, if the epsilon is specified too small,

the number of neighbors in the total variation contour can be very few. Thus, to specify

the epsilon, it is suggested that, during the learning phase, the epsilon be tuned a few times until the best classification result is achieved.

2. The manageable size of neighbors (MS) is the parameter used for pruning. This

parameter is one of the termination conditions of the pruning step. Note that the

pruning terminates when the manageable size of neighbors is reached or all dimensions

have been examined. It is suggested that the value for this parameter is not too large

because k-nearest neighbors are searched from the pruned neighbor set by scanning

them one by one. A value in the range of 200 - 2000 perhaps can be used for this


parameter since scanning that many neighbors using the-state-of-the-art PC available

today can be very fast.

3. Number of nearest neighbor (k) specifies the number of nearest neighbors that cast a

vote to determine the class label of the unclassified object.

5.6.3. Classification Accuracy Comparison

We examined the classification accuracy using KDDCUP 1999, Wisconsin

Diagnostics Breast Cancer (WDBC), OPTICS, and Iris datasets. We used a 5-fold cross-validation evaluation model for all datasets. We randomly divided the datasets into disjoint

training and testing subsets, 5 different times. The algorithms were tested using each

disjoint subset, and the performance results were averaged over all evaluations. We

compared the SMART-TV algorithm with P-KNN using HOBBIT, P-KNN using Equal Interval

Neighborhood (EIN-ring), and KNN with linear search.

Tables 8-10 summarize the classification accuracy on the KDDCUP dataset for

k = 3, k = 5, and k = 7. We discovered that all algorithms produced good classification

accuracy. In terms of speed, SMART-TV is faster than the other algorithms. SMART-TV

takes about 9.93 seconds on an average to classify, while P-KNN with HOBBIT and KNN

take approximately 30.79 and 271.50 seconds, respectively. We used e = 0.005 and

MS = 1000 for SMART-TV. It is important to note that P-KNN with EIN-ring was not able

to run successfully on this dataset due to a memory allocation error.

Tables 11-13 show the classification accuracy on the WDBC dataset for k = 3,

k = 5, and k = 7. The WDBC dataset has 30 real-valued features. The task is to predict the

diagnoses of the breast cancer. The results show that SMART-TV, P-KNN using EIN-ring,


and KNN perform better than P-KNN using HOBBIT. We used e = 0.2 and MS = 50 for

SMART-TV.

Table 8. Classification accuracy on the KDDCUP dataset for k = 3.

Class             Classification Accuracy
                  SMART-TV     P-KNN HOBBIT     KNN

Normal 0.83 0.91 0.87

IP Sweep 0.95 0.97 1.00

Neptune 1.00 0.97 0.98

Port Sweep 0.97 0.89 0.95

Satan 0.83 0.82 0.80

Smurf 0.97 0.91 1.00

Table 9. Classification accuracy on the KDDCUP dataset for k = 5.

Class             Classification Accuracy
                  SMART-TV     P-KNN HOBBIT     KNN

Normal 0.82 0.91 0.87

IP Sweep 0.95 0.98 1.00

Neptune 1.00 0.98 0.97

Port Sweep 0.97 0.89 0.95

Satan 0.80 0.82 0.80

Smurf 0.97 0.91 1.00

Table 10. Classification accuracy on the KDDCUP dataset for k = 7.


Class             Classification Accuracy
                  SMART-TV     P-KNN HOBBIT     KNN

Normal 0.82 0.91 0.87

IP Sweep 0.95 0.97 1.00

Neptune 1.00 0.97 0.98

Port Sweep 0.97 0.89 0.97

Satan 0.80 0.82 0.82

Smurf 0.97 0.91 1.00

Table 11. Classification accuracy on the WDBC dataset for k = 3.

Class             Accuracy
                  SMART-TV     P-KNN HOBBIT     P-KNN EIN-Ring     KNN

Benign 0.96 0.70 0.96 0.98

Malignant 0.96 0.23 0.96 0.98

Table 12. Classification accuracy on the WDBC dataset for k = 5.

Class             Accuracy
                  SMART-TV     P-KNN HOBBIT     P-KNN EIN-Ring     KNN

Benign 0.95 0.68 0.96 0.97

Malignant 0.95 0.10 0.96 0.97

Table 13. Classification accuracy on the WDBC dataset for k = 7.

Class             Accuracy
                  SMART-TV     P-KNN HOBBIT     P-KNN EIN-Ring     KNN

Benign 0.95 0.68 0.98 0.97

Malignant 0.95 0.10 0.98 0.97


Tables 14-16 show the classification accuracy on the OPTICS dataset for k = 3,

k = 5, and k = 7. We used e = 0.1 and MS = 200 for SMART-TV. The result shows that

P-KNN with HOBBIT is slightly more accurate than the other algorithms on classes CL-5

and CL-6. These two classes are the embedded classes, i.e., classes located inside another class but with a different density. Some of the testing instances from these two classes were

classified incorrectly by the algorithms. However, in general, all algorithms performed

equally well.

Tables 17-19 summarize the classification accuracy on Iris dataset. Iris setosa is the

only class in the dataset that is linearly separable to the other classes. SMART-TV, P-KNN

using EIN-ring, and KNN classified this class accurately without any error. When P-KNN

uses HOBBIT as the similarity metric, it missed some of the classes. For SMART-TV, we

used e = 0.2 and MS = 20.

Table 14. Classification accuracy comparison on the OPTICS dataset for k = 3.

Class          Classification Accuracy
               SMART-TV with P-tree     P-KNN HOBBIT     P-KNN EIN-Ring     KNN

CL-1 1.00 1.00 1.00 1.00

CL-2 1.00 1.00 1.00 1.00

CL-3 1.00 1.00 1.00 1.00

CL-4 1.00 1.00 1.00 1.00

CL-5 0.94 0.96 0.93 0.94

CL-6 0.94 0.96 0.93 0.94

CL-7 1.00 1.00 1.00 1.00

CL-8 1.00 1.00 1.00 1.00

Table 15. Classification accuracy comparison on the OPTICS dataset for k = 5.


Class          Classification Accuracy
               SMART-TV with P-tree     P-KNN HOBBIT     P-KNN EIN-Ring     KNN

CL-1 1.00 1.00 1.00 1.00

CL-2 1.00 1.00 1.00 1.00

CL-3 1.00 1.00 1.00 1.00

CL-4 1.00 1.00 1.00 1.00

CL-5 0.95 0.96 0.94 0.95

CL-6 0.95 0.96 0.94 0.95

CL-7 1.00 1.00 1.00 1.00

CL-8 1.00 1.00 1.00 1.00

Table 16. Classification accuracy comparison on the OPTICS dataset for k = 7.

Class          Classification Accuracy
               SMART-TV with P-tree     P-KNN HOBBIT     P-KNN EIN-Ring     KNN

CL-1 1.00 1.00 1.00 1.00

CL-2 1.00 1.00 1.00 1.00

CL-3 1.00 1.00 1.00 1.00

CL-4 1.00 1.00 1.00 1.00

CL-5 0.95 0.95 0.94 0.95

CL-6 0.95 0.95 0.94 0.95

CL-7 1.00 1.00 1.00 1.00

CL-8 1.00 1.00 1.00 1.00

Table 17. Classification accuracy on the Iris dataset for k = 3.


Class             Classification Accuracy
                  SMART-TV     P-KNN HOBBIT     P-KNN EIN-Ring     KNN

Iris setosa 1.00 0.93 1.00 1.00

Iris versicolor 0.93 0.88 0.96 0.93

Iris virginica 0.93 0.91 0.96 0.93

Table 18. Classification accuracy on the Iris dataset for k = 5.

Class             Classification Accuracy
                  SMART-TV     P-KNN HOBBIT     P-KNN EIN-Ring     KNN

Iris setosa 1.00 0.90 1.00 1.00

Iris versicolor 0.95 0.84 0.94 0.94

Iris virginica 0.95 0.92 0.94 0.94

Table 19. Classification accuracy on the Iris dataset for k = 7.

Class             Classification Accuracy
                  SMART-TV     P-KNN HOBBIT     P-KNN EIN-Ring     KNN

Iris setosa 1.00 0.84 1.00 1.00

Iris versicolor 0.96 0.71 0.94 0.94

Iris virginica 0.96 0.92 0.94 0.94

We summarize the average classification accuracy of all datasets in Tables 20-22.

From the tables, it can be seen that the classification accuracy of SMART-TV is very

comparable to that of the KNN algorithm for all datasets. The performance of P-KNN

using HOBBIT degrades significantly when classifying the dataset with high dimensions as

shown in the WDBC dataset.

Table 20. Average classification accuracy for k = 3.


Dataset           Classification Accuracy
                  SMART-TV     P-KNN HOBBIT     P-KNN EIN     KNN

KDDCUP 0.93 0.91 - 0.93

OPTICS 0.99 0.99 0.99 0.99

IRIS 0.95 0.91 0.97 0.95

WDBC 0.96 0.47 0.96 0.98

Table 21. Average classification accuracy for k = 5.

Dataset   SMART-TV   P-KNN HOBBIT   P-KNN EIN-Ring   KNN
KDDCUP    0.92       0.92           -                0.93
OPTICS    0.99       0.99           0.99             0.99
IRIS      0.97       0.89           0.96             0.96
WDBC      0.95       0.39           0.96             0.97

Table 22. Average classification accuracy for k = 7.

Dataset   SMART-TV   P-KNN HOBBIT   P-KNN EIN-Ring   KNN
KDDCUP    0.92       0.91           -                0.94
OPTICS    0.99       0.99           0.99             0.99
IRIS      0.97       0.82           0.96             0.96
WDBC      0.95       0.39           0.98             0.97

5.6.4. Classification Time Comparison

We compared the performance in terms of speed using RSI datasets. As mentioned

previously, the cardinality of these datasets varies from 32 to 96 million. In this


experiment, we compared the run time of SMART-TV against SMART-TV with scan, P-

KNN with HOBBIT metric, and a brute-force KNN that sequentially searches for the k-

nearest neighbors from the entire training set. Note that SMART-TV with scan is a slightly

different version of SMART-TV. SMART-TV with scan does not have derived P-trees of

the total variation values of the training objects and determines the superset of neighbors in

the total variation contour by scanning the values of the training objects one by one. While

scanning, the relative indexes of the objects within the contour are stored in an array. We

applied the same pruning technique for SMART-TV with scan because after the scan is

completed, a candidate mask can be created using the current version of P-Tree API by

passing the array that holds the indexes of objects within the total variation contour as the

parameter. However, as we will see in the experimental results, even though the scan is conducted on a single dimension, i.e., the total variation values of the training objects, SMART-TV with scan takes more time than SMART-TV with P-tree, which uses the P-tree range query algorithm to determine the objects within the contour.
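For concreteness, the following minimal sketch (plain C++, independent of the P-tree API; the names tvValues and selectContourCandidates are hypothetical) illustrates the scan step that SMART-TV with scan performs: a linear pass over the precomputed total variation values that collects the relative indexes of the objects inside the contour interval. SMART-TV with P-tree obtains the same candidate set as a mask from the range query instead of this scan.

#include <cstddef>
#include <vector>

// SMART-TV with scan: given the precomputed total variation (functional) values
// of the training objects, collect the relative indexes of the objects whose
// value falls inside the contour interval [lower, upper]. SMART-TV with P-tree
// obtains the same candidate set as a bit mask through the P-tree range query
// instead of this linear scan.
std::vector<std::size_t> selectContourCandidates(
    const std::vector<double>& tvValues, double lower, double upper) {
  std::vector<std::size_t> candidates;
  for (std::size_t i = 0; i < tvValues.size(); ++i) {
    if (tvValues[i] >= lower && tvValues[i] <= upper)
      candidates.push_back(i);  // indexes later passed to the pruning step
  }
  return candidates;
}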

Figure 22 shows the run time and scalability comparison of the algorithms. The

complete run time of each algorithm is summarized in Table 23. We learned that SMART-

TV with P-tree and P-KNN are very comparable in terms of speed. Both algorithms are

faster than the other algorithms. For example, P-KNN takes 12.37 seconds on average to classify using a dataset of size 96 million, while SMART-TV with P-tree takes 17.42 seconds. For the same dataset, SMART-TV with scan takes about 106.76 seconds to classify and KNN takes 891.58 seconds. KNN with sequential scan is thus roughly 50 to 70 times, nearly two orders of magnitude, slower than the SMART-TV with P-tree and P-KNN algorithms. In addition, KNN also requires more time and more resources to load the datasets into memory.

Since the datasets in the form of horizontal structure could not be loaded entirely into


memory at once, for KNN, the datasets were loaded partially. From this observation, it is

clear that when the cardinality increases, the classification time of KNN increases linearly.

Conversely, the algorithms that employ P-tree vertical data structure are much faster.

SMART-TV with scan is faster than KNN because it only scans a single dimension and

compares the functional values of the unclassified object with the functional values of the

training objects. In addition, the pruning technique is also incorporated in the algorithm. KNN, on the other hand, has to scan the data and compute the Euclidean distance at the same time.

However, SMART-TV with scan is slower than P-KNN and SMART-TV with P-tree.

[Figure 22 plots the classification time (seconds per sample) against the dataset cardinality (32, 64, and 96 million) for SMART-TV with P-tree, SMART-TV with scan, P-KNN with HOBBIT, and KNN.]

Figure 22. Run time and scalability comparison on the RSI dataset.

Table 23. Run time and scalability comparison on the RSI dataset.

Dataset   SMART-TV with P-tree   SMART-TV with Scan   P-KNN HOBBIT   KNN Classifying   KNN Loading
32M       5.77                   109.31               4.49           296.90            553
64M       11.66                  218.11               8.28           593.71            1024
96M       17.42                  324.03               12.37          891.58            1536

(Time in seconds; k = 5, e = 0.5, and MS = 1000.)

Table 24 shows the preprocessing time taken by SMART-TV. We can see from the

table that most of the time was consumed to compute the functional values g(a) of the

training objects. However, this is a one-time process that, on average, takes only approximately 0.000098 seconds per training object. We believe that this preprocessing time can be amortized

when the number of unclassified objects being classified is very large.

Table 24. Preprocessing time of SMART-TV algorithm on the RSI dataset.

Dataset   Computing Vector Mean   Computing Functional Values
32M       1.42                    3,128.75
64M       2.75                    6,257.49
96M       3.81                    9,386.24

(Time in seconds.)

We summarize the classification time of each algorithm in Table 25 for the KDDCUP, OPTICS, IRIS, and WDBC datasets. For a large training set like the KDDCUP dataset, KNN is roughly an order of magnitude slower than the vertical nearest neighbor algorithms (Figure 23). SMART-TV and P-KNN with HOBBIT are fast and very comparable. P-KNN using EIN-ring takes more time to classify because it has to build the rings around the unclassified object, which requires many logical AND operations. Moreover, for low-dimensional datasets such as the RSI datasets, P-KNN using HOBBIT is faster than the SMART-TV algorithm.


Table 25. Average classification time.

Dataset   SMART-TV with P-tree   SMART-TV with Scan   P-KNN HOBBIT   P-KNN EIN-Ring   KNN
KDDCUP    9.93                   15.02                30.79          -                271.5
OPTICS    0.022                  0.027                0.002          0.480            0.061
IRIS      0.003                  0.004                0.007          2.540            0.002
WDBC      0.024                  0.026                3.570          61.490           0.030

(Classification time in seconds.)

[Figure 23 plots the classification time (seconds per sample) on the KDDCUP dataset for SMART-TV with P-tree, SMART-TV with scan, P-KNN HOBBIT, and KNN.]

Figure 23. Classification time on the KDDCUP dataset.

5.7. Conclusion

In this chapter, we have proposed a new nearest neighbor based classification

algorithm that efficiently finds the candidates of neighbors by creating a total variation

contour around the unclassified object. The objects within the contour are considered as the

superset of nearest neighbors (candidate set) and further pruned before the k-nearest


neighbors are searched from the set. After pruning, the k-nearest neighbors vote to

determine the class label of the unclassified object. We conclude from this work that a scalable and highly accurate nearest neighbor classification algorithm has been developed.

We have also introduced a pruning technique that uses dimensional projections. We believe that this novel pruning technique can only be incorporated efficiently when the training set is represented in the P-tree vertical data structure. When the training set is not represented in the P-tree vertical structure, one must scan each dimension repeatedly to count the number of candidate neighbors that fall within the projection in each dimension. In large datasets, such an approach is impractical and inefficient.

One observed limitation of the proposed pruning technique is that in high-dimensional datasets, the number of dimensions that needs to be examined is large. In such cases, the pruning will take more time.

We have conducted performance evaluation in terms of speed, scalability, and

classification accuracy. We found that the proposed algorithm is fast and scalable to very

large datasets. We conclude that in terms of speed and scalability, the proposed algorithm

is comparable to the other vertical nearest neighbor algorithms. In terms of classification

accuracy, the proposed algorithm is very comparable to that of the classical KNN classifier.


CHAPTER 6. THE APPLICATION OF THE PROPOSED

ALGORITHM IN IMAGE CLASSIFICATION³

6.1. Introduction

The recent emergence of digital images makes the organization of images into

semantic categories for effective browsing and retrieval an interesting and challenging

problem. In small image repositories, manual image categorization to some extent can be

used to label images. But in large image repositories, such an approach becomes impractical.

Different classification algorithms have been proposed in the literature to categorize

digital images, such as Bayesian classifier [41], Support Vector Machine [42], Neural

Network [34], and k-nearest neighbor (KNN) classifiers [43]. KNN classifiers are commonly used due to their simplicity and good performance. One potential drawback of KNN classifiers is that finding the k-nearest neighbors can be time-consuming when image repositories are very large, i.e., contain millions of images.

Various techniques have been proposed to accelerate the k-nearest neighbor search in large image repositories, including the use of indexes and tree structures such as k-d trees. The

algorithm proposed in [12] improves the k-nearest neighbor search for image retrieval. The

algorithm decomposes each training dimension and maintains it in a separate table. The

first several dimensions are scanned horizontally to find the partial distance between the

query image and the data in the repository. Good speedup was reported when the algorithm was used to search for the nearest neighbors. However, the time for scanning the dimensions becomes significant when the database is very large.

3 This chapter is a modified version of a published paper that appeared in the Proceedings of the 1st IEEE International Workshop on Multimedia Databases and Data Management 2006 (IEEE MDDM 06), Atlanta, Georgia, USA, April 8, 2006, with a slightly different title.


Another line of research is to reduce image feature dimensions to alleviate the intensive distance computation involved. A technique such as Principal Component Analysis (PCA) is commonly used. Multi-resolution feature representation has also been investigated. As opposed to the dimensionality reduction approach, in the multi-resolution approach, the feature vectors of images are represented in multiple resolutions. Similarity search starts at the

low-resolution level. If the distance is greater than the minimum distance bound at this

level, the candidates are removed without calculating the full-resolution distance. This

approach can reduce the computation complexity dramatically [44].

This chapter is intended to study the proposed algorithm, SMART-TV, when

applied to an image classification task. In image classification, the images are represented

in a high dimensional space, constructed from color distribution, image texture, image

structure (shape) or a combination [45]. In this work, we use color and texture features to

represent the images. The combination of these two features creates a 70-dimensional

feature vector for each image. P-trees are then generated from these feature vectors.

The empirical experiments on the Corel dataset [46] show that SMART-TV works well on a high-dimensional dataset. In addition, the classification accuracy is high and very comparable to that of the KNN classifier.

6.2. Image Preprocessing

Color and texture features are explored in this work. These features are extracted

from the original pixel representation of the images and used as image representatives. We

used MATLAB software to extract both color and texture features.

For the color feature, we created a 54-dimension global color histogram (6x3x3) in

HSV color space such that the Hue component (a.k.a. the gradation of the color) is


partitioned into 6 bins; the Saturation (grayness) and Value (brightness) components are divided into 3 bins each. The combination of these bins produces a 54-dimensional global color histogram. No universal ratio has been defined for this partitioning; the main consideration is the tradeoff between computational cost and performance. The HSV color model is preferable to an alternative model, such as RGB, because it is a uniform color model with color values in the range of 0 to 1 (normalized).
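As an illustration of the binning just described, the sketch below (plain C++; the HsvPixel struct and hsvHistogram function are hypothetical names, and the H, S, and V components are assumed to be already normalized to [0, 1]) builds the 6x3x3 = 54-bin global color histogram.

#include <algorithm>
#include <array>
#include <vector>

struct HsvPixel { double h, s, v; };  // each component assumed normalized to [0, 1]

// Build the 6x3x3 global HSV color histogram: 6 hue bins, 3 saturation bins,
// and 3 value bins give 54 bins in total, normalized to relative frequencies.
std::array<double, 54> hsvHistogram(const std::vector<HsvPixel>& pixels) {
  std::array<double, 54> hist{};
  for (const HsvPixel& p : pixels) {
    int hb = std::min(5, static_cast<int>(p.h * 6.0));
    int sb = std::min(2, static_cast<int>(p.s * 3.0));
    int vb = std::min(2, static_cast<int>(p.v * 3.0));
    hist[(hb * 3 + sb) * 3 + vb] += 1.0;
  }
  if (!pixels.empty())
    for (double& bin : hist) bin /= static_cast<double>(pixels.size());
  return hist;
}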

We extracted the texture feature of the images using Gabor filters. The MATLAB

codes for Gabor filtering and texture feature extraction were downloaded from

http://vision.ece.ucsb.edu/texture/software/ [47]. Two parameters are needed for generating

the filters. The first is the scale, and the second is the orientation. The scale parameter captures the coarseness of the texture, while the orientation parameter captures the direction of the texture in the image. We used 2 scales and 4 orientations (default values) given in the codes. The

combination of 2 scales and 4 orientations produced 8 filters. We filtered each image using

those 8 filters, and took the mean and standard deviation of the pixels of the filtered image.

Because we have 8 filters, and for each filter, the mean and standard deviation of the

filtered image are computed, a total of 16 texture features were extracted from each image.

We adopted this feature-extraction approach entirely from [47].
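The texture part of the feature vector can be summarized with the following sketch (plain C++; it assumes the eight Gabor-filtered versions of an image are already available as flat pixel arrays, which is what the MATLAB code in [47] produces). It computes the mean and standard deviation of each filtered image to obtain the 16 texture features.

#include <cmath>
#include <vector>

// Given the 8 Gabor-filtered versions of an image (2 scales x 4 orientations),
// compute the mean and standard deviation of the pixels of each filtered image,
// producing 8 x 2 = 16 texture features.
std::vector<double> textureFeatures(const std::vector<std::vector<double> >& filtered) {
  std::vector<double> features;
  for (const std::vector<double>& img : filtered) {
    double mean = 0.0;
    for (double p : img) mean += p;
    mean /= static_cast<double>(img.size());
    double var = 0.0;
    for (double p : img) var += (p - mean) * (p - mean);
    features.push_back(mean);
    features.push_back(std::sqrt(var / static_cast<double>(img.size())));
  }
  return features;  // 16 values when 8 filters are used
}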

Finally, the combination of the 54-dimensional global color histogram and the 16 texture features produced a 70-dimensional feature vector for each image. We normalized these

features using Gaussian normalization so that each dimension has a range between 0 and 1

to put equal emphasis on each feature [45]. Then, P-tree vertical data representations are

generated from these normalized features.
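The exact normalization constants are not restated here, so the sketch below uses one common 3-sigma formulation of Gaussian normalization (plain C++; gaussianNormalize is a hypothetical name) that maps each feature dimension into [0, 1].

#include <algorithm>
#include <cmath>
#include <vector>

// 3-sigma Gaussian normalization of one feature dimension across the dataset:
// values are mapped with (x - mean) / (3 * stddev) to roughly [-1, 1], then
// shifted to [0, 1] and clipped, so every dimension carries comparable weight.
void gaussianNormalize(std::vector<double>& column) {
  double mean = 0.0;
  for (double x : column) mean += x;
  mean /= static_cast<double>(column.size());
  double var = 0.0;
  for (double x : column) var += (x - mean) * (x - mean);
  double stdev = std::sqrt(var / static_cast<double>(column.size()));
  for (double& x : column) {
    double z = (stdev > 0.0) ? (x - mean) / (3.0 * stdev) : 0.0;
    x = std::min(1.0, std::max(0.0, (z + 1.0) / 2.0));
  }
}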


The same preprocessing discussed in Section 5.3.1 is applied to the training image set. In this work, the same logarithmic transformation of the total variation is used again as the functional value. A constant 1 is added to avoid a singularity of the logarithmic transformation when the total variation is zero.

First, the vector mean of the training set is determined. Then, the functional values

of each training image are computed, and derived P-trees of these functional values are

created. The same classifying algorithm defined in Section 5.3.2 is used for classification.
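A minimal horizontal sketch of this preprocessing step is given below (plain C++). It assumes, based on the description above, that the functional value is the logarithm of one plus the total variation of the training set about the object, g(a) = ln(1 + sum over x of ||x - a||^2); the actual implementation computes these quantities vertically from P-tree counts rather than with this explicit loop.

#include <cmath>
#include <cstddef>
#include <vector>

// Total variation of the training set X about a point a: the sum of squared
// Euclidean distances from a to every training vector.
double totalVariation(const std::vector<std::vector<double> >& X,
                      const std::vector<double>& a) {
  double tv = 0.0;
  for (const std::vector<double>& x : X)
    for (std::size_t d = 0; d < a.size(); ++d) {
      double diff = x[d] - a[d];
      tv += diff * diff;
    }
  return tv;
}

// Functional value used for the contour; the constant 1 keeps the logarithm
// well defined when the total variation is zero.
double functionalValue(const std::vector<std::vector<double> >& X,
                       const std::vector<double>& a) {
  return std::log(1.0 + totalVariation(X, a));
}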

6.3. Experimental Results

In order to demonstrate the efficiency and effectiveness of the proposed algorithm

in image classification, we conducted several experiments. The experiments were

performed on an Intel Pentium 4 CPU 2.6 GHz machine with 3.8GB RAM running Red

Hat Linux version 2.4.20-8smp. We had exclusive access to this machine, so the timing results were not affected by other users slowing the machine down.

We compared the proposed algorithm with the same KNN classifier used in the

previous chapter. We used general-purpose Corel images (http://wang.ist.psu.edu/docs/related/) [46] for the performance evaluations. The dataset has 10 categories, each of which

contains 100 images. Figure 24 shows some of the images in each class.

The classification accuracy comparison was performed on this dataset directly.

However, for the scalability and run-time comparisons, we could not use this dataset directly because the number of images is very small. Thus, we randomly generated several larger

datasets from the original Corel dataset. The size of these synthetic datasets varies from

100,000 to 500,000 images.


[Figure 24 shows sample images from each of the ten classes: Africa People and Village, Beach, Building, Bus, Dinosaur, Elephant, Flower, Horse, Mountain & Glacier, and Food.]

Figure 24. The classes of the images.

6.3.1. An Example on Corel Dataset

Figure 25 visualizes an example of how the SMART-TV algorithm classifies a new image from the Corel dataset. In this example, we used k = 5, e = 0.023, and MS = 10. The

number of images selected as the candidates of neighbors is 64. Since the manageable size

of neighbors is specified as 10 and the number of neighbors in the candidate set is greater

than MS, the candidate set is pruned. Only 9 images remain in the candidate set after

pruning.


[Figure 25 shows the new image, the images in the candidate set with their functional values, the pruned set, and the k-nearest neighbors with their Euclidean distances to the new image; the vote histogram assigns the new image to the Dinosaur category.]

Figure 25. Example using Corel dataset with pruning.

Let a be the new image, and let b and c be the two vectors located on the lower and upper sides of a. The algorithm computes the functional values of b and c. For this discussion, let us assume that the functional value of a is also computed, i.e., 7.2890, and the functional


values of b and c are 7.2507 and 7.3267, respectively. The functional value of a should lie between the functional values of b and c, such that g(b) ≤ g(a) ≤ g(c). In this example, we can see that the condition holds. Moreover, the images that are selected as the candidates should have functional values in the range [7.2507, 7.3267]. For clarity, we write the functional value of each image in the candidate and pruned sets under the image. For each image in the k-nearest neighbor set, we write the pairwise Euclidean distance between the image and the new image.

After the interval is found, the algorithm determines the candidates of neighbors using the P-tree range query algorithm summarized in Figure 12. After pruning, since k = 5,

the five nearest neighbors are searched from the pruned set. These k-nearest neighbors are

shown in the last column of the figure. Each of the neighbors is given a weighted vote

based on the Gaussian weighting function so that the closest neighbor will have the highest

weight and the weight decreases gradually as the distance increases.
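The weighted vote can be sketched as follows (plain C++; the width parameter sigma is a hypothetical name, since the exact weighting constants are not restated here). Each of the k nearest neighbors contributes a weight of exp(-d^2 / (2*sigma^2)) to its class, and the class with the largest accumulated weight wins.

#include <cmath>
#include <map>
#include <string>
#include <vector>

struct Neighbor { std::string label; double distance; };

// Gaussian-weighted voting over the k nearest neighbors: closer neighbors get
// larger weights, and the weight decays smoothly as the distance grows.
std::string gaussianVote(const std::vector<Neighbor>& neighbors, double sigma) {
  std::map<std::string, double> votes;
  for (const Neighbor& n : neighbors)
    votes[n.label] += std::exp(-(n.distance * n.distance) / (2.0 * sigma * sigma));
  std::string best;
  double bestWeight = -1.0;
  for (const auto& v : votes)
    if (v.second > bestWeight) { bestWeight = v.second; best = v.first; }
  return best;
}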

6.3.2. Classification Accuracy

The classification accuracy is measured using the F-score, as discussed previously in Chapter 5. We used a variant of 5-fold cross-validation to test the accuracy. We randomly produced five different disjoint training and testing sets. For each testing set, 50 images were randomly selected so that each class contains 5 images. The remaining 950 images were used as the training set. The accuracy results are averaged over all disjoint

subsets. Table 26 summarizes the classification accuracy using different k. We used e = 0.1

and MS = 300 for SMART-TV.

Table 26. Classification accuracy comparison using k = 3, k = 5, and k = 7.

Category                     SMART-TV with P-tree        KNN
                             k = 3   k = 5   k = 7       k = 3   k = 5   k = 7
African People and Village   0.72    0.80    0.73        0.84    0.81    0.81
Beach                        0.70    0.70    0.71        0.75    0.85    0.80
Building                     0.61    0.62    0.60        0.55    0.65    0.69
Buses                        0.83    0.85    0.83        0.85    0.90    0.88
Dinosaur                     0.94    0.96    0.96        0.94    0.94    0.94
Elephant                     0.56    0.70    0.70        0.60    0.62    0.66
Flower                       0.81    0.80    0.80        0.90    0.94    0.96
Horse                        0.91    0.90    0.88        0.96    0.96    0.96
Mountain and Glacier         0.65    0.63    0.70        0.68    0.73    0.66
Food                         0.81    0.85    0.85        0.75    0.76    0.75
Average Accuracy             0.75    0.78    0.78        0.78    0.82    0.81

The results show that both KNN and SMART-TV perform equally well for most of the classes. Only for the Building, Elephant, and Mountain and Glacier classes is the accuracy under 70%. From these results, the same trend of accuracy can be seen clearly: when KNN produces high accuracy for classes such as Dinosaur, Horse, and Buses, SMART-TV also produces high accuracy. Thus, we conclude that the classification accuracy of SMART-TV is very comparable to that of the KNN classifier, even on a high-dimensional dataset like this image categorization problem.

6.3.3. Classification Time Comparison

The classification time comparison was performed using several synthetic datasets

with different cardinality. These datasets were randomly generated from the original Corel

images. The cardinality of these datasets varies from 100,000 to 500,000. We evaluated the classification time using k = 3, k = 5, and k = 7. However, since the same time trends were found, we only show the classification time for k = 5. This similar time trend


occurred because most of the time for nearest neighbor classifiers is consumed when

searching for the nearest neighbors. Hence, varying k slightly does not make much difference in the overall classification time.

Figure 26 shows that SMART-TV is fast in categorizing images when compared to

KNN. For the largest dataset, containing 500,000 images and 70 feature attributes, SMART-TV with P-trees takes approximately 1.813 seconds on average to classify, while KNN takes about 25.874 seconds. SMART-TV with scan is slower than SMART-TV with P-tree since it scans the functional values of the training images to find the candidates of neighbors. SMART-TV with scan takes about 4.415 seconds on average to classify.

From this observation, we are convinced that the use of P-trees and the range query

algorithm to find the candidates of neighbors can speed up the classification time

significantly.

We summarize the preprocessing time taken by the SMART-TV algorithm on the image sets containing 70 feature attributes in Table 27. Note that the preprocessing phase is a one-time process. The vector mean can be computed very quickly using the P-tree vertical structure. For the largest image set, the time to compute all functional values of the training images is 491.33 seconds, or about 8.2 minutes. The actual time to compute the functional value of a single image is approximately 0.00098 seconds.


[Figure 26 plots the classification time (seconds per sample) against the image set cardinality (100,000 to 500,000 images) for SMART-TV with P-tree, SMART-TV with scan, and KNN.]

Figure 26. Classification time, k = 5, e = 0.01, and MS = 1000.

Table 27. Preprocessing time of SMART-TV algorithm on the Corel dataset.

Dataset (Total Images)   Computing Vector Mean   Computing Functional Values
100,000                  0.02                    98.00
200,000                  0.04                    197.18
300,000                  0.07                    295.93
400,000                  0.10                    302.94
500,000                  0.12                    491.33

(Time in seconds.)

6.4. Conclusion

In this chapter, we have demonstrated through some experimentation that SMART-

TV is an efficient and effective nearest neighbor based classification algorithm that can be

used for image classification. While some algorithms suffer when classifying high-dimensional datasets, we showed in this work that SMART-TV performs well on high-dimensional datasets. In addition, the performance evaluation on general-purpose Corel images shows that, for large image repositories, SMART-TV offers a substantial speed improvement over the naïve KNN classifier, which uses a brute-force approach to find the nearest neighbors. The classification accuracy of SMART-TV for image categorization is also very comparable to that of the KNN classifier.

Image retrieval is an interesting and challenging problem. One of the future

directions of this work is to apply the same approach used in the SMART-TV algorithm to image retrieval. The ability of the SMART-TV algorithm to filter the candidates of neighbors opens a window of opportunity for applying the same idea to image retrieval. We have two

reasons for this.

1. The images that are not really relevant to the query image can be eliminated efficiently

by forming the total variation contour.

2. The pruning technique then prunes the candidates so that the ranking algorithm will be

performed on a small number of images. For the image classification problem, we have demonstrated that our approach works well in finding the right candidates. Hence, we believe that it will also work for image retrieval.


CHAPTER 7. INTEGRATING THE PROPOSED METHOD

INTO DATAMIME™

7.1. Introduction

DataMIME™ is a vertical data-mining prototype based on P-tree technology [48]. It is a client-server system that provides different data-mining functionalities, such as association rule mining, classification, and clustering. The system can be accessed from the following URL: http://midas.cs.ndsu.nodak.edu/~datasurg/datamime. The architecture of DataMIME™ was designed to provide the flexibility for new vertical data-mining algorithms to be added. The data-mining applications, data capturing, and data integration into the P-tree vertical format are executed on the server side, while the interaction with users is facilitated through the system's graphical user interface.

On the server-side, new algorithms are added into the Data Mining Algorithm

(DMA) layer [49]. The Data Capture and Integration (DCI) layer is responsible for data

capturing and integration. On the client-side, DMA and DCI layers are also available. The

client-side DMA gathers the required information for mining, sends it to the server-side

DMA, and waits for a response. The client-side DCI collects datasets and metadata, and

sends them to the server-side DCI.

SMART-TV has been integrated into DataMIME™ as a proof of concept of a new classification algorithm that uses P-tree technology. As a classification algorithm, SMART-TV is grouped together with the currently developed classification algorithms, such as P-KNN, PINE, P-Bayesian, and P-SVM. Similar to the other classification algorithm user interfaces, the SMART-TV user interface allows users to specify the required parameters and to predict the class label of a single sample or of a batch of unclassified samples.


7.2. Server-Side Components

In order to integrate into DataMIME™, all new classification algorithms must implement the methods defined in the PredictionModel class [49]. SMART-TV implements this interface as well. The PredictionModel is an interface containing five

important methods. The first method is the predict method. In this method, specific

implementation of how an algorithm predicts a class label for a given sample is written.

Different algorithms can have different implementations, but the parameters passed through

this method should be the same. The second method is vote_histogram. This method

collects all class label items and their corresponding vote values, and is later used in the

client-side classification interface for vote histogram visualization. The setPTreeSet

method is the simplest method, which only sets a given P-tree set as the training set. The

last two methods, setClassLabel and getClassLabel, are responsible for specifying the class label attribute and returning it, respectively.

The new module is integrated into the server-side through a

Predictor class. All the mining keys (parameters) required are defined

in the MiningKeys class and used in the Predictor class. As for the

SMART-TV module, the following mining keys were defined in the MiningKeys class:

static string EPS_VALUE;

static string MS_VALUE;

We did not define the K_VALUE mining key because this key has already

been defined previously by the other classification algorithms. Figure 27

82

Page 95: Dissertation - North Dakota State Universitycs.ndsu.edu/~perrizo/saturday/Taufik_abidin_Dissertation.doc · Web viewThe second file contains the count of , and the third file contains

shows part of SMART-TV code segments in the Predictor class.

Predictor::Predictor(Properties *pr, const DataPath& dp)
  :
  pSet.load(dp.ptree_dir() + id);
  PTreeInfo pInfo = pSet.getPTreeInfo();
  class_index = pInfo.getAttributeIndex(class_label);
  :
  :
  else if(alg == "SMART-TV"){
    string ks = request->getProperty(MiningKeys::K_VALUE);
    string ms = request->getProperty(MiningKeys::MS_VALUE);
    string ep = request->getProperty(MiningKeys::EPS_VALUE);
    int k = atoi(ks.c_str());
    int mansize = atoi(ms.c_str());
    double eps = atof(ep.c_str());
    pModel = new SmartTVModel(pSet, k, eps, mansize,
                              dp.ptree_dir() + id);
    pModel->setClassLabel(class_label);
  }

Figure 27. Code segments of SMART-TV in the Predictor class.

7.3. Client-Side Components

The communication between the server side and the client side is established by passing the mining keys defined in the algorithm. The mining keys specified on the server side are also defined in the MiningKeys class on the client side, and they must appear with the same names.

7.4. Graphical User Interface

We grouped the SMART-TV algorithm together with the other classification algorithms currently present in the system. Figure 28 shows a snapshot of the SMART-TV


graphical user interface in DataMIME™. The graphical user interface of SMART-TV is mostly similar to those of the other classification algorithms. The only difference is in the input parameters.

Figure 28. Graphical user interface for mining with SMART-TV algorithm.

As shown in the figure, the parameters for the SMART-TV algorithm are the

number of k-nearest neighbors, the epsilon of the contour, and the manageable size

required for pruning. The default value of each parameter is given as a guideline for the users.


Figure 29 shows the graphical user interface of the classification results. The results

are presented in a table format and can be saved to a file by right-clicking on the table.

Figure 29. Graphical user interface showing the classification results.

Figure 30 shows the graphical user interface of the vote histogram. The vote

histogram is displayed for each selected sample in the table. The histogram changes automatically when the user scrolls down the table and selects the next sample. The Performance tab in the panel summarizes the average classification time per sample, and the Vote Table tab shows the vote values of each class for each corresponding sample.

Figure 30. Graphical user interface showing the vote histogram.

Figure 31 shows the graphical user interface of the performance results of the 10%

holdout validation. In this validation model, 10% of the training samples are randomly

selected and considered as the testing set. The rest of the samples are the training set. Note

that SMART-TV has to compute the functional value of each training object first and then classify each of the samples in the testing set. The accuracy of the validation is displayed in the performance panel.

Figure 31. Graphical user interface showing the performance of a validation.


CHAPTER 8. CONCLUSION AND FUTURE WORK

8.1. Conclusion

This dissertation focuses on the scalability of classification algorithms. The work is motivated by the fact that state-of-the-art nearest neighbor based classification algorithms are not scalable to very large datasets.

This dissertation is mainly based on several research projects and papers published

in [2, 3, 4, 5, 6]. In one of the projects, we proposed a vertical approach to compute set

squared distance that measures the total variation of a set of objects about a given object in

large datasets. We discovered that the COUNT operations in the vertical set squared distance are independent of the input value. Thus, executing the COUNT operations in advance and retaining the count values is the best approach to expedite the computation of total

variation. The empirical results have shown that the vertical approach to compute total

variation is extremely fast and scalable to very large datasets as opposed to the horizontal

approach regardless of the specification of the machines.
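The reason the counts can be computed once and reused is that the set squared distance separates into input-independent sums. The horizontal sketch below (plain C++) shows the same algebra; in the vertical approach, the per-dimension sums come from P-tree root counts rather than from this explicit pass over the data.

#include <cstddef>
#include <vector>

// Per-dimension sums of the training set: S1[d] = sum of x_d and
// S2[d] = sum of x_d^2 over all training vectors x.
struct SetSums {
  std::size_t n;
  std::vector<double> S1, S2;
};

SetSums precomputeSums(const std::vector<std::vector<double> >& X) {
  SetSums s;
  s.n = X.size();
  std::size_t dims = X.empty() ? 0 : X[0].size();
  s.S1.assign(dims, 0.0);
  s.S2.assign(dims, 0.0);
  for (const std::vector<double>& x : X)
    for (std::size_t d = 0; d < dims; ++d) {
      s.S1[d] += x[d];
      s.S2[d] += x[d] * x[d];
    }
  return s;
}

// Total variation of the set about a, using only the precomputed sums:
// sum over d of ( S2[d] - 2*a[d]*S1[d] + n*a[d]*a[d] ). The cost no longer
// depends on the number of training objects, so the sums can be reused for
// every input a.
double totalVariationFromSums(const SetSums& s, const std::vector<double>& a) {
  double tv = 0.0;
  for (std::size_t d = 0; d < a.size(); ++d)
    tv += s.S2[d] - 2.0 * a[d] * s.S1[d] + static_cast<double>(s.n) * a[d] * a[d];
  return tv;
}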

We have extended the use of vertical total variation to classification and proposed a

novel nearest neighbor based classification algorithm, called SMART-TV. The proposed

algorithm efficiently filters the candidates of nearest neighbors by forming a total variation

contour around the unclassified object. The objects within the contour are considered as the

superset of nearest neighbors and are identified efficiently using P-tree range query

algorithm without having to scan the total variation values of the training objects one by

one. Because the candidate set may contain neighbors that are not really close to the

unclassified object, a pruning technique that uses dimensional projections was proposed.

The pruning technique uses the basic P-trees of each dimension directly. After pruning, the


k-nearest neighbors are searched from the pruned neighbor set. One observed limitation of the proposed pruning technique is that in high-dimensional datasets, the number of dimensions that needs to be examined is large. In such cases, the overall classification time may increase slightly.

We have conducted extensive performance evaluations in terms of speed,

scalability, and classification accuracy. The results were analyzed thoroughly and can be

summarized as follows:

1. In terms of speed and scalability, we found that the proposed algorithm is very comparable to the other vertical nearest neighbor algorithms, but outperforms the classification algorithms that use a scanning approach.

2. In terms of classification accuracy, the proposed algorithm is very comparable to the KNN classification algorithm. We also found that the proposed algorithm classifies well on datasets with an unbalanced number of objects per class, such as the KDDCUP dataset. Also, the proposed algorithm classifies well on high-dimensional datasets, such as the Corel image and WDBC datasets.

We have also studied and tested the proposed algorithm on an image classification problem. The empirical results on general-purpose Corel images show that the proposed algorithm is fast and scalable for classifying images when compared to the KNN classifier. In the experiments, we again found that the classification accuracy of the proposed algorithm is very comparable to that of the KNN classifier.

In summary, this dissertation addresses the scalability issues in classification. We

conclude that a scalable and highly accurate nearest neighbor classification algorithm has been developed. The proposed algorithm employs the P-tree vertical data representation, one

choice of vertical representation that has been experimentally proven to address the curse


of scalability and to facilitate efficient data mining over large datasets. We are convinced

that the proposed algorithm can be used in many different classification problems

employing large datasets.

8.2. Future Work

As for the vertical total variation algorithm, one of our future directions is to

develop an efficient way to update the COUNT values without having to compute all of

them when some values in the dataset are updated, e.g., some values in particular

dimensions. In the current version of the algorithm, when some values in the set are

changed, regardless of whether the changes occurred in one dimension or many, all count

values must be recomputed. In the cases where the dimensions of the changes can be well

identified, the algorithm should only update the count values in those dimensions. The

other count values should remain unchanged. In large datasets, this strategy can save a lot

of time.
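One way such an update could work is sketched below (plain C++; this is a hypothetical illustration of the envisioned strategy, not part of the current implementation): only the per-dimension sums of the changed dimension are adjusted, while the precomputed counts of all other dimensions remain valid.

#include <cstddef>
#include <vector>

// Hypothetical incremental maintenance of per-dimension sums: when a single
// value changes from oldVal to newVal in dimension d, only S1[d] and S2[d]
// need to be adjusted; all other dimensions keep their precomputed counts.
struct DimensionSums {
  std::vector<double> S1;  // sum of values per dimension
  std::vector<double> S2;  // sum of squared values per dimension
};

void applyUpdate(DimensionSums& sums, std::size_t d, double oldVal, double newVal) {
  sums.S1[d] += newVal - oldVal;
  sums.S2[d] += newVal * newVal - oldVal * oldVal;
}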

One future direction for the proposed algorithm is to devise a strategy for

automatically providing the epsilon parameter based on the inherent features of the training

set and the unclassified object. In the current implementation, a single global epsilon is

used for all unclassified objects. Although the experiments demonstrated that, with a single

global epsilon, good classification accuracy can be achieved, it is even better if the epsilon

can be adjusted based on inherent features of the unclassified object so that the number of

neighbors filtered in the candidate set will be equally balanced for each unclassified object.

In the current implementation, the non-closed k-nearest neighbor set is used to

determine the class label of the unclassified object. The non-closed k-nearest neighbor set

is managed in a heap structure similar to the KNN algorithm used for comparison. The


experimental results show that the accuracy is about the same as that of the classical KNN algorithm. It varies slightly depending upon which tied kth nearest neighbor is picked. For future work, we would like to observe whether considering a closed k-nearest neighbor set in the SMART-TV algorithm will give better classification accuracy.
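The heap-based bookkeeping described above can be sketched as follows (plain C++; KnnHeap is a hypothetical name). A max-heap of size k keeps the k smallest distances seen so far, so whichever tied kth neighbor arrives first survives, which is the source of the small accuracy variation mentioned above.

#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Keep the k nearest neighbors seen so far in a max-heap keyed on distance:
// the heap top is the current kth (worst) neighbor, so a new candidate replaces
// it only when the candidate is strictly closer. Ties at the kth position are
// therefore resolved by arrival order, giving a non-closed k-nearest neighbor set.
class KnnHeap {
 public:
  explicit KnnHeap(std::size_t k) : k_(k) {}

  void offer(std::size_t index, double distance) {
    if (heap_.size() < k_) {
      heap_.push(std::make_pair(distance, index));
    } else if (distance < heap_.top().first) {
      heap_.pop();
      heap_.push(std::make_pair(distance, index));
    }
  }

  // Drains the heap and returns the indexes of the k nearest candidates.
  std::vector<std::size_t> neighbors() {
    std::vector<std::size_t> result;
    while (!heap_.empty()) {
      result.push_back(heap_.top().second);
      heap_.pop();
    }
    return result;
  }

 private:
  std::size_t k_;
  std::priority_queue<std::pair<double, std::size_t> > heap_;  // max-heap by distance
};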

Another good direction for future work is to observe the proposed algorithm in

different popular domains, such as bioinformatics and unstructured data (text). Text is known to have very high dimensionality because each word (term) in a document is considered as one dimension. It is very interesting and challenging to analyze the concepts of the proposed algorithm when applied to those domains. Another great possibility is to adopt the same concept for retrieval problems. The ability of the proposed algorithm to filter the candidates of neighbors makes it possible to quickly determine the

superset of the most relevant objects to the query object. The objects that are not quite

relevant to the query object can be eliminated efficiently. The ranking algorithm is then

performed on the pruned objects in the relevant set. We have demonstrated that our

approach works very well in finding the right candidates of neighbors in classification

problems. Therefore, we are also convinced that the same idea will work for retrieval

problems.

We also see a window of opportunity for the vertical total variation to be used in clustering and outlier detection analysis. The combination of the total variation and dimensional projection contours can perhaps be used to discover special types of clusters in

the space, such as projective and oblique clusters.


REFERENCES

[1] J. Han and M. Kamber, “Data Mining Concepts and Techniques,” 2nd edition, Morgan

Kaufmann Publishers, San Francisco, CA, 2006.

[2] T. Abidin, A. Perera, M. Serazi, and W. Perrizo, “A Vertical Approach to Computing

Set Squared Distance,” International Journal of Computers and their Applications

(IJCA), vol. 13, no. 2, pp. 94-102, June 6, 2006.

[3] T. Abidin and W. Perrizo, “SMART-TV: A Fast and Scalable Nearest Neighbor Based

Classifier for Data Mining,” Proceedings of the 21st ACM Symposium on Applied

Computing (SAC-06), pp. 536-540, Dijon, France, April 23-27, 2006.

[4] T. Abidin, A. Dong, H. Li, and W. Perrizo, “Efficient Image Classification on

Vertically Decomposed Data,” Proceedings of the 1st IEEE International Workshop

on Multimedia Databases and Data Management (MDDM-06), Atlanta, Georgia,

April 8, 2006.

[5] T. Abidin, A. Perera, M. Serazi, and W. Perrizo, “Vertical Set Squared Distance: A

Fast and Scalable Technique to Compute Total Variation in Large Datasets,”

Proceedings of the 20th ISCA International Conference on Computers and Their

Applications (CATA-05), pp. 60-65, New Orleans, Louisiana, March 16-18, 2005.

[6] T. Abidin and W. Perrizo, “An Alternative Arrangement of Symmetric Datasets for

Vertical Clustering Algorithms,” Proceedings of the 21st ISCA International

Conference on Computers and their Applications (CATA-06), Seattle, Washington,

March 23-25, 2006.

[7] Sorcerer Expedition, http://www.sorcerer2expedition.org/version1/HTML/main.htm,

February 6, 2006.


[8] V. Vapnik, “The Nature of Statistical Learning Theory,” Springer-Verlag Publisher,

New York, NY, 1995.

[9] H. Byun and S.W. Lee, “A Survey on Pattern Recognition Applications of Support

Vector Machines,” International Journal of Pattern Recognition and Artificial

Intelligence, 17(3), pp. 459-486, 2003.

[10] O.L. Mangasarian and D.R. Musicant, “Lagrangian Support Vector Machines,” Journal of Machine Learning Research, vol. 1, pp. 161-177, 2001.

[11] T.M. Cover and P.E. Hart, “Nearest Neighbor Pattern Classification,” IEEE Transactions on Information Theory, vol. IT-13, pp. 21-27, 1967.

[12] A.P. Vries, N. Mamoulis, N. Nes, and M. Kersten, “Efficient k-NN Search on

Vertically Decomposed Data,” Proceedings of the ACM SIGMOD, pp. 322-333,

2002.

[13] ANN: A Library for Approximate Nearest Neighbor Searching,

http://www.cs.umd.edu/~mount/ANN/, January 2006.

[14] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation,”

Proceedings of the ACM International Conference on Management of Data

(SIGMOD), Dallas, TX, 2000.

[15] J. Gray, “The Next Database Revolution,” Proceedings of the 10th ACM SIGMOD, pp.

1-4, Paris, 2004.

[16] R. Jin and G. Agrawal, “A Middleware for Developing Parallel Data Mining

Implementations,” Proceedings of the 1st SIAM Conference in Data Mining, April

2001.

[17] M. Serazi, “A Super-Max Data Mining Benchmark by Vertically Structuring Data,”

Ph.D. Thesis, North Dakota State University, Fargo, ND, 2005.


[18] W. Perrizo, “Peano Count Tree Technology Lab Notes,” Technical Report

NDSU-CS-TR-01-1, North Dakota State University, Computer Science Department,

http://www.cs.ndsu.nodak.edu/~perrizo/classes/785/pct.html, 2001.

[19] Data Mining Tutorials, http://www.eruditionhome.com/datamining/overview.html,

March 1, 2006.

[20] R.T. Ng and J. Han, “CLARANS: A Method for Clustering Objects for Spatial Data

Mining,” IEEE Transactions on Knowledge and Data Engineering, vol. 14(5), pp.

1003-1016, September/October 2002.

[21] M. Kantardzic, “Data Mining: Concepts, Models, Methods, and Algorithms,” IEEE

Press, John Willey and Sons, Inc., New Jersey, 2003.

[22] J. R. Quinlan, “Induction on Decision Trees,” Machine Learning, vol. 1. pp. 81-106,

1986.

[23] S. Mitra and T. Acharya, “Data Mining: Multimedia, Soft Computing, and

Bioinformatics,” John Wiley and Sons, Inc., New Jersey, 2003.

[24] K. Alsabti, S. Ranka and V. Singh, “CLOUDS: A Decision Tree Classifier for Large

Datasets,” Proceedings of the ACM SIGKDD, pp. 2-8, 1998.

[25] D. Hand, H. Mannila, and P. Smyth, “Principles of Data Mining,” MIT Press,

Massachusetts, 2001.

[26] M. Khan, Q. Ding, and W. Perrizo, “K-nearest Neighbor Classification on Spatial Data

Stream Using P-trees,” Proceedings of the Pacific-Asia Conference on Knowledge

Discovery and Data Mining, pp. 517-528, Taipei, Taiwan, May 2002.

[27] W. Perrizo, Q. Ding, A. Denton, K. Scott, Q. Ding, and M. Khan, “PINE – Podium

Incremental Neighbor Evaluator for Classifying Spatial Data,” Proceedings of the


ACM Symposium on Applied Computing, pp. 503-508, Melbourne, FL, August

2003.

[28] A. Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, “Vertical Set Squared

Distance Based Clustering without Prior Knowledge of K,” Proceedings of the 14th

International Conference on Intelligent and Adaptive Systems and Software

Engineering (IASSE-05), pp. 72-77, Toronto, Canada, July 20-22, 2005.

[29] R. Syamala, T. Abidin, and W. Perrizo, “Clustering Microarray Data Based on Density

and Shared Nearest Neighbor Measure,” Proceedings of the 21st ISCA International

Conference on Computers and their Applications (CATA-06), Seattle, Washington,

March 23-25, 2006.

[30] I. Rahal, D. Ren, and W. Perrizo, “A Scalable Vertical Model for Mining Association

Rules,” Journal of Information & Knowledge Management (JIKM), vol.3, no. 4,

pp. 317-329, 2004.

[31] D. Ren, B. Wang, and W. Perrizo, “RDF: A Density-Based Outlier Detection Method

using Vertical Data Representation,” Proceedings of the 4th IEEE International

Conference on Data Mining (ICDM-04), pp. 503-506, November 1-4, 2004.

[32] W. Perrizo, M. Serazi, A. Perera, Q. Ding, and V. Malakhov, “P-Tree API Reference

Manual,” Technical Report NDSU-CSOR-TR-04-1, North Dakota State University

Fargo, ND, 2004.

[33] Q. Ding, M. Khan, A. Roy and W. Perrizo, “The P-tree Algebra,” Proceedings of

ACM Symposium on Applied Computing, pp. 426-431, Madrid, Spain, March 2002.

[34] I. Claude, R. Winzenrieth, P. Pouletaut, and J. Boulanger, “Contour Features for

Colposcopic Image Classification by Artificial Neural Networks,” Proceedings of the


16th International Conference on Pattern Recognition (ICPR'02), vol. 1, pp. 10771,

2002.

[35] C. G. Atkeson, A. W., Moore, and S. Schaal, “Locally Weighted Learning,” Artificial

Intelligence Review, vol. 11, no. 1-5, pp. 11-73, 1997.

[36] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz, “UCI Repository of

Machine Learning Databases,” http://www.ics.uci.edu/~mlearn/MLRepository.html,

Irvine, CA, University of California, Department of Information and Computer

Science, 1998.

[37] S. Hettich and S. Bay, “The UCI KDD Archive http://kdd.ics.uci.edu,” University of

California, Irvine, CA, Department of Information and Computer Science, 1999.

[38] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “OPTICS: Ordering Points

to Identify the Clustering Structure,” Proceedings of the ACM SIGMOD, pp. 49-60,

1999.

[39] Iris Dataset, http://www.ailab.si/orange/doc/datasets/iris.htm, July 3, 2005.

[40] R.J. Roiger and M.W. Geatz, “Data Mining: A Tutorial-Based Primer,” Addison

Wesley, NY, 2003.

[41] A. Vailaya, M. Figueiredo, A. K. Jain, and H. Zhang, “Image Classification for

Content-Based Indexing,” IEEE Transactions on Image Processing, vol. 10, no. 1, pp.

117-139, 2001.

[42] O. Chapelle, P. Haffner, and V. Vapnik, “SVM for Histogram Based Image

Classification”, IEEE Transactions on Neural Networks, 10(5), pp. 1055-1064, 1999.

[43] D. Masip and J. Vitrià, “Feature Extraction for Nearest Neighbor Classification:

Application to Gender Recognition,” International Journal of Intelligent Systems, vol.

20 (5), pp. 561-576, 2005.


[44] J. Hafner, H. S. Sawhney, W. Equitz, M. Flickner, and W. Niblack, “Efficient Color

Histogram Indexing for Quadratic Form Distance Functions,” IEEE Transactions

Pattern Analysis Machine Intelligence, vol. 17, pp. 729-736, 1995.

[45] Q. Iqbal and J. Aggarwal, “Combining Structure, Color and Texture for Image

Retrieval: A Performance Evaluation,” Proceedings of the 16th International

Conference on Pattern Recognition, vol. 2, pp. 438-443, Quebec City, Canada, 2002.

[46] J. Z. Wang, J. Li, and G. Wiederhold, “SIMPLIcity: Semantics-Sensitive Integrated

Matching for Picture Libraries,” IEEE Transactions on Pattern Analysis and Machine

Intelligence, vol. 23 (9), pp. 947-963, 2001.

[47] B. S. Manjunath and W. Y. Ma, “Texture Features for Browsing and Retrieval of

Image Data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.

18(8), pp. 837-842, 1996.

[48] M. Serazi, A. Perera, Q. Ding, V. Malakhov, I. Rahal, F. Pan, D. Ren, W. Wu, and W.

Perrizo, “DataMIMETM,” Proceedings of the ACM International Conference on

Management of Data (SIGMOD), pp. 923-924, Paris, France, June 2004.

[49] W. Perrizo, M. Serazi, A. Perera, Q. Ding, and V. Malakhov, “DataMIMETM

Developer Manual,” Technical Report NDSU-CSOR-TR-04-2, North Dakota State

University Fargo, ND, 2004.


APPENDIX

A.1. SmartTVApp Class

/*************************************** * Program: SmartTVApp.cpp * * Author : Taufik Abidin * DataSURG Research Group at CS NDSU * ***************************************/ #include "SmartTV.h"#include "Util.h"#include <time.h>#include <iostream>#include <PTreeInfo.h>#include <PTreeSet.h>#include <MetaFileParser.h>#include <DataFeeder.h>#include <RelationalDataFeeder.h>#include <TiffDataFeeder.h>#include <BasicPt.h>#include <string_token_iterator.h>

using namespace std;

int main(int argc, char **argv){ string dataFolder = argv[1]; string testingData = dataFolder + argv[2]; string metafile = dataFolder + argv[3]; string ptree_set_id= argv[4]; string ks = argv[5]; string ep = argv[6]; string msize= argv[7]; string mode = argv[8]; int k = atoi(ks.c_str()); int mansize = atoi(msize.c_str()); double eps = atof(ep.c_str()); DIR *dir = opendir(ptree_set_id.c_str()); clock_t starttime, endtime; try {

PTreeSet ps;double loadingPtree = 0;cout<<"\nStart SMART-TV classifier 2.0 using PTree..."<<endl;if(!dir){

starttime = clock(); MetaFileParser mParser(metafile); mParser.setDataRoot(dataFolder); DataFeeder *rFeeder; string ext = mParser.getFileExtension(); if(ext=="tiff" || ext=="tif")

98

Page 111: Dissertation - North Dakota State Universitycs.ndsu.edu/~perrizo/saturday/Taufik_abidin_Dissertation.doc · Web viewThe second file contains the count of , and the third file contains

rFeeder = new TiffDataFeeder(mParser); else if(ext=="data")

rFeeder = new RelationalDataFeeder(mParser); ps.feed(rFeeder); ps.store(ptree_set_id); endtime = clock(); double gentime = ((double) abs(endtime - starttime)) / CLOCKS_PER_SEC; cout<<"Generating Ptrees... "<<gentime<<" seconds"<<endl; delete rFeeder; } else { cout<<"Loading Ptree set..."<<endl; starttime = clock(); ps.load(ptree_set_id); endtime = clock(); loadingPtree = ((double) abs(endtime-starttime)) / CLOCKS_PER_SEC; cout<<"Done... "<<loadingPtree<<" seconds"<<endl; }

PTreeInfo pi = ps.getPTreeInfo();cout<<"Number of Ptrees: "<<pi.num_ptrees()<<endl;cout<<"Number of dimension: "<<pi.degree()<<endl;cout<<"Number of cardinality: "<<pi.cardinality()<<endl;SmartTV smarttv(ps,k,eps,mansize);

smarttv.setPTreeSetID(ptree_set_id);smarttv.setClassLabel("class_label");starttime = clock(); smarttv.getTVs(); endtime = clock();double loadingTime = ((double) abs(endtime - starttime)) / CLOCKS_PER_SEC;if(mode=="-l"){ // learning using testing set

vector <string> classDom = smarttv.getClassDomain();double TP[classDom.size()];double FP[classDom.size()];double PX[classDom.size()]; for(int i=0;i<classDom.size();i++){

TP[i] = 0; FP[i] = 0; PX[i] = 0; } int N = 0; double totalTime = 0.0; ifstream teststream(testingData.c_str()); while(!teststream.eof()){ string line = ""; string actualClass; string predictedClass = ""; getline(teststream, line); if(line=="") continue; string_token_iterator tok(line, ", "), end; vector<string> v(tok, end); N++; Tuple a = pi.to_tuple(v); int actualClassIndex = pi.degree()-1; int type = (a.get(actualClassIndex))->type();

99

Page 112: Dissertation - North Dakota State Universitycs.ndsu.edu/~perrizo/saturday/Taufik_abidin_Dissertation.doc · Web viewThe second file contains the count of , and the third file contains

      starttime = clock();
      Item *predictedItem = smarttv.predict(a);
      if(type==Type::UNSIGNED_INT){
        UsignIntItem *va =
          dynamic_cast<UsignIntItem*>(a.get(actualClassIndex));
        actualClass = to_string(va->value());
        UsignIntItem *pc =
          dynamic_cast<UsignIntItem*>(predictedItem);
        predictedClass = to_string(pc->value());
      }
      else if(type==Type::SING_CAT){
        SingCatItem *va =
          dynamic_cast<SingCatItem*>(a.get(actualClassIndex));
        actualClass = va->value();
        SingCatItem *pc =
          dynamic_cast<SingCatItem*>(predictedItem);
        predictedClass = to_string(pc->value());
      }
      endtime = clock();
      double predictedTime = ((double) (endtime - starttime)) / CLOCKS_PER_SEC;
      totalTime += predictedTime;

      int pos = -1;
      int getPos = 0;
      int updatedPX = 0;
      for(int i=0;i<classDom.size();i++){
        if((!updatedPX)&&(classDom.at(i)==actualClass)){
          PX[i]++;
          updatedPX = 1;
        }
        if((!getPos)&&(classDom.at(i)==predictedClass)){
          pos = i;
          getPos = 1;
        }
      }
      if(actualClass == predictedClass)
        TP[pos]++;
      else
        FP[pos]++;
    }

    cout<<"\nk: "<<k<<endl;
    cout<<"eps: "<<eps<<endl;
    cout<<"ManSize: "<<mansize<<endl;
    cout<<"Testset: "<<testingData<<endl;
    cout<<"Total new samples: "<<N<<endl;
    cout<<"Total prediction time: "<<totalTime<<" seconds"<<endl;
    cout<<"Time/sample: "<<totalTime/N<<" seconds"<<endl;
    cout<<"Time for loading PTrees, RCs, and TVs: "
        <<loadingPtree + loadingTime<<" seconds"<<endl;

    for(int i=0;i<classDom.size();i++){
      cout<<"\nClass: "<<classDom.at(i)<<endl;
      cout<<"|X| = "<<PX[i]<<endl;
      cout<<"TP = "<<TP[i]<<endl;
      cout<<"FP = "<<FP[i]<<endl;
      double R = TP[i]/PX[i];
      double P = TP[i]/(TP[i]+FP[i]);
      cout<<"Recall = "<<R<<endl;
      cout<<"Precision = "<<P<<endl;
      cout<<"F = "<<(2*P*R)/(P+R)<<endl;

    }
    teststream.close();
  }
  else{ // classification mode: write predictions to a result file
    int N = 0;
    double totalTime = 0.0;
    ifstream teststream(testingData.c_str());
    string result = argv[2];
    ofstream outpstream((result + ".result").c_str());
    while(!teststream.eof()){
      string line = "";
      string predictedClass = "";
      getline(teststream, line);
      if(line=="") continue;
      string_token_iterator tok(line, ", "), end;
      vector<string> v(tok, end);
      N++;
      Tuple a = pi.to_tuple(v);
      int actualClassIndex = pi.degree()-1;
      int type = (a.get(actualClassIndex))->type();

      starttime = clock();
      Item *predictedItem = smarttv.predict(a);
      if(type==Type::UNSIGNED_INT){
        UsignIntItem *pc =
          dynamic_cast<UsignIntItem*>(predictedItem);
        predictedClass = to_string(pc->value());
      }
      else if(type==Type::SING_CAT){
        SingCatItem *pc =
          dynamic_cast<SingCatItem*>(predictedItem);
        predictedClass = to_string(pc->value());
      }
      outpstream<<a<<" "<<predictedClass<<endl;
      endtime = clock();
      double predictedTime = ((double) (endtime - starttime)) / CLOCKS_PER_SEC;
      totalTime += predictedTime;
    }

    outpstream<<"\nk: "<<k<<endl;
    outpstream<<"eps: "<<eps<<endl;
    outpstream<<"ManSize: "<<mansize<<endl;
    outpstream<<"Total new samples: "<<N<<endl;
    outpstream<<"Prediction time: "<<totalTime<<" seconds"<<endl;
    outpstream<<"Time/sample: "<<totalTime/N<<" seconds"<<endl;
    outpstream<<"Time for loading PTrees, RCs, and TVs: "
              <<loadingPtree + loadingTime<<" seconds"<<endl;
    cout<<"Done..."<<endl;
    teststream.close();
    outpstream.close();
  }
  } // end try
  catch(const exception& ex){
    cout<<ex.what()<<endl;
  }
  return (EXIT_SUCCESS);
}

A.2. SmartTV Header Class

/***************************************
 * Program: SmartTV.h
 * Author : Taufik Abidin
 *          DataSURG Research Group at CS NDSU
 ***************************************/
#ifndef _SMART_TV_H_
#define _SMART_TV_H_

#include <vector>
#include <PTreeSet.h>
#include <Tuple.h>
#include <BasicPt.h>
#include <boost/dynamic_bitset.hpp>
#include "PredictionModel.h"

typedef boost::dynamic_bitset<> boost_bitset;

typedef struct heap {
  size_t idx;
  double val;
} Heap;

class SmartTV : public PredictionModel{
  public:
    SmartTV(PTreeSet& ps, int ks, double eps, int mansize);

    // must be implemented when inheriting from PredictionModel
    // (PredictionModel is the prediction-model interface of DataMIME)
    virtual Item* predict(const Tuple& t) throw (FailException);
    virtual vector<pair<Item*, double> > vote_histogram(const Tuple& t)
      throw (FailException);
    virtual void setPTreeSet(const PTreeSet& pset);
    virtual void setClassLabel(const string& cl) throw (UnknownAttribute);
    virtual string getClassLabel()const;
    vector<string> getClassDomain();
    void setPTreeSetID(const string& ptree_set_id)
      {ptreeSetID = ptree_set_id;}
    void getTVs() throw (FailException);

  private:
    int k;
    const int MS;
    double epsilon;
    double lenAminusMean;
    int classType;
    int numdimension;
    string classLabel;
    size_t classIndex;
    string ptreeSetID;
    vector<double> mean;

    vector<size_type> partition;
    vector<pair<Item*, double> > votes;
    PTreeSet& ps;
    PTreeSet pstv;
    PTreeInfo pi;
    PTreeInfo pitv;

    Item* winner();
    double getMean(int i);
    double f(const vector<double>& a);
    void vote(const Heap nearestNeighbors[], int k) throw (FailException);
    vector<double> tupleToVector(const Tuple& t);
    double L2(const vector<double>& x, const vector<double>& a);
    void createHeap(Heap *heap, size_t i, size_t newidx, double newval);
    void adjustHeap(Heap *heap, size_t pos, size_t heapsize);
    BasicPt getTVContourMask(const double& fb, const double& fc);
    BasicPt getXiCandidateMask(const int& i, const double& val,
                               const double& eps);
    void generatePTreeTV() throw (FailException);
    void createTVMetadata(const double& max) throw (FailException);
    vector<double> vectorDifferent(const vector<double>& x,
                                   const vector<double>& y);
    vector<vector<double> > getRingVectors(double& eps,
                                           const vector<double>& aMinusMean,
                                           const vector<double>& a);
};
#endif
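For orientation, the following short driver is a minimal sketch of how this interface is used; it mirrors the application code above. The P-tree set identifier "irisPtrees", the sample values, and the parameters k = 5, eps = 0.1, and ManSize = 100 are hypothetical placeholders, and the value placed in the class-label slot is not used by predict().

#include <iostream>
#include <vector>
#include <string>
#include "SmartTV.h"   // brings in PTreeSet.h and Tuple.h
using namespace std;

int main(){
  PTreeSet ps;
  ps.load("irisPtrees");                  // a previously stored P-tree set (placeholder id)
  PTreeInfo pi = ps.getPTreeInfo();

  SmartTV smarttv(ps, 5, 0.1, 100);       // k, eps, ManSize (placeholder values)
  smarttv.setPTreeSetID("irisPtrees");
  smarttv.setClassLabel("class_label");
  smarttv.getTVs();                       // compute or load the TV P-trees

  vector<string> v;                       // one unclassified sample,
  v.push_back("5.1"); v.push_back("3.5"); // attribute values as strings
  v.push_back("1.4"); v.push_back("0.2");
  v.push_back("unknown");                 // class-label slot (not used by predict)
  Tuple a = pi.to_tuple(v);
  Item *predicted = smarttv.predict(a);   // cast to SingCatItem or UsignIntItem
                                          // to read the value, as in the
                                          // application code
  return 0;
}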

A.3. SmartTV Class

/***************************************
 * Program: SmartTV.cpp
 * Author : Taufik Abidin
 *          DataSURG Research Group at CS NDSU
 ***************************************/
#include "SmartTV.h"
#include "Util.h"
#include "SingCatAttributeInfo.h"
#include "UsignIntAttributeInfo.h"
#include "SignIntAttributeInfo.h"
#include "UsignDoubleAttributeInfo.h"
#include "SignDoubleAttributeInfo.h"
#include <PTreeInfo.h>
#include <boost/dynamic_bitset.hpp>
#include <iostream>
#include <fstream>
#include <dirent.h>   // DIR/opendir() used in getTVs() and generatePTreeTV()
#include <time.h>
#include <algorithm>

/**
 * Constructor
 */
SmartTV::SmartTV(PTreeSet& ps, int ks, double eps, int mansize):
  ps(ps), k(ks), MS(mansize){
  pi = ps.getPTreeInfo();
  numdimension = pi.degree()-1;
  size_type row = 0;
  partition.resize(pi.cardinality());
  for(vector<size_type>::iterator it = partition.begin();
      it!=partition.end(); it++, row++)
    *it = row;

  epsilon = (eps < 0 ? 0.01 : eps);
  clock_t starttime, endtime;
  starttime = clock();
  for(int i=0; i<numdimension; i++) // get the mean vector
    mean.push_back(getMean(i));
  endtime = clock();
  cout<<"Done computing mean..., "
      <<((double)(endtime-starttime))/CLOCKS_PER_SEC<<" secs"<<endl;
}

/**
 * Set the PTree set
 */
void SmartTV::setPTreeSet(const PTreeSet& pset){
  ps = pset;
}

/**
 * Get the class domain
 */
vector<string> SmartTV::getClassDomain(){
  vector<string> classDomain;
  if(classType==Type::SING_CAT){
    SingCatAttributeInfo *sInfo =
      dynamic_cast<SingCatAttributeInfo*>(&(pi.getAttributeInfo(classLabel)));
    classDomain = sInfo->getDomain();
  }
  else
    cout<<"not a supported class label type in getClassDomain"<<endl;
  return classDomain;
}

/**
 * Get the total variations of the training objects
 */
void SmartTV::getTVs() throw (FailException){
  try{
    clock_t starttime, endtime;
    string tv_ptree_set_id = ptreeSetID + "/ptrees_hdtv";
    DIR *dir = opendir(tv_ptree_set_id.c_str());
    if(!dir){
      cout<<"Computing HDTVs..."<<endl;
      string tvs = ptreeSetID + "/hdtvs.data";
      ofstream oftvs(tvs.c_str());
      double max = 0.0;
      starttime = clock();
      for(vector<size_type>::iterator it = partition.begin();
          it!=partition.end(); it++){
        Tuple x = ps.getTuple(*it);
        double fx = f(tupleToVector(x));
        oftvs<<fx<<endl;
        if(max < fx)
          max = fx;
      }
      endtime = clock();
      oftvs.close();
      cout<<"Done..., "
          <<((double)(endtime-starttime))/CLOCKS_PER_SEC<<" seconds"<<endl;
      createTVMetadata(max);
      generatePTreeTV();
    }
    else{
      starttime = clock();
      pstv.load(tv_ptree_set_id);
      endtime = clock();
      cout<<"Done..., "
          <<((double)(endtime-starttime))/CLOCKS_PER_SEC<<" seconds"<<endl;
    }
    pitv = pstv.getPTreeInfo();
  }
  catch(const exception& e){
    throw FailException(e.what());
  }
}

/**
 * Create metadata for the TV PTrees
 */
void SmartTV::createTVMetadata(const double& max) throw (FailException){
  try{
    string metafile = ptreeSetID + "/metahdtv.xml";
    ofstream tvmeta(metafile.c_str());
    tvmeta<<"<?xml version=\"1.0\" encoding=\"UTF-8\"?>"<<endl;
    tvmeta<<"<datasetinfo>"<<endl;
    tvmeta<<" <description>"<<endl;
    tvmeta<<" <title>"<<endl;
    tvmeta<<" <line>Metadata TVs</line>"<<endl;
    tvmeta<<" </title>"<<endl;
    tvmeta<<" </description>"<<endl;
    tvmeta<<" <cardinality>"<<pi.cardinality()<<"</cardinality>"<<endl;
    tvmeta<<" <delimiter>comma</delimiter>"<<endl;
    tvmeta<<" <data_file name=\"hdtvs.data\">"<<endl;
    tvmeta<<" <attribute>"<<endl;
    tvmeta<<" <name>tv</name>"<<endl;
    tvmeta<<" <type>double</type>"<<endl;
    tvmeta<<" <domain>"<<endl;
    tvmeta<<" <lower>0</lower>"<<endl;
    tvmeta<<" <upper>"<<max<<"</upper>"<<endl;
    tvmeta<<" </domain>"<<endl;
    tvmeta<<" <precision>3</precision>"<<endl;
    tvmeta<<" </attribute>"<<endl;
    tvmeta<<" </data_file>"<<endl;
    tvmeta<<"</datasetinfo>"<<endl;
  }
  catch(const exception& e){
    throw FailException(e.what());
  }
}

/**
 * Generate the TV PTrees
 */
void SmartTV::generatePTreeTV() throw (FailException){
  try{
    clock_t starttime, endtime;
    string tv_ptree_set_id = ptreeSetID + "/ptrees_hdtv";
    string metafile = ptreeSetID + "/metahdtv.xml"; // written by createTVMetadata()
    DIR *dir = opendir(tv_ptree_set_id.c_str());
    if(dir){
      cout<<"Directory already exists..."<<endl;
      closedir(dir);
      throw IdExistsException("Fail to create dir...");

    }
    else{
      starttime = clock();
      MetaFileParser mParser(metafile);
      mParser.setDataRoot(ptreeSetID + "/");
      DataFeeder *rFeeder;
      rFeeder = new RelationalDataFeeder(mParser);
      cout<<"Feeding TV Ptree set..."<<endl;
      pstv.feed(rFeeder);
      cout<<"Storing TV Ptree set..."<<endl;
      pstv.store(tv_ptree_set_id);
      endtime = clock();
      double gentime = ((double) abs(endtime - starttime)) / CLOCKS_PER_SEC;
      cout<<"Generating TV Ptrees, done... "<<gentime<<" seconds"<<endl;
      delete rFeeder;
    }
  }
  catch(const exception& e){
    throw FailException(e.what());
  }
}

/**
 * f(a), the functional used to index the TV contours
 */
double SmartTV::f(const vector<double>& a){
  double len = 0;
  for(int i=0; i<numdimension; i++)
    len += pow((a[i] - mean[i]),2);
  return log(pi.cardinality() * len + 1);
}
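As implemented, f is a log-scaled measure of how far a vector lies from the training-set mean: with N = pi.cardinality() (the number of training objects) and m the mean vector computed in the constructor,

    f(a) = log( N * SUM_i (a[i] - m[i])^2 + 1 ) = log( N * ||a - m||^2 + 1 ).

Since f is monotone in ||a - m||, the interval [f(b), f(c)] that predict() computes from the two ring vectors b and c selects the training objects whose distance from the mean lies within eps of the new sample's own distance from the mean, and that membership test is answered from the precomputed TV P-trees rather than from the raw training data.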

/**
 * Compute the vector difference x - y (and cache its Euclidean length
 * in lenAminusMean)
 */
vector<double> SmartTV::vectorDifferent(const vector<double>& x,
                                        const vector<double>& y){
  double sum = 0;
  vector<double> z(numdimension);
  for(int i=0; i<numdimension; i++){
    z[i] = x[i]-y[i];
    sum += z[i]*z[i];
  }
  lenAminusMean = sqrt(sum);
  return z;
}

/**
 * Get the vectors at the lower ring and upper ring of a
 */
vector<vector<double> > SmartTV::getRingVectors(double& eps,
                                                const vector<double>& aMinusMean,
                                                const vector<double>& a){
  double upper;
  vector<double> b(numdimension);
  vector<double> c(numdimension);
  vector<vector<double> > z(2,b);
  if(((pi.getAttributeInfo(0)).type()==Type::UNSIGNED_DOUBLE)||
     ((pi.getAttributeInfo(0)).type()==Type::SIGNED_DOUBLE)){
    DoubleAttributeInfo* doubleAttInfo =
      dynamic_cast<DoubleAttributeInfo*>(&(pi.getAttributeInfo(0)));
    upper = doubleAttInfo->upper();
  }
  else{
    IntAttributeInfo* intAttInfo =
      dynamic_cast<IntAttributeInfo*>(&(pi.getAttributeInfo(0)));
    upper = intAttInfo->upper();
  }
  double min = (1 - eps/lenAminusMean);
  double plus = (1 + eps/lenAminusMean);
  for(int i=0; i<numdimension; i++){
    b[i] = min * aMinusMean[i] + mean[i];
    c[i] = plus * aMinusMean[i] + mean[i];
  }
  z[0] = b;
  z[1] = c;
  return z;
}

/**
 * SMART-TV prediction
 */
Item *SmartTV::predict(const Tuple& a) throw (FailException) {
  try {
    int pos = 0;
    int heapSize;
    int paramK = k;
    votes.clear();
    int candidates = 0;
    double eps = epsilon;
    vector<size_type> contourPoints;
    vector<vector<double> > ringVectors;
    vector<double> newSample = tupleToVector(a);
    vector<double> aMinusMean = vectorDifferent(newSample, mean);
    eps = (eps > lenAminusMean ? lenAminusMean : eps);

    // find the candidates within the TV contour, then prune by
    // dimensional projections
    while(!candidates){
      ringVectors = getRingVectors(eps, aMinusMean, newSample);
      double fb = f(ringVectors[0]);
      double fc = f(ringVectors[1]);

      BasicPt pn = getTVContourMask(fb, fc);
      contourPoints = pstv.getAllTupleIndices(pn);
      int candSize = contourPoints.size();
      cout<<"TV candidates: "<<candSize<<endl;

      if(candSize>MS){
        pos = 0;
        vector<pair<unsigned,int> > maxCand;
        pn = pn(ps.createDerivedPTree(contourPoints));
        while(pos<numdimension){
          BasicPt pp = getXiCandidateMask(pos, newSample[pos], eps);
          unsigned count = ps.count(pn & pp);
          if(count!=0)
            maxCand.push_back(pair<unsigned,int>(count,pos));
          pos++;
        }
        if(maxCand.size()>0){
          sort(maxCand.begin(),maxCand.end());
          vector<pair<unsigned, int> >::reverse_iterator it = maxCand.rbegin();
          for(; it!=maxCand.rend();it++){
            pos = (*it).second;
            BasicPt pp(getXiCandidateMask(pos, newSample[pos], eps));
            unsigned count = ps.count(pn & pp);
            if(count!=0){
              pn = pn & pp;
              if(count<MS)
                break;
            }
          }
        }
        contourPoints = ps.getAllTupleIndices(pn);
        cout<<"Pruned: "<<contourPoints.size()<<endl;
      }
      candidates = contourPoints.size();
      if(!candidates)
        eps *= 2; // no candidate found: widen the contour and retry
    }

    if(paramK==0)
      heapSize = contourPoints.size();
    else{
      if(contourPoints.size() < paramK)
        paramK = contourPoints.size();
      heapSize = paramK;
    }

    // measure the distances and keep the k closest in a max-heap
    Heap nearestNeighbors[heapSize];
    for(int i=0;i<contourPoints.size();i++){
      Tuple x = ps.getTuple(contourPoints[i]);
      double distance = L2(tupleToVector(x), newSample);
      if(i<heapSize)
        createHeap(nearestNeighbors,i,contourPoints[i],distance);
      else{
        if(distance < nearestNeighbors[0].val){
          nearestNeighbors[0].val = nearestNeighbors[heapSize-1].val;
          nearestNeighbors[0].idx = nearestNeighbors[heapSize-1].idx;
          nearestNeighbors[heapSize-1].val = distance;
          nearestNeighbors[heapSize-1].idx = contourPoints[i];
          for(int t=(heapSize/2)-1;t>=0;t--)
            adjustHeap(nearestNeighbors,t,heapSize);
        }
      }
    }
    epsilon = eps;
    vote(nearestNeighbors,heapSize);
  }
  catch(const exception& e){
    cout<<e.what()<<endl;
  }
  return winner();
}

/**
 * Get the TV contour mask, given fb (lower than fa) and fc (greater than fa)
 */
BasicPt SmartTV::getTVContourMask(const double& fb, const double& fc){
  BasicPt pu(pitv);
  BasicPt pl(pitv);
  double lowerBound = fb;
  double upperBound = fc;

  // make sure that the contour range is not out of bound
  DoubleAttributeInfo* doubleAttInfo =
    dynamic_cast<DoubleAttributeInfo*>(&(pitv.getAttributeInfo(0)));
  if(upperBound > doubleAttInfo->upper())
    upperBound = doubleAttInfo->upper();
  if(lowerBound < doubleAttInfo->lower())
    lowerBound = doubleAttInfo->lower();
  if(lowerBound > doubleAttInfo->upper())
    lowerBound = doubleAttInfo->upper();

  // get the mask of all values that are less than upperBound
  boost_bitset bits = doubleAttInfo->encode(new UsignDoubleItem(upperBound));
  int nonZeroBit = 0;
  if(bits.any()){
    while(!bits[nonZeroBit])
      nonZeroBit++;
  }
  int first = true;
  for(int j=nonZeroBit; j<pitv[0].binaryLength(); j++){
    if(first){
      pu = pu(0,j);
      first = false;
      continue;
    }

    if(bits[j]==true)
      pu = pu & pu(0,j);
    else
      pu = pu | pu(0,j);
  }

  // get the mask of all values that are greater than lowerBound
  bits = doubleAttInfo->encode(new UsignDoubleItem(lowerBound));
  nonZeroBit = 0;
  if(bits.any()){
    while(!bits[nonZeroBit])
      nonZeroBit++;
  }
  first = true;
  for(int j=nonZeroBit; j<pitv[0].binaryLength(); j++){
    if(first){
      pl = pl(0,j);
      first = false;
      continue;
    }
    if(bits[j]==true)
      pl = pl & pl(0,j);
    else
      pl = pl | pl(0,j);
  }
  return pl & !pu;
}

/**
 * Get the neighborhood mask of the dimensional projection
 */
BasicPt SmartTV::getXiCandidateMask(const int& i, const double& val,
                                    const double& eps){
  BasicPt pu(pi);
  BasicPt pl(pi);
  boost_bitset bits;
  IntAttributeInfo* intAttInfo;
  DoubleAttributeInfo* doubleAttInfo;
  double upperBound;
  double lowerBound;
  int intUpperBound;
  int intLowerBound;

  // make sure that the contour range is not out of bound
  if(((pi.getAttributeInfo(i)).type()==Type::UNSIGNED_DOUBLE)||
     ((pi.getAttributeInfo(i)).type()==Type::SIGNED_DOUBLE)){
    doubleAttInfo =
      dynamic_cast<DoubleAttributeInfo*>(&(pi.getAttributeInfo(i)));
    upperBound = val + eps;
    lowerBound = val - eps;
    if(upperBound > doubleAttInfo->upper())
      upperBound = doubleAttInfo->upper();
    if(lowerBound < doubleAttInfo->lower())
      lowerBound = doubleAttInfo->lower();
    if(lowerBound > doubleAttInfo->upper())
      lowerBound = doubleAttInfo->upper();
  }
  else{
    intAttInfo =
      dynamic_cast<IntAttributeInfo*>(&(pi.getAttributeInfo(i)));
    intUpperBound = (int)(val + eps);
    intLowerBound = (int)(val - eps);
    if(intUpperBound > intAttInfo->upper())
      intUpperBound = intAttInfo->upper();
    if(intLowerBound < intAttInfo->lower())
      intLowerBound = intAttInfo->lower();
    if(intLowerBound > intAttInfo->upper())
      intLowerBound = intAttInfo->upper();
  }

  int nonZeroBit = 0;
  if(((pi.getAttributeInfo(i)).type()==Type::UNSIGNED_DOUBLE)||
     ((pi.getAttributeInfo(i)).type()==Type::SIGNED_DOUBLE))
    bits = doubleAttInfo->encode(new UsignDoubleItem(upperBound));
  else
    bits = intAttInfo->encode(new UsignIntItem(intUpperBound));
  if(bits.any()){
    while(!bits[nonZeroBit])
      nonZeroBit++;
  }
  int first = true;
  for(int j=nonZeroBit; j<pi[i].binaryLength(); j++){
    if(first){
      pu = pu(i,j);
      first = false;
      continue;
    }
    if(bits[j]==true)
      pu = pu & pu(i,j);
    else
      pu = pu | pu(i,j);
  }

  nonZeroBit = 0;
  if(((pi.getAttributeInfo(i)).type()==Type::UNSIGNED_DOUBLE)||
     ((pi.getAttributeInfo(i)).type()==Type::SIGNED_DOUBLE))
    bits = doubleAttInfo->encode(new UsignDoubleItem(lowerBound));
  else
    bits = intAttInfo->encode(new UsignIntItem(intLowerBound));
  if(bits.any()){
    while(!bits[nonZeroBit])
      nonZeroBit++;
  }
  first = true;
  for(int j=nonZeroBit; j<pi[i].binaryLength(); j++){
    if(first){
      pl = pl(i,j);
      first = false;
      continue;
    }
    if(bits[j]==true)
      pl = pl & pl(i,j);
    else
      pl = pl | pl(i,j);
  }
  return pl & !pu;
}

/**
 * Vote for the winner (distance-weighted voting by the k nearest neighbors)
 */
void SmartTV::vote(const Heap nearestNeighbors[], int k) throw (FailException){
  try{
    if(classType==Type::SING_CAT){
      SingCatAttributeInfo *sInfo =
        dynamic_cast<SingCatAttributeInfo*>(&(pi.getAttributeInfo(classLabel)));
      vector<string> classDom = sInfo->getDomain();
      double vote[classDom.size()];
      for(int i=0;i<classDom.size();i++)
        vote[i] = 0.0;
      for(int i=0;i<k;i++){
        Tuple closestNeighbor = ps.getTuple(nearestNeighbors[i].idx);
        SingCatItem *vclosestNeighbor =
          dynamic_cast<SingCatItem*>(closestNeighbor.get(classIndex));
        for(int j=0;j<classDom.size();j++){
          if(classDom[j]==vclosestNeighbor->value()){
            vote[j] += exp(-1 * (nearestNeighbors[i].val) *
                           (nearestNeighbors[i].val));
            break;
          }
        }
      }
      Item *classDomItem;
      for(int i=0;i<classDom.size();i++){
        classDomItem = new SingCatItem(classDom[i]);
        votes.push_back(pair<Item *, double>(classDomItem,vote[i]));
      }
    }
    else
      throw FailException("not supported type in voting");
  }
  catch(const exception& e){
    throw FailException(e.what());
  }
}
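In the voting above, each of the k nearest neighbors contributes a weight that decays with its Euclidean distance d_i from the unclassified sample:

    vote(c) = SUM over neighbors x_i with class c of exp( -d_i^2 )

so closer neighbors influence the decision more strongly than distant ones, and winner() then returns the class with the largest accumulated vote.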

/**
 * Get the winner
 */
Item* SmartTV::winner(){
  pair<Item*, double> max;
  vector<pair<Item*, double> >::const_iterator it = votes.begin();
  for( ; it != votes.end(); ++it){
    if((*it).second > max.second)
      max = *it;
  }
  return max.first;
}

/**
 * Calculate the vote histogram and return it
 */
vector<pair<Item*, double> > SmartTV::vote_histogram(const Tuple& t)
  throw (FailException){
  try{
    predict(t);
  }
  catch(const exception& e){
    throw FailException(e.what());
  }
  return votes;
}

/**
 * Convert a tuple to a vector of doubles
 */
vector<double> SmartTV::tupleToVector(const Tuple& t){
  vector<boost_bitset> bt = pi.encodeS(t);
  vector<double> tupleVector(pi.degree()-1); // exclude the class label
  for(int i=0; i<numdimension; i++){
    double tval = 0;
    for(int j=pi[i].binaryLength()-1; j>=0; j--)
      tval += pow(2.0,j) * bt[i][j];
    if(((pi.getAttributeInfo(i)).type()==Type::UNSIGNED_DOUBLE)||
       ((pi.getAttributeInfo(i)).type()==Type::SIGNED_DOUBLE)){
      DoubleAttributeInfo *doubleAttInfo =
        dynamic_cast<DoubleAttributeInfo*>(&(pi.getAttributeInfo(i)));
      for(int k=0;k<doubleAttInfo->precision();k++)
        tval *= 0.1;
    }
    tupleVector[i] = tval;
  }
  return tupleVector;
}

/**
 * Set the class label
 */
void SmartTV::setClassLabel(const string& cl) throw (UnknownAttribute){
  classLabel = cl;
  classIndex = pi.getAttributeIndex(cl);
  classType = (pi.getAttributeInfo(cl)).type();
}

/**
 * Get the class label
 */
string SmartTV::getClassLabel()const{
  return classLabel;
}

/**
 * Compute the Euclidean distance L2(x,a) of two vectors x and a
 */
double SmartTV::L2(const vector<double>& x, const vector<double>& a){
  double sum = 0.0;
  for(int i=0; i<numdimension; i++)
    sum += pow((x[i] - a[i]),2);
  return sqrt(sum);
}

/**
 * Create a maximum heap: the maximum value is at the root of the heap,
 * and the heap holds the smallest distances seen so far.
 */
void SmartTV::createHeap(Heap *heap, size_t i, size_t newidx, double newval){
  heap[i].val = newval;
  heap[i].idx = newidx;
  while((i>0) && (heap[(i-1)/2].val < newval)){
    heap[i].val = heap[(i-1)/2].val;
    heap[i].idx = heap[(i-1)/2].idx;
    i = (i-1)/2;
  }
  heap[i].val = newval;
  heap[i].idx = newidx;
}

/**
 * Adjust (sift down) the maximum heap
 */
void SmartTV::adjustHeap(Heap *heap, size_t pos, size_t heapsize){
  double val = heap[pos].val;
  size_t idx = heap[pos].idx;
  int i = 2*(pos+1)-1;
  while(i<=heapsize-1){
    if((i<heapsize-1) && (heap[i].val < heap[i+1].val)){
      i++;
    }
    if(val >= heap[i].val)
      break;
    heap[(i-1)/2].val = heap[i].val;
    heap[(i-1)/2].idx = heap[i].idx;
    i = 2*(i+1)-1;
  }
  heap[(i-1)/2].val = val;
  heap[(i-1)/2].idx = idx;
}

/**
 * Compute the i-th component of the mean vector from the bit-slice counts
 */
double SmartTV::getMean(int i){
  BasicPt p(pi);
  double sum = 0.0;
  int attType = (pi.getAttributeInfo(i)).type();
  if((attType==Type::SIGNED_DOUBLE)||(attType==Type::SIGNED_INT)){
    for(int j=pi[i].binaryLength()-2; j>=0; j--){
      sum += pow(2.0,j) * ps.count(p(i,j) & !p(i,pi[i].binaryLength()-1))
           - pow(2.0,j) * ps.count(p(i,j) & p(i,pi[i].binaryLength()-1));
    }
  }
  else{
    for(int j=pi[i].binaryLength()-1; j>=0; j--)
      sum += pow(2.0,j) * ps.count(p(i,j));
  }
  if((attType==Type::SIGNED_DOUBLE)||(attType==Type::UNSIGNED_DOUBLE)){
    DoubleAttributeInfo *doubleAttInfo =
      dynamic_cast<DoubleAttributeInfo*>(&(pi.getAttributeInfo(i)));
    for(int t=0;t<doubleAttInfo->precision();t++)
      sum = sum * 0.1;
  }
  return sum / pi.cardinality();
}
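As the code above shows, the mean is obtained purely from counts over the basic P-trees: writing P(i,j) for the bit slice at position j of attribute i and N for the cardinality of the training set, an unsigned attribute's mean is

    mean_i = (1/N) * SUM_j 2^j * count( P(i,j) ),

where count(.) is the number of 1-bits in the slice. For a signed attribute the sign-bit slice splits each count into a positive and a negative part, and for double attributes the sum is rescaled by the attribute's decimal precision. No training tuple is scanned to compute the mean vector.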


A.4. Makefile

#
# Makefile
# Author: Taufik Abidin
#         DataSURG Research Group at CS NDSU
#

PTREE_LIB  = $(PTREE_HOME)/lib/ptreeAPI2.a
PTREE_INC  = $(PTREE_HOME)/include
BOOST_HOME = $(HOME)/usr/boost_1_29_0
XML_FLAGS  = `xml2-config --cflags`
XML_LIBS   = `xml2-config --libs`

CC      = c++
C_FLAGS = -g -c

SMART_TV = SmartTV.o

all: SmartTVApp

#
# SMART-TV object
#
SmartTV.o: SmartTV.cpp SmartTV.h
	$(CC) -pg $(C_FLAGS) $*.cpp -o $@ -I$(PTREE_INC) -I$(BOOST_HOME) $(XML_FLAGS)

#
# SMART-TV Application
#
SmartTVApp: clean $(SMART_TV)
	$(CC) -g -o $@ $@.cpp $(SMART_TV) $(PTREE_LIB) -I$(PTREE_INC) -I$(BOOST_HOME) $(XML_FLAGS) $(XML_LIBS)

#
# Clean
#
clean:
	@rm -f SmartTV.o SmartTVApp.o SmartTVApp
	@rm -f *~
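Running make (or make all) therefore cleans any previous objects, compiles the SMART-TV sources, and links the SmartTVApp executable against the P-tree API and libxml2; make clean removes the generated objects and the executable. PTREE_HOME is assumed to be set in the environment to the root of the P-tree API installation.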
