VERTICAL TOTAL VARIATION FOR DEVELOPING A
SCALABLE NEAREST NEIGHBOR CLASSIFIER
A Dissertation Submitted to the Graduate Faculty
of theNorth Dakota State University
of Agricultural and Applied Science
By
Taufik Fuadi Abidin
In Partial Fulfillment of the Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Department: Computer Science
May 2006
Fargo, North Dakota
This page is intentionally left blank for the approval sheet
ABSTRACT
Abidin, Taufik Fuadi, Ph.D., Department of Computer Science, College of Science and Mathematics, North Dakota State University, May 2006. Vertical Total Variation for Developing a Scalable Nearest Neighbor Classifier. Major Professor: Dr. William Perrizo.
Recent advances in computer power, network, information storage, and multimedia
have led to a proliferation of stored data in various domains, such as bioinformatics, image
analysis, the World Wide Web, networking, banking, and retailing. This explosive growth
of data has created the need for developing efficient and scalable data-mining techniques
that are capable of processing and analyzing large datasets. In data mining, classification is
one of the important functionalities. Classification involves predicting the class label of
newly encountered objects using feature attributes of a set of pre-classified objects. The
classification result can be used to understand the existing objects in the dataset and to
understand how new objects are grouped. In this dissertation, we focus our work on
classification, more precisely on a scalable classification algorithm. We propose an
efficient and scalable nearest neighbor classification algorithm that efficiently filters the
candidates of neighbors by creating a total variation contour around the unclassified object.
The objects within the contour are considered as the superset of nearest neighbors. These
neighbors are identified efficiently using the P-tree range query algorithm without having to
scan the total variation values of the training objects one by one. The proposed algorithm
further prunes the neighbor set by means of dimensional projections. After pruning, the
k-nearest neighbors are searched from the pruned neighbor set. The proposed algorithm
uses the P-tree vertical data structure, one choice of vertical representation that has been
experimentally proven to address the curse of scalability and to facilitate efficient data
mining over large datasets. An efficient and scalable Vertical Set Squared Distance (VSSD)
is used to compute total variation of a set of objects about a given object. The efficiency
and scalability of the proposed algorithm are demonstrated empirically through
experimentation using both real-world and synthetic datasets. The application of the
proposed algorithm in image categorization is also discussed. Finally, the step-by-step
integration of the proposed algorithm into DataMIME™ as a prototype of a new nearest
neighbor classification algorithm that uses P-tree technology is also reported.
ACKNOWLEDGMENTS
First of all, I would like to express my sincere gratitude and appreciation to my
adviser and research supervisor, Dr. William Perrizo, for his strong support, constructive
comments, suggestions, and encouragement, which have brought me to a high level of
research accomplishment and enabled me to successfully complete this degree.
I would also like to gratefully acknowledge my supervisory committee, Dr. D.
Bruce Erickson, Dr. Akram Salah, and Dr. Xiwen Cai, for their valuable advice and
comments. Thanks also to Amal Perera and Masum Serazi for their friendship and for
helping me understand the P-tree API. I am also grateful to Ranapratap Syamala
for his willingness to proofread the language. Last but not least, many thanks go to all
DataSURG members for numerous stimulating research discussions.
DEDICATION
To my wonderful children, Alif, Zafir, and Jaza, and my lovely wife, Ridha, whose
love and patience have instilled in me the spirit to complete my doctoral degree.
To my father, Abidin, and my mother, Salmah, who have waited for so long for this
time to come, and my sisters, Rina, Nurul, and their families, who have always been
true supporters in my life.
TABLE OF CONTENTS
ABSTRACT .......................................................................................................................iii
ACKNOWLEDGMENTS......................................................................................................v
DEDICATION.......................................................................................................................vi
LIST OF TABLES..................................................................................................................x
LIST OF FIGURES...............................................................................................................xi
CHAPTER 1. INTRODUCTION..........................................................................................1
CHAPTER 2. CLASSIFICATION........................................................................................7
2.1. Overview.............................................................................................................7
2.2. Classification Algorithms...................................................................................8
2.2.1. Support Vector Machine.............................................................................8
2.2.2. Naïve Bayesian Classifiers.........................................................................9
2.2.3. Decision Tree Classifiers..........................................................................10
2.2.4. K-Nearest Neighbor Classifiers................................................................11
CHAPTER 3. P-TREE VERTICAL DATA STRUCTURE................................................14
3.1. Introduction.......................................................................................................14
3.2. The Construction of P-Trees.............................................................................15
3.3. P-Tree Operations.............................................................................................15
CHAPTER 4. VERTICAL APPROACH FOR COMPUTING TOTAL VARIATION.....18
4.1. Introduction.......................................................................................................18
4.2. The Proposed Approach....................................................................................19
4.2.1. Vertical Set Squared Distance..................................................................19
4.2.2. Retaining Count Values............................................................................22
4.2.3. Complexity Analysis.................................................................................25
4.3. Performance Analysis.......................................................................................28
4.3.1. Datasets.....................................................................................................28
4.3.2. Run Time and Scalability Comparison.....................................................30
4.4. Conclusion........................................................................................................33
CHAPTER 5. SMART-TV: AN EFFICIENT AND SCALABLE NEAREST NEIGHBOR BASED CLASSIFIER....................................................................................34
5.1. Introduction.......................................................................................................34
5.2. Hyper Parabolic Graph of Total Variations......................................................35
5.3. The Proposed Algorithm...................................................................................39
5.3.1. Preprocessing Phase..................................................................................40
5.3.2. Classifying Phase......................................................................................41
5.3.3. Detailed Description of the Proposed Algorithm......................................42
5.4. Illustrative Examples of the Pruning Technique...............................................47
5.5. Weighting Function..........................................................................................50
5.6. Performance Analysis.......................................................................................52
5.6.1. Datasets.....................................................................................................53
5.6.2. Parameterization.......................................................................................56
5.6.3. Classification Accuracy Comparison........................................................57
5.6.4. Classification Time Comparison...............................................................64
5.7. Conclusion........................................................................................................68
CHAPTER 6. THE APPLICATION OF THE PROPOSED ALGORITHM IN IMAGE CLASSIFICATION..................................................................................................................69
6.1. Introduction.......................................................................................................69
6.2. Image Preprocessing.........................................................................................70
6.3. Experimental Results........................................................................................72
6.3.1. An Example on Corel Dataset...................................................................73
6.3.2. Classification Accuracy............................................................................75
6.3.3. Classification Time Comparison...............................................................77
6.4. Conclusion........................................................................................................78
CHAPTER 7. INTEGRATING THE PROPOSED METHOD INTO DATAMIME™...............80
7.1. Introduction.......................................................................................................80
7.2. Server-Side Components..................................................................................81
7.3. Client-Side Components...................................................................................82
7.4. Graphical User Interface...................................................................................82
CHAPTER 8. CONCLUSION AND FUTURE WORK.....................................................87
8.1. Conclusion........................................................................................................87
8.2. Future Work......................................................................................................89
REFERENCES.....................................................................................................................91
APPENDIX ......................................................................................................................97
A.1. SmartTVApp Class...........................................................................................97
A.2. SmartTV Header Class...................................................................................101
A.3. SmartTV Class................................................................................................103
A.4. Makefile..........................................................................................................116
LIST OF TABLES
Table Page
1. Example dataset................................................................................................................22
2. The count values of each class....................................................................................24
3. The specification of the machines..............................................................................28
4. Time for VSSD to compute all count values..............................................................31
5. The average time to compute the total variations under different machines..............32
6. Loading time comparison...........................................................................................33
7. Class distribution of KDDCUP dataset......................................................................55
8. Classification accuracy on the KDDCUP dataset for k = 3........................................58
9. Classification accuracy on the KDDCUP dataset for k = 5........................................58
10. Classification accuracy on the KDDCUP dataset for k = 7........................................59
11. Classification accuracy on the WDBC dataset for k = 3............................................59
12. Classification accuracy on the WDBC dataset for k = 5............................................59
13. Classification accuracy on the WDBC dataset for k = 7............................................59
14. Classification accuracy comparison on the OPTICS dataset for k = 3.......................60
15. Classification accuracy comparison on the OPTICS dataset for k = 5.......................61
16. Classification accuracy comparison on the OPTICS dataset for k = 7.......................61
17. Classification accuracy on the Iris dataset for k = 3...................................................62
18. Classification accuracy on the Iris dataset for k = 5...................................................62
19. Classification accuracy on the Iris dataset for k = 7...................................................62
20. Average classification accuracy for k = 3..................................................................63
21. Average classification accuracy for k = 5..................................................................63
22. Average classification accuracy for k = 7..................................................................63
23. Run time and scalability comparison on the RSI dataset............................................66
24. Preprocessing time of SMART-TV algorithm on the RSI dataset.............................66
25. Average classification time.........................................................................................67
26. Classification accuracy comparison using k = 3, k = 5, and k = 7.............................76
27. Preprocessing time of SMART-TV algorithm on the Corel dataset...........................78
LIST OF FIGURES
Figure Page
1. Example of maximized and non-maximized margins..................................................9
2. A decision tree............................................................................................................11
3. The 1-dimensional P-trees from attribute A1..............................................................17
4. Algorithm to get the count values...............................................................................25
5. Algorithm to compute TV(X, a)..................................................................................27
6. The original image of the RSI dataset........................................................................29
7. Time trend for computing the total variations............................................................32
8. Graph of .........................................................................................37
9. Graph of .......................................................................38
10. Graph of ......................................................................................40
11. The pre-image of the contour of interval [g(b), g(c)] creates a Nbrhd(a, e)...............42
12. P-tree range query algorithm......................................................................................43
13. Algorithm to create a contour mask............................................................................43
14. An illustration of the dimensional projection contour................................................44
15. Pruning algorithm.......................................................................................................46
16. Pruning the neighbor set using dimensional projections............................................47
17. Pruning example 1......................................................................................................48
18. Pruning example 2......................................................................................................49
19. Weighting function ....................................................................................51
20. Weighting function .......................................................................................51
21. Weighting function ....................................................................................52
22. Run time and scalability comparison on the RSI dataset............................................65
23. Classification time on the KDDCUP dataset..............................................................67
24. The classes of the images............................................................................................73
25. Example using Corel dataset with pruning.................................................................74
26. Classification time, k = 5, e = 0.01, and MS = 1000...................................................78
27. Code segments of SMART-TV in the Predictor class................................................82
28. Graphical user interface for mining with SMART-TV algorithm..............................83
29. Graphical user interface showing the classification results........................................84
30. Graphical user interface showing the vote histogram.................................................85
31. Graphical user interface showing the performance of a validation............................86
CHAPTER 1. INTRODUCTION
Recent advances in computer power, network, information storage, and multimedia
have led to a proliferation of stored data in various domains like bioinformatics, image
analysis, the World Wide Web, networking, banking, and retailing. This rapid growth of
stored data has created the need for developing efficient and scalable data-mining
techniques that can extract valuable and interesting information from a large volume of
data.
Data mining, or knowledge discovery in databases, emerged in response to the explosive
growth of data. It is the non-trivial process of extracting interesting and potentially valuable
patterns from a large volume of data [1]. Data-mining functionalities can be divided into three
broad categories: association rules mining (ARM), clustering, and classification. ARM
discovers interesting association or correlation relationships among objects in the databases
that match the support and confidence thresholds. A common application of ARM is
market basket analysis, which examines the correlation between customers’ purchasing
habits and the items they purchase. The results of the analysis can help
decision makers design catalogs, arrange shelves, and plan appropriate
marketing strategies.
Clustering can be defined as the process of grouping a set of data objects to
discover meaningful clusters such that objects within the same cluster are more similar
to one another than to objects in other clusters. Clustering is also known as
unsupervised learning because class labels for the data instances are not available.
Cluster analysis is useful in understanding the distribution of the data and is often
used as the preprocessing step for other data-mining algorithms operating on the detected
clusters [1].
Classification, in contrast, is the process of assigning a class label to an unclassified
object based on some notion of similarity between the object and the data objects in
the training set. Because the training set of pre-classified objects is used to
supervise the classification process, classification is also known as supervised learning. The
first step in classification is to build a model or a classifier, and the second step is to predict
the class membership of new data instances using the classifier. Often, before the classifier
is used to classify real samples, the accuracy of the classifier is estimated. To do this,
a set of testing samples with known class labels, independent of the
training set, is created. The accuracy of the classifier is measured as the percentage of
testing samples that are assigned to their correct class.
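This accuracy estimate reduces to a simple proportion over the held-out testing set. A minimal sketch in C++ (the function name and types are illustrative, not taken from the dissertation's code):

```cpp
#include <cstddef>
#include <vector>

// Fraction of testing samples whose predicted class label matches the
// true label. 'predicted' and 'actual' are parallel vectors of labels.
double classification_accuracy(const std::vector<int>& predicted,
                               const std::vector<int>& actual) {
    if (predicted.empty() || predicted.size() != actual.size()) return 0.0;
    std::size_t correct = 0;
    for (std::size_t i = 0; i < predicted.size(); ++i)
        if (predicted[i] == actual[i]) ++correct;
    return static_cast<double>(correct) / predicted.size();
}
```

For example, three correct assignments out of four testing samples give an estimated accuracy of 0.75.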
This dissertation is based on the research projects and papers published in
[2, 3, 4, 5, 6]. The work is focused on classification, one of the data-mining functionalities.
Classification is commonly used in various domains such as bioinformatics, image
analysis, spatial databases, and banking. In bioinformatics, a classification model is used to
predict the functions of the newly discovered genes based on the functions of a collection
of well annotated genes. In the Sorcerer II oceanographic expedition headed by J.C. Venter
[7] for example, the group of researchers discovered at least 1,800 new species of bacteria
and more than 1.2 million new genes from about 200 liters of ocean water collected
in the Sargasso Sea near Bermuda. These new genes need to be classified in order to
understand their behavior and grouping. To classify such a large number of new genes, efficient
and scalable classification techniques are needed.
Many excellent studies in classification have been conducted. Vapnik [8]
introduced the Support Vector Machine (SVM) classification algorithm that transforms the
input space into a higher dimensional feature space with a nonlinear mapping (kernel
function). With an appropriate mapping, SVM creates a hyperplane (decision boundary)
such that the distance between the hyperplane and the closest samples (support vectors)
from the two classes is maximized. Once the maximum-margin hyperplane is determined, the class
label of the new sample is a matter of deciding on which side of the hyperplane the
new sample lies. The ability to determine the maximum-margin hyperplane often lets SVM
achieve good classification accuracy. However, SVM is very slow in training because it has
to find the hyperplane with the largest margin; it also treats every classification problem as a
binary classification problem and does not scale to very large training datasets [9, 10].
Cover et al. [11] introduced what is called the nearest neighbor classifier. Nearest
neighbor classifiers are fascinating because of their simplicity and their generality in modeling a
wide range of problems. The classifiers search for the nearest neighbors in the training set in a
brute-force fashion and assign a class label to the unclassified sample based on the plurality
class among the nearest neighbors. The search is repeated for every new instance. In the
case of large training sets, brute-force search for nearest neighbors is very expensive and
tedious.
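The brute-force scheme can be sketched as follows; this is a generic illustration of nearest neighbor classification with a plurality vote, not the dissertation's vertical P-tree implementation, and all names are illustrative:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// A labeled training object: feature vector plus class label.
struct Sample { std::vector<double> features; int label; };

// Squared Euclidean distance between two feature vectors of equal length.
double sq_dist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// Brute-force k-NN: scan every training object, sort by distance to the
// query, and return the plurality class among the k closest. Ties are
// broken by the smaller label (std::map iteration order).
int knn_classify(const std::vector<Sample>& train,
                 const std::vector<double>& query, std::size_t k) {
    std::vector<std::pair<double, int>> dists;   // (distance, label)
    for (const Sample& s : train)
        dists.push_back({sq_dist(s.features, query), s.label});
    std::sort(dists.begin(), dists.end());       // full scan + sort: O(n log n)
    std::map<int, int> votes;
    for (std::size_t i = 0; i < k && i < dists.size(); ++i)
        ++votes[dists[i].second];
    int best = -1, best_count = -1;
    for (const auto& v : votes)
        if (v.second > best_count) { best = v.first; best_count = v.second; }
    return best;
}
```

The full scan over the training set on every query is exactly the cost that makes this approach prohibitive for large training sets.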
The work proposed in [12] introduced an algorithm to accelerate the k-nearest
neighbor search for image retrieval problems. In retrieval problems, the goal is to
determine the nearest neighbors of the query object, while in the classification problems,
the goal is to find the nearest neighbors that will determine the class label of the
unclassified object. In the algorithm, the distance between the query object and data objects
in the training set is accumulated by scanning the dimensional projections one by one. The
assumption is that, after scanning a few of them, the partial distance to the query object is
known and the lower and upper bounds of the complete distance can be estimated.
Subsequently, the data objects that are outside the estimated bounds are pruned. The
process is repeated until the candidate set contains exactly k objects or all dimensions have
been scanned. Good acceleration was reported for the nearest neighbor search.
However, when the database is very large, the time to scan the dimensions to estimate the
partial distance will be significant. In addition, the algorithm was designed specifically for
accelerating nearest neighbor search in content-based image retrieval problems.
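The core idea of accumulating distance over dimensional projections and discarding candidates early can be sketched as below. This simplified version prunes a candidate as soon as its partial squared distance, which is a lower bound on its complete squared distance, exceeds the best complete distance seen so far; the cited algorithm's lower- and upper-bound estimation is more elaborate, and all names here are illustrative:

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Nearest neighbor search with partial-distance pruning: the squared
// distance accumulated over the first few dimensions can only grow as
// more dimensions are scanned, so a candidate whose partial sum already
// exceeds the best complete distance found so far can be discarded
// without scanning its remaining dimensions.
std::size_t nn_partial_distance(const std::vector<std::vector<double>>& data,
                                const std::vector<double>& query) {
    double best = std::numeric_limits<double>::max();
    std::size_t best_idx = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        double partial = 0.0;
        bool pruned = false;
        for (std::size_t d = 0; d < query.size(); ++d) {
            double diff = data[i][d] - query[d];
            partial += diff * diff;
            if (partial > best) { pruned = true; break; }  // lower bound exceeded
        }
        if (!pruned && partial < best) { best = partial; best_idx = i; }
    }
    return best_idx;
}
```

The saving grows with dimensionality, but the outer loop still visits every data object, which is why the scan itself becomes the bottleneck on very large databases.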
Another strategy commonly used to accelerate the nearest neighbor search for
classification and retrieval problems is to use an additional data structure such as a k-d
tree. A k-d tree hierarchically decomposes the space into relatively small cells such that
each cell contains a small number of objects. The objects are then accessed quickly by
traversing down the tree. Studies show that a k-d tree can reduce the search complexity
from O(n) to O(log n) since the query object is not exhaustively compared with every object
in the space. However, k-d trees are efficient only for small datasets, in the range of
thousands to hundreds of thousands of objects, and their performance degrades in high
dimensions [13].
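A minimal k-d tree sketch illustrates the hierarchical decomposition and the pruned descent that yields the O(log n) average search cost; the structure and names are illustrative only, not taken from any cited implementation:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <memory>
#include <vector>

// One node per point; the splitting axis cycles with tree depth.
struct KdNode {
    std::vector<double> point;
    std::unique_ptr<KdNode> left, right;
};

// Build by splitting at the median along the current axis, so each level
// roughly halves the cell.
std::unique_ptr<KdNode> build(std::vector<std::vector<double>> pts,
                              std::size_t depth = 0) {
    if (pts.empty()) return nullptr;
    std::size_t axis = depth % pts[0].size();
    std::sort(pts.begin(), pts.end(),
              [axis](const auto& a, const auto& b) { return a[axis] < b[axis]; });
    std::size_t mid = pts.size() / 2;
    auto node = std::make_unique<KdNode>();
    node->point = pts[mid];
    node->left = build(std::vector<std::vector<double>>(pts.begin(),
                                                        pts.begin() + mid),
                       depth + 1);
    node->right = build(std::vector<std::vector<double>>(pts.begin() + mid + 1,
                                                         pts.end()),
                        depth + 1);
    return node;
}

double sqd(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return s;
}

// Recursive search: descend toward the query's side first, then visit the
// other side only if the splitting plane is closer than the best distance
// found so far -- this pruning gives the O(log n) average behavior.
void nearest(const KdNode* node, const std::vector<double>& q,
             std::size_t depth, const std::vector<double>*& best, double& best_d) {
    if (!node) return;
    double d = sqd(node->point, q);
    if (d < best_d) { best_d = d; best = &node->point; }
    std::size_t axis = depth % q.size();
    double delta = q[axis] - node->point[axis];
    const KdNode* near_side = delta < 0 ? node->left.get() : node->right.get();
    const KdNode* far_side  = delta < 0 ? node->right.get() : node->left.get();
    nearest(near_side, q, depth + 1, best, best_d);
    if (delta * delta < best_d)          // other half-space may hold a closer point
        nearest(far_side, q, depth + 1, best, best_d);
}
```

In high dimensions the pruning test `delta * delta < best_d` almost always succeeds, so both subtrees are visited and the search degenerates toward the O(n) scan, which is the degradation noted above.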
Most of the state-of-the-art classification algorithms use the traditional horizontal
record based structuring to represent the data. The use of traditional horizontal record and
sequential scan based approach are known to scale poorly to very large data repositories
[14]. Jim Gray of Microsoft, in his talk at ACM SIGMOD 2004 [15], emphasized that
vertical column-based structuring, as opposed to horizontal row-based structuring, can speed
up query processing and ensure scalability. Jin et al. [16] proposed a solution for the scalability
of data-mining algorithms through a parallelization approach, which distributes the
processing across several high-performance clusters. However, according to [17],
parallelization alone is inadequate to solve the scalability problem because data
volumes grow much faster than CPU processing speeds.
In this dissertation, we propose a new and scalable nearest neighbor based
classification algorithm that efficiently filters the candidates of neighbors by means of
vertical total variation. The vertical total variation of a set about a given object is computed
using a new, efficient, and scalable Vertical Set Squared Distance (VSSD) algorithm, which
will be discussed in detail in Chapter 4. The proposed algorithm employs the P-tree vertical
data structure, one choice of vertical representation that has been experimentally proven to
address the curse of scalability and to facilitate efficient data mining over large datasets.
This vertical data structure was introduced by Perrizo [18].
The proposed algorithm is a nearest neighbor based classification algorithm, which
will be discussed in detail in Chapter 5. Unlike the k-nearest neighbor classification
algorithm where the k-nearest neighbors are searched from the entire training set, in the
proposed algorithm, the nearest neighbors are searched from a pruned neighbor set. This
neighbor set is obtained by forming a total variation contour around the unclassified object.
The objects within this contour are considered as the superset of nearest neighbors, which
can be identified efficiently using the P-tree range query algorithm without the need to scan the
total variation values of the training objects one by one. An efficient pruning technique
using dimensional projections is also introduced to prune the superfluous neighbors in the
neighbor set so that after pruning, the k-nearest neighbors are searched from a small set of
neighbors. The efficiency, scalability, and effectiveness of the proposed algorithm are
demonstrated empirically using both real-world and synthetic datasets. In particular,
datasets containing up to ninety-six million objects are used to evaluate the run time and
scalability of the algorithm.
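Conceptually, the total variation contour selects training objects whose total variation, taken here as the sum of squared Euclidean distances to every object in the training set (normalization conventions vary), falls within a small interval around the unclassified object's own total variation. The horizontal, scan-based sketch below only illustrates what the contour selects; the contribution of the proposed algorithm is to identify this same set without any scan, via the P-tree range query, and the names and the plus-or-minus epsilon interval are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Squared Euclidean distance between two equal-length vectors.
double sq_dist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// TV(X, a): total variation of the set X about the point a, taken here as
// the sum of squared Euclidean distances from a to every object in X.
double total_variation(const std::vector<std::vector<double>>& X,
                       const std::vector<double>& a) {
    double tv = 0.0;
    for (const auto& x : X) tv += sq_dist(x, a);
    return tv;
}

// Scan-based illustration of the contour filter: keep the indices of
// training objects whose total variation lies within +-epsilon of the
// unclassified object's total variation. (The point of the vertical
// approach is to obtain this same candidate set WITHOUT this loop.)
std::vector<std::size_t> contour_candidates(
        const std::vector<std::vector<double>>& train,
        const std::vector<double>& unclassified, double epsilon) {
    double tv_a = total_variation(train, unclassified);
    std::vector<std::size_t> keep;
    for (std::size_t i = 0; i < train.size(); ++i) {
        double tv_i = total_variation(train, train[i]);
        if (tv_i >= tv_a - epsilon && tv_i <= tv_a + epsilon)
            keep.push_back(i);
    }
    return keep;
}
```

The candidate set returned here is the superset of nearest neighbors from which the k nearest are ultimately chosen; the subsequent dimensional-projection pruning then removes the superfluous members.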
We extend the work by applying the proposed algorithm to an image classification
problem. Chapter 6 demonstrates that the proposed algorithm can be used
for the image classification task, which in general involves large image repositories and represents
image feature vectors in a high dimensional space. The feature vectors are constructed from
color distribution, image texture, image structure (shape) or their combination. In this
work, we extract both color distribution and image texture features to represent the images
and observe the performance of the proposed algorithm in image classification task.
We integrate the proposed algorithm into DataMIME™, the P-tree based data
mining system, as a prototype of a new vertical nearest neighbor classification algorithm.
The integration process and snapshots of the graphical user interface
will be presented in Chapter 7. The code of the algorithm is enclosed in the appendix and
can be downloaded from http://www.cs.ndsu.nodak.edu/~datasurg/codes.php3.
Finally, we conclude this dissertation in Chapter 8 by summarizing the main
contributions of the work and presenting some directions for future work.
CHAPTER 2. CLASSIFICATION
2.1. Overview
Classification is one of the data-mining functionalities. Classification involves
predicting the class label of newly encountered objects using feature attributes of a set of
pre-classified objects. The predictive pattern from the classification result can be used to
understand the existing objects in the databases and to understand how new objects are
grouped [19]. The goal of classification is to determine the class membership of the
unclassified objects.
Classification in data mining has much in common with classification as done in the
machine learning and statistics communities. The main difference is in the cardinality of
the datasets. In data mining, the cardinality of the data is assumed to be very large,
generated from many sources such as satellite images, sales records, and microarray data,
which have now reached terabytes in size [20]. This has made scalability a major issue
and a precondition for the success of any algorithm.
Many classification techniques have been introduced in the literature. We will
review some of them in the next section. The classification algorithms can be compared
based on several criteria such as scalability, speed, accuracy, and robustness. In this
dissertation, we compare the proposed algorithm in terms of scalability, speed, and
accuracy. Scalability refers to the ability of the algorithm to run on large amounts of
data, speed refers to the amount of time needed to finish the classification task, and
accuracy refers to the ability of the algorithm to correctly predict the class membership
of a new instance [1].
2.2. Classification Algorithms
In general, classification algorithms can be divided into two subgroups. The first
group is the classification algorithms that construct a model from the set of labeled objects
in the training set. These algorithms are known as eager classifiers. Much of the time in
these algorithms is invested in the learning phase to generate a general model from the
training set, and classification is just a matter of applying the generated model. Examples
of these classification methods are Support Vector Machines, Bayesian classifiers, neural
networks, and decision tree classifiers. The second group is the lazy classification algorithms,
which invest no effort in the learning phase but put all effort into the classification phase.
An example of this type of algorithm is the k-nearest neighbor classifier. The k-nearest neighbor
classifiers find the most similar objects to the new unclassified object in the training set,
and classify the new object into the most common class among these most similar objects.
These most similar objects are usually called the nearest neighbors. In the following
sections, we will briefly summarize some of the classification algorithms.
2.2.1. Support Vector Machine
Support Vector Machine (SVM) is a well-known classification technique [8]. SVM transforms the input space into a higher-dimensional feature space using a kernel function and creates a hyperplane that separates the binary classes such that the distance between the hyperplane and the support vectors in each class is maximized (Figure 1). SVM has been validated experimentally to often achieve good accuracy. However, SVM does not scale to very large training sets, and the overall performance of the algorithm largely depends on the choice of the kernel function. One example of the expensive training of SVM can be found in [10], in which the SVM algorithm takes about 2.85 hours to learn from a training set of 1,000 points in two-dimensional space (the checkerboard dataset), running on a 400 MHz Pentium II Xeon machine with 2 gigabytes of memory.
Figure 1. Example of maximized and non-maximized margins.
2.2.2. Naïve Bayesian Classifiers
Naïve Bayesian classifiers are statistical classifiers based on Bayes' theorem [21]. The class label of the new instance is predicted based on the probability of each class to which the new instance may belong. Let X be the new instance whose class label is unknown, H be the hypothesis that the instance X belongs to a specific class C, and P(H) be the prior probability of H for any instance. The objective is to determine P(H|X), the posterior probability of H given X, using the probabilities P(H), P(X), and P(X|H). The posterior probability can be estimated using Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)

The naïve Bayesian classifiers use the above relation to estimate the class label of a new instance X. Let C1, C2, …, Cm be the class labels in the given training set. The class label of X is estimated as the class Ci with the highest posterior probability P(Ci|X), that is, P(Ci|X) > P(Cj|X) for all j ≠ i, where

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

P(X) can be considered a constant for all classes, while P(Ci) can be estimated from the training set, i.e., the number of samples in class Ci divided by the total number of training samples. The naïve Bayesian classifiers make an assumption of class-conditional independence between the attributes and estimate P(X|Ci) as the product of the probabilities of the attribute values of the new instance X given class Ci.
Although the naïve Bayesian classifiers theoretically have the minimum error rate compared to other classifiers, in practice this is not always the case because of inaccuracies in the class-conditional independence assumption [21]. The computational cost of the naïve Bayesian classifiers lies in computing the probability values, which can be very expensive for large training sets and high dimensions.
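To make the counting involved concrete, the following sketch (Python, with hypothetical toy data and function names; not the dissertation's implementation) estimates the prior P(Ci) and the per-attribute conditionals from frequency counts and picks the class with the highest product:

```python
# Minimal naive Bayes sketch: P(Ci) and P(X|Ci) from frequency counts,
# under the class-conditional independence assumption described above.
from collections import Counter, defaultdict

def train(samples):
    # samples: list of (attribute_tuple, class_label)
    class_counts = Counter(label for _, label in samples)
    attr_counts = defaultdict(Counter)  # (class, attr_index) -> value counts
    for attrs, label in samples:
        for i, v in enumerate(attrs):
            attr_counts[(label, i)][v] += 1
    return class_counts, attr_counts

def classify(x, class_counts, attr_counts):
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, n_c in class_counts.items():
        p = n_c / total                      # prior P(Ci)
        for i, v in enumerate(x):            # product of conditionals P(v|Ci)
            p *= attr_counts[(c, i)][v] / n_c
        if p > best_p:
            best, best_p = c, p
    return best
```

In a production setting one would add smoothing to avoid zero probabilities, but the sketch keeps the raw estimates to mirror the description above.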
2.2.3. Decision Tree Classifiers
The concept of decision trees was initially introduced by Quinlan [22]. In a decision
tree classifier, the training set is split into smaller subsets based on attribute values using a
split rule. The internal nodes of the tree represent the decision rules, while the leaf nodes represent the predicted class labels. The unclassified sample is classified by traversing the tree starting at the root. At each internal node, an attribute of the unclassified sample is evaluated to determine the next branch. The class label of the unclassified sample is the label of the leaf node where the traversal ends. A simple example of a decision tree, adopted from [23], is shown in Figure 2.
Figure 2. A decision tree (splitting on Car Type and Age to predict High Risk or Low Risk).
Most decision tree algorithms, such as ID3, C4.5, and CART, work well for relatively small datasets. Their efficiency becomes questionable when they are applied to large real-world datasets, since the training set will not fit into memory. A more recent decision tree algorithm, CLOUD [24], targets large datasets and introduces a new mechanism for splitting the attributes. However, the proposed splitting method requires at least one pass over the training set, which can also be very expensive as the training set grows.
2.2.4. K-Nearest Neighbor Classifiers
Assigning a class to an unclassified sample based on the nearest neighboring samples has been investigated since 1967 [11]. The classification scheme can be summarized as follows: given a training set of data objects in d-dimensional space, the k-nearest neighbor (KNN) classifiers assign a category to an unclassified object based on the plurality of categories of its k nearest neighbors.
KNN classifiers do not build the model in advance. Instead, they search for the
k-nearest neighbors directly from the training set. The closeness is defined in terms of a
distance function, e.g., the Euclidean distance. KNN classifiers often produce good classification accuracy. However, a potential drawback of KNN classifiers is
that the classification time will be significant when the size of the training set is very large.
Searching through the training set to find the k-nearest neighbors can be a time-consuming
process. The complexity of finding the k-nearest neighbors in a brute-force manner is O(nm) for each unclassified sample, since the classifier has to visit each of the n training objects and perform m operations to calculate each distance [25]. This high complexity makes the approach impractical for applications involving very large datasets.
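The brute-force scheme described above can be sketched as follows (a minimal Python illustration with hypothetical names and toy data; it visits all n training objects and performs work proportional to the dimensionality per distance):

```python
# Brute-force k-NN sketch: O(nm) per query, where n is the number of
# training objects and m is the per-distance work (here the dimensionality).
from collections import Counter

def knn_classify(training, x, k):
    # training: list of (vector, label); x: query vector
    dists = sorted(
        (sum((vi - xi) ** 2 for vi, xi in zip(vec, x)), label)
        for vec, label in training
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]  # plurality vote of the k nearest
```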
Many excellent studies have been done to make KNN classifiers scalable, such as
those reported in [26, 27]. Khan et al. [26] introduced a P-tree based k-nearest neighbor
classifier (P-KNN), which uses P-tree vertical structure to accelerate the classification time
and uses Higher Order Bit Similarity (HOBBIT) as the similarity metric. As the name implies, the similarity is measured based on the number of consecutive bits in the higher-order positions that are in common. Formally, the HOBBIT similarity between integers A and B is defined as follows:

HOBBIT(A, B) = max{s | 0 ≤ i ≤ s implies ai = bi}

where ai and bi are the bits of integers A and B at position i, counted from the most significant bit. The distance between A and B in a single dimension is then 0 if A = B and 2^(m − 1 − HOBBIT(A, B)) otherwise, and the distance between two vectors is taken as the maximum of these one-dimensional distances over the n dimensions, where n is the number of dimensions and m is the number of bits used to represent the integer values A and B. In order to use the HOBBIT metric, all dimensions must be represented using the same number of bits.
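A sketch of the HOBBIT computation on m-bit integers follows (Python; the single-dimension distance form 2^(m − 1 − s) for unequal values is one common formulation from the P-KNN literature and should be treated as an assumption here, as should the function names):

```python
# HOBBIT similarity: count of consecutive matching high-order bits.
def hobbit_similarity(a, b, m=8):
    s = 0
    for i in range(m - 1, -1, -1):       # scan from most significant bit down
        if (a >> i) & 1 == (b >> i) & 1:
            s += 1
        else:
            break
    return s

def hobbit_distance(a, b, m=8):
    # One reported formulation: 0 if equal, else 2^(m - 1 - similarity).
    return 0 if a == b else 2 ** (m - 1 - hobbit_similarity(a, b, m))
```

Note how the distance can only take m + 1 values (0 and powers of two), which is why the neighborhood rings it induces expand in coarse, uneven steps.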
The closed k-nearest neighbor set (closed-KNN set) was also introduced in the P-KNN algorithm. The closed-KNN set is a superset of the k-nearest neighbor set that includes all neighbors equidistant with the kth nearest neighbor. According to Khan, the closed-KNN set can improve classification accuracy [26]. P-tree technology requires no additional computation to determine the closed-KNN set. On the other hand, classical KNN classifiers require an additional scan over the training set to find it, which can be very expensive when the training set is very large.
P-KNN works as follows: It builds a neighborhood ring around the unclassified
sample and successively expands the ring until at least k-nearest neighbors are found. The
ring expansion grows by ignoring one least significant bit at a time. Experiments show that P-KNN is fast and accurate on spatial data. However, the neighborhood ring produced by HOBBIT similarity cannot expand evenly from the unclassified object when a bit is ignored, which can consequently move the center of the ring away from the unclassified object.
An alternative approach to alleviate the uneven expansion of the neighborhood is to
use an Equal Interval Neighborhood ring (EIN-ring) approach, which builds the ring
around the unclassified object equally. However, this approach has additional
computational overheads, i.e., it requires many logical AND and OR operations to form an
equal neighborhood ring.
Podium Incremental Neighbor Evaluator (PINE) [27] is another P-tree based
k-nearest neighbor classifier. PINE allows all data objects in the training set to vote, but
each of them is weighted in a podium fashion based on their distance to the unclassified
object. In PINE, the HOBBIT metric is also used as the distance metric, and the Gaussian function is used as the podium function.
CHAPTER 3. P-TREE VERTICAL DATA STRUCTURE
3.1. Introduction
Vertical data representation represents and processes the data differently from
horizontal data representation. In vertical data representation, the data is structured column
by column and processed horizontally through logical AND or OR operations, while in
horizontal data representation, the data is structured row by row and processed vertically
through scanning or using some notion of an index. The P-tree vertical data structure [18] is one choice of vertical data representation and is the structure used for data representation and processing in this dissertation. The P-tree vertical data structure was invented
in 2001 and was primarily used for representing spatial data vertically [26, 27]. However,
since then, the P-tree has been intensively exploited in various domains and data mining
algorithms, ranging from classification, clustering, association rule mining to outlier
analysis [3, 4, 28, 29, 30, 31]. In September 2005, P-tree technology was patented in the
United States by North Dakota State University, patent number 6,941,303.
P-tree vertical data structure is a lossless, compressed, and data-mining ready data
structure. P-tree is lossless because the vertical bit-wise partitioning guarantees that the
original data values can be retained completely. P-tree is compressed because when the
segments of bit sequences are either pure-1 or pure-0, they can be represented in a single
bit. P-tree is data-mining ready because it addresses the curse of cardinality or the curse of
scalability, one of the major issues in data mining. P-tree vertical data structure has been
used in various data mining algorithms and has been experimentally proven to have great
potential to address the curse of scalability.
3.2. The Construction of P-Trees
P-trees can be formed directly from binary data as well as from categorical and numerical data. The categorical and numerical data are typically organized in a relational table containing several attributes. The construction of the P-tree vertical data structure starts by converting the dataset, normally arranged in a relation R(A1, A2, …, Ad) of horizontal records, into binary. Each attribute in the relation is vertically partitioned (projected), and for each bit position in the attribute, a vertical bit sequence (containing 0s and 1s) is created. During partitioning, the relative order of the data is retained so that the original records can be reconstructed. In 0-dimensional P-trees, the vertical bit sequences are left uncompressed and are not constructed into predicate trees. The length of a 0-dimensional P-tree is equal to the cardinality of the dataset. In 1-dimensional compressed P-trees, the
vertical bit sequences are constructed into predicate trees. In this compressed form, AND
operations can be accelerated. The 1-dimensional P-trees are constructed by recursively
halving the vertical bit sequences and recording the truth of “purely 1-bits” predicate in
each half. A predicate 1 indicates that the bits in that half are all 1s, and a predicate 0
indicates otherwise. To denote a P-tree, two subscripts are used: the first indicates the attribute to which the P-tree belongs, and the second indicates the bit position within that attribute. Consider Figure 3 for some insight into how the 1-dimensional P-trees of a single attribute A1 are constructed.
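The bit-slicing step described above can be sketched as follows (Python; an illustrative helper, not the dissertation's C++ implementation, producing the uncompressed 0-dimensional bit sequences for the attribute A1 of Figure 3):

```python
# Vertical bit-slicing of one attribute: one bit sequence per bit position,
# most significant position first, preserving the row order of the relation.
def bit_slices(values, bits):
    return [
        [(v >> (bits - 1 - j)) & 1 for v in values]
        for j in range(bits)
    ]
```

For A1 = (4, 2, 2, 7, 5, 1, 6, 3) and bits = 3, the three slices contain 4, 5, and 4 one-bits respectively, matching the COUNT values quoted for P11, P12, and P13 in the next section.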
3.3. P-Tree Operations
As opposed to horizontal record structure in which data are processed vertically
through scanning, in P-tree vertical data structure, data are processed horizontally through
logical operations such as AND and OR. These logical operations are extremely fast, and
thus, any data mining functionality that builds on these operations can be performed extremely fast. COMPLEMENT, a unary operator that flips every bit into its negation, can also be applied to the P-tree vertical structure. Range queries, value queries, or any other patterns can be obtained using a combination of these Boolean algebraic operators.
Besides AND, OR, and COMPLEMENT, another powerful aggregate operation is COUNT, which counts the number of 1s in a basic or complement P-tree. For example, using P-trees P11, P12, and P13 from Figure 3, the values returned by COUNT(P11), COUNT(P12), and COUNT(P13) are 4, 5, and 4, respectively. The COUNT operation has
been implemented in the P-tree API [32]. In fact, it is the main operation exploited in the
Vertical Set Squared Distance algorithm, which will be discussed in Chapter 4. Detailed
information about the P-tree data structure and its operations can also be found in [33].
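On uncompressed bit sequences, these Boolean operations and the COUNT aggregate reduce to the following sketch (Python; illustrative only — the actual P-tree API operates on the compressed tree form):

```python
# The four P-tree operations on uncompressed (0-dimensional) bit sequences.
def p_and(p, q):
    return [a & b for a, b in zip(p, q)]        # logical AND

def p_or(p, q):
    return [a | b for a, b in zip(p, q)]        # logical OR

def p_complement(p):
    return [1 - a for a in p]                   # flip every bit

def p_count(p):
    return sum(p)                               # number of 1 bits
```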
Figure 3. The 1-dimensional P-trees from attribute A1.
[Figure 3 shows attribute A1 = (4, 2, 2, 7, 5, 1, 6, 3), its binary representation, the three vertical bit sequences 10011010, 01110011, and 00011101 (one per bit position), and the 1-dimensional compressed P-trees built from each sequence by recursive halving.]
CHAPTER 4. VERTICAL APPROACH FOR COMPUTING
TOTAL VARIATION1
4.1. Introduction
The rapid growth of data poses great challenges and generates an urgent need for
efficient and scalable algorithms that can deal with massive datasets. In this chapter, we
propose a vertical approach for computing set squared distance that measures the total
variation of a set of objects about a given object in vector space. The total variation is very
useful in classification, which will be demonstrated in Chapter 5, and clustering to identify
the cluster boundary and determine the cluster membership [28]. The set squared distance, defined as SUM(x ∈ X) (x − a) ∘ (x − a), measures the total variation or the cumulative squared separation of the set of vectors in X about a given vector a, denoted as TV(X, a).
Since scalability is becoming a major issue due to the availability of large volumes of data, any new algorithm should be able to handle large datasets. In this chapter, we focus on the scalability of the proposed approach. The proposed approach employs the P-tree vertical data structure, which organizes the data vertically and processes it horizontally through fast and efficient logical AND, OR, and NOT operations. Using the P-tree vertical data structure, the need to repeatedly scan the dataset, as is commonly done in horizontal record-based approaches, can be avoided. We demonstrate the scalability of the proposed approach empirically through several experiments.
Throughout the chapter, we use the term “VSSD” to refer to Vertical Set Squared
Distance, a vertical approach for computing total variation, and use the term “HSSD” to
1 This chapter is a modified version of a paper which appears in the International Journal of Computers and their Applications (IJCA), vol. 13, no. 2, pp. 94-102, June 6, 2006.
refer to Horizontal Set Squared Distance, a horizontal approach for computing total
variation. The horizontal approach uses a scanning approach to compute the total variation.
4.2. The Proposed Approach
4.2.1. Vertical Set Squared Distance
Let R(A1, A2, …, Ad) be a relation in d-dimensional space. A numerical value v of attribute Ai can be written in a b-bit binary representation as follows:

v = SUM(j=0..b-1) v(i,j) * 2^j, where v(i,j) can be either 0 or 1     (1)

The first subscript corresponds to the attribute to which v belongs, and the second subscript corresponds to the bit position. The summation on the right-hand side of the equation is equal to the numerical value of v in base 10.

Now let x be a vector in d-dimensional space. The binary representation of x in b bits can be written as follows:

x = (x1, …, xd) = (SUM(j=0..b-1) x(1,j) * 2^j, …, SUM(j=0..b-1) x(d,j) * 2^j)     (2)

Let X be a set of vectors in a relation R, x ∈ X, and a be the vector being examined. The total variation of X about a, denoted as TV(X, a), can be measured quickly and scalably using the vertical set squared distance. Starting from the squared Euclidean separation,

TV(X, a) = SUM(x ∈ X) SUM(i=1..d) (xi − ai)^2     (3)

Expanding the square gives

TV(X, a) = SUM(x ∈ X) SUM(i=1..d) (xi^2 − 2*xi*ai + ai^2)     (4)

Commuting the summation over x ∈ X further inside to first process all vectors that belong to X vertically, and then process each attribute horizontally, and substituting the binary representation xi = SUM(j=0..b-1) x(i,j) * 2^j, we get

TV(X, a) = T1 + T2 + T3, where
    T1 = SUM(i=1..d) SUM(j=0..b-1) SUM(k=0..b-1) 2^(j+k) * SUM(x ∈ X) x(i,j) * x(i,k)
    T2 = −2 * SUM(i=1..d) ai * SUM(j=0..b-1) 2^j * SUM(x ∈ X) x(i,j)
    T3 = |X| * SUM(i=1..d) ai^2     (5)

Let PX be a P-tree mask of the set X that can quickly identify the data objects in X. PX is a bit pattern containing 1s and 0s, where bit 1 indicates that the object at that bit position belongs to X, while 0 indicates otherwise. Using the mask, equation (5) can be simplified by substituting SUM(x ∈ X) x(i,j) * x(i,k) with COUNT(PX AND P(i,j) AND P(i,k)). Recall that the aggregate COUNT operation counts the number of 1 bits in the pattern. Hence, the simplified form of the first term of equation (5) can be written as follows:

T1 = SUM(i=1..d) SUM(j=0..b-1) SUM(k=0..b-1) 2^(j+k) * COUNT(PX AND P(i,j) AND P(i,k))     (6)

Similarly for terms T2 and T3, we derive the solutions shown in equations (7) and (8), respectively:

T2 = −2 * SUM(i=1..d) ai * SUM(j=0..b-1) 2^j * COUNT(PX AND P(i,j))     (7)

T3 = COUNT(PX) * SUM(i=1..d) ai^2     (8)

Hence, VSSD is defined to be

TV(X, a) = SUM(i,j,k) 2^(j+k) * COUNT(PX AND P(i,j) AND P(i,k))
         − 2 * SUM(i) ai * SUM(j) 2^j * COUNT(PX AND P(i,j))
         + COUNT(PX) * SUM(i) ai^2     (9)

Now, let us consider X as the relation R itself. In this case, the mask PX can be removed since all objects belong to R. Equation (9) can then be written as

TV(R, a) = SUM(i,j,k) 2^(j+k) * COUNT(P(i,j) AND P(i,k))
         − 2 * SUM(i) ai * SUM(j) 2^j * COUNT(P(i,j))
         + |X| * SUM(i) ai^2     (10)

where |X| is the cardinality of R.
Furthermore, note that the aggregate COUNT values in both equations (9) and (10) are independent of the input vector a. This independence is a real advantage because, once the count values are obtained, they can be reused every time the total variation is computed. This reusability speeds up the computation of total variation significantly.
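The separation between the one-time counting step and the per-query evaluation can be sketched as follows (Python; a direct transcription of the count-based form of equation (9) on uncompressed bit sequences, with j = 0 denoting the least significant bit — the helper names are hypothetical and the real implementation operates on compressed P-trees):

```python
# One-time step: precompute COUNT(PX), COUNT(PX & Pij), COUNT(PX & Pij & Pik).
def precompute_counts(bitcols, px):
    # bitcols[i][j] = vertical bit sequence of attribute i, bit j (j=0 is LSB)
    # px            = bit mask selecting the members of set X
    b = len(bitcols[0])
    cnt1 = sum(px)
    cnt2 = [[sum(bit & m for bit, m in zip(col[j], px)) for j in range(b)]
            for col in bitcols]
    cnt3 = [[[sum(bj & bk & m for bj, bk, m in zip(col[j], col[k], px))
              for k in range(b)]
             for j in range(b)]
            for col in bitcols]
    return cnt1, cnt2, cnt3

# Per-query step: evaluate TV(X, a) from the counts alone, independent of |X|.
def vssd(a, cnt1, cnt2, cnt3):
    b = len(cnt2[0])
    t1 = sum(2 ** (j + k) * cnt3[i][j][k]
             for i in range(len(cnt3)) for j in range(b) for k in range(b))
    t2 = -2 * sum(a[i] * sum(2 ** j * cnt2[i][j] for j in range(b))
                  for i in range(len(cnt2)))
    t3 = cnt1 * sum(ai * ai for ai in a)
    return t1 + t2 + t3
```

On the Table 1 example of the next section, this sketch reproduces the worked results TV(C1, (2, 3)) = 105 and TV(C2, (2, 3)) = 42.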
4.2.2. Retaining Count Values
We will discuss the strategy to retain the count values in this section. Let us
consider an example dataset containing 10 data points as shown in Table 1. The dataset has
two feature attributes: X and Y, and a class attribute containing two values: C1 and C2. The
binary values of each data point are included in the table for clarity. The last two columns on the right side of the table are the masks of the classes, denoted PX1 and PX2. In this example, we want to measure the total variation of each class about a given point, and
thus, the dataset is subdivided into two sets. The first set is a collection of data points in
class C1, and the second set is a collection of data points in class C2.
Table 1. Example dataset.
X Y CLASS X in Binary Y in Binary PX1 PX2
7 6 C1 111 110 1 0
2 6 C2 010 110 0 1
6 3 C1 110 011 1 0
3 3 C2 011 011 0 1
3 4 C2 011 100 0 1
7 5 C1 111 101 1 0
7 2 C1 111 010 1 0
4 5 C2 100 101 0 1
1 4 C2 001 100 0 1
6 5 C2 110 101 0 1
The count values are stored separately in three files. Assume that the dataset is subdivided into several sets; then the first file contains the count values of COUNT(PX), the second file contains the count values of COUNT(PX AND P(i,j)), and the third file contains the count values of COUNT(PX AND P(i,j) AND P(i,k)). Conversely, if the dataset is considered as a single set, the first file contains the cardinality of the dataset, the second file contains the counts of the basic P-trees, COUNT(P(i,j)), and the third file contains the counts COUNT(P(i,j) AND P(i,k)). The count values in each file are organized in an appropriate order. For example, for
the COUNT(PX) operation, the count value of the first set is stored first, followed by the count value of the second set, and so forth. Similarly for COUNT(PX AND P(i,j)), the count values of the first set are stored first, followed by those of the next set, and so forth. For each set, the total number of these count values is equal to the summation of the bit lengths of the dimensions; this number is needed to correctly retrieve the values of each set when a total variation is computed. The same strategy is used for storing the count values of COUNT(PX AND P(i,j) AND P(i,k)); for each set, the total number of count values is equal to the summation of the squared bit lengths of the dimensions. Table 2 summarizes the count values of classes C1 and C2, and Figure 4 shows the algorithm to obtain the count values.
Table 2. The count values of each class.

CLASS  i  j  k  COUNT(PX^Pij^Pik)  COUNT(PX^Pij)  COUNT(PX)
 C1    1  2  2          4                4             4
 C1    1  2  1          4
 C1    1  2  0          3
 C1    1  1  2          4                4
 C1    1  1  1          4
 C1    1  1  0          3
 C1    1  0  2          3                3
 C1    1  0  1          3
 C1    1  0  0          3
 C1    2  2  2          2                2
 C1    2  2  1          1
 C1    2  2  0          1
 C1    2  1  2          1                3
 C1    2  1  1          3
 C1    2  1  0          1
 C1    2  0  2          1                2
 C1    2  0  1          1
 C1    2  0  0          2
 C2    1  2  2          2                2             6
 C2    1  2  1          1
 C2    1  2  0          0
 C2    1  1  2          1                4
 C2    1  1  1          4
 C2    1  1  0          2
 C2    1  0  2          0                3
 C2    1  0  1          2
 C2    1  0  0          3
 C2    2  2  2          5                5
 C2    2  2  1          1
 C2    2  2  0          2
 C2    2  1  2          1                2
 C2    2  1  1          2
 C2    2  1  0          1
 C2    2  0  2          2                3
 C2    2  0  1          1
 C2    2  0  0          3

Subscript i represents the index of the attribute, while subscripts j and k represent bit positions.
ALGORITHM: GetCounts()
INPUT: P-tree set Pi(b-1), ..., Pi1, Pi0 and the mask px
OUTPUT: Count values stored in c3, c2, c1
// n is the number of attributes
// b is the bit width
// px is the P-tree mask of set X
for i = 0 to n-1
    for j = 0 to b-1
        for k = 0 to b-1
            rc = COUNT(p(i,j) & p(i,k) & px)
            c3.insert(rc)
        endfor
        rc = COUNT(p(i,j) & px)
        c2.insert(rc)
    endfor
endfor
c1.insert(COUNT(px))
Figure 4. Algorithm to get the count values.
4.2.3. Complexity Analysis
The cost of VSSD lies in the computation of the count values. When the dataset is subdivided into several subsets, the complexity is O(kdb^2), where k is the number of subsets, d is the number of feature dimensions, and b is the average bit length. However, when the entire dataset is considered as a single set, the complexity is reduced to O(db^2).
The choice whether to consider the entire dataset as a single set or subdivide it into several
subsets depends on the situation. In classification tasks, the set will be the entire training
set, whereas in clustering tasks, the sets are the clusters [28].
Moreover, the complexity of computing the total variation itself using VSSD is constant, O(1). This constant complexity is obtained because the same count values can be reused for any given input vector; it is just a matter of taking the right count values and evaluating equation (9) or (10), without any further COUNT operations. For example, let a = (2, 3) be the vector being examined and let the count values of classes C1 and C2 be as listed in Table 2. The total variation of class C1 about a, denoted TV(C1, a), can be computed as follows:
TV(C1, a) = 2^4*4 + 2^3*4 + 2^2*3 + 2^3*4 + 2^2*4 + 2^1*3 + 2^2*3 + 2^1*3 + 2^0*3
          + 2^4*2 + 2^3*1 + 2^2*1 + 2^3*1 + 2^2*3 + 2^1*1 + 2^2*1 + 2^1*1 + 2^0*2
          − 2 * (2 * (2^2*4 + 2^1*4 + 2^0*3) + 3 * (2^2*2 + 2^1*3 + 2^0*2))
          + 4 * (2^2 + 3^2)
          = 105
Similarly, the total variation of class C2 about a can be computed as follows:

TV(C2, a) = 2^4*2 + 2^3*1 + 2^2*0 + 2^3*1 + 2^2*4 + 2^1*2 + 2^2*0 + 2^1*2 + 2^0*3
          + 2^4*5 + 2^3*1 + 2^2*2 + 2^3*1 + 2^2*2 + 2^1*1 + 2^2*2 + 2^1*1 + 2^0*3
          − 2 * (2 * (2^2*2 + 2^1*4 + 2^0*3) + 3 * (2^2*5 + 2^1*2 + 2^0*3))
          + 6 * (2^2 + 3^2)
          = 42
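These totals can be cross-checked against a direct horizontal computation over the Table 1 points — a minimal Python sketch of HSSD, the scanning form of equation (3):

```python
# Horizontal set squared distance: scan every point and sum its squared
# separation from the examined vector a.
def hssd(points, a):
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(p, a)) for p in points)
```

Applied to the class C1 points (7,6), (6,3), (7,5), (7,2) and the class C2 points from Table 1 with a = (2, 3), it yields the same values, 105 and 42.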
Figure 5 shows the algorithm to compute the total variation using VSSD. The inputs of the algorithm are the precomputed count values and the vector a, and the output is the total variation TV(X, a).
ALGORITHM: TV(X, a)
INPUT: Count value arrays C1, C2, C3 and the input vector a
OUTPUT: The total variation of set X about vector a
// n is the number of attributes
// b is the bit width
// interval2 = n x b
// interval3 = n x b^2
// X=0 means the entire set, X=1 means the 1st class, etc.
// C1, C2, and C3 are the arrays recording COUNT values
T1 = 0
T2 = 0
T3 = 0
indexC2 = 0
indexC3 = 0
for i = 0 to n-1
    SumA = 0
    Sum1 = 0
    Sum2 = 0
    for j = 0 to b-1
        for k = 0 to b-1
            Sum1 = Sum1 + 2^(j+k) * C3.at(X * interval3 + indexC3)
            indexC3 = indexC3 + 1
        endfor
        SumA = SumA + 2^j * a(i,j)
        Sum2 = Sum2 + 2^j * C2.at(X * interval2 + indexC2)
        indexC2 = indexC2 + 1
    endfor
    T1 = T1 + Sum1
    T2 = T2 + Sum2 * SumA
    T3 = T3 + SumA * SumA
endfor
T2 = T2 * (-2)
T3 = T3 * C1.at(X)
RETURN (T1 + T2 + T3)
Figure 5. Algorithm to compute TV(X, a).
4.3. Performance Analysis
In this section, we report the performance analysis. The objective is to compare the
efficiency and scalability between VSSD employing a vertical approach (vertical data
structure with horizontal bitwise operations) and HSSD utilizing a horizontal approach
(horizontal data structure with vertical scans). HSSD is defined as shown in equation (3).
Both VSSD and HSSD were implemented using the C++ programming language. The
programming application interface for P-tree vertical technology, P-Tree API [32], was
incorporated in the implementation of VSSD. The performance of both approaches was
observed under several different machine specifications, including an SGI Altix
CC-NUMA machine, as listed in Table 3.
Table 3. The specification of the machines.
Machine Specification Memory
AMD AMD Athlon K7 1.4GHz processor 1.0 GB
P4 Intel P4 2.4GHz processor 3.8 GB
SGI Altix SGI Altix CC-NUMA 12 processors Shared Memory (12 x 4 GB)
4.3.1. Datasets
The datasets were taken from a set of aerial photographs of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota (longitude 97°42'18"W), taken in 1998. The images contain three bands: red, green, and
blue reflectance values. The values are between 0 and 255, which in binary numbers can be
represented using 8 bits. The original image is of size 1024x1024 pixels (a cardinality of 1,048,576) and is depicted in Figure 6. Corresponding synchronized data for soil moisture,
soil nitrate, and crop yield were also used. Crop yield was selected to be a class attribute.
Combining all bands and synchronized data, we obtained a dataset with 6 dimensions
(5 feature attributes and 1 class attribute).
Additional datasets with different cardinality were synthetically generated from the
original dataset to study the speed and scalability of the methods. Both speed and
scalability were evaluated with respect to dataset size. Because of the relatively small cardinality of the original dataset (1,048,576 records), we super-sampled it using a simple image processing tool to produce five larger datasets, with cardinalities of 2,097,152, 4,194,304 (2048x2048 pixels), 8,388,608, 16,777,216 (4096x4096 pixels), and 25,160,256 (5016x5016 pixels). We
categorized the crop yield attribute into four different categories to simulate various subsets
in the datasets. The categories are: low yield having intensity between 0 and 63, medium
low yield having intensity between 64 and 127, medium high yield having intensity between
128 and 191, and high yield having intensity between 192 and 255.
Figure 6. The original image of the RSI dataset.
4.3.2. Run Time and Scalability Comparison
Our first observation was to evaluate the performance of VSSD and HSSD when
running on different machines. We discovered that VSSD is significantly faster than HSSD
on all machines. VSSD takes only 0.0004 seconds on average to compute the total
variations of each set (low yield, medium low yield, medium high yield, and high yield)
about 5 tested points. As discussed before in Section 4.2.3, the cost for VSSD lies in the
computation of count values. However, this computation is extremely fast because the
COUNT operations are simply counting the number of 1s in the patterns. We discovered
that for each dataset, the aggregate function COUNT were executed 1,280 times, derived
from 4 x 5 x 82, or equal to the complexity of computing count values O(kdb2).
Table 4 summarizes the amount of time needed for VSSD to run all COUNT
operations on the different datasets and machines. Notice that when running on the AMD machine, VSSD needs only 0.4 seconds on average to finish a single COUNT operation for the dataset of size 25,160,256, while on the P4 machine it needs only 0.15 seconds on average for the same dataset. The COUNT operation was even faster when VSSD ran on the SGI Altix: it takes 183.81 seconds to complete all 1,280 COUNT operations, or only 0.14 seconds on average per COUNT operation.
The computation of the total variation itself is very fast for VSSD once the count values are obtained: it is a matter of taking the appropriate count values and completing the summation in equation (9), without any COUNT operations. We report only the time to compute the total variations when VSSD ran on the AMD machine, because the same time was found on the other machines.
Table 4. Time for VSSD to compute all count values.
Dataset Sizes          Time (Seconds)
                  AMD        P4      SGI Altix
1,048,576 14.57 5.05 4.39
2,097,152 36.32 11.05 9.19
4,194,304 75.89 24.03 21.73
8,388,608 147.79 49.69 50.25
16,777,216 305.22 97.59 121.73
25,160,256 513.98 192.07 183.81
In contrast, HSSD takes more time to compute the total variations. The time is linear in the dataset size and differs on every machine. For example, on the AMD machine, HSSD takes 79.86 and 132.17 seconds on average to compute the total variations for the datasets of size 16,777,216 and 25,160,256, respectively. On the P4 machine, HSSD takes 98.62 and 155.06 seconds on average for the same datasets. The same phenomenon was found when HSSD ran on the SGI Altix
machine. The average time to compute the total variations for the dataset of size
16,777,216 is twice the time to compute the total variations for the dataset of size
8,388,608. Table 5 shows the average time to compute the total variations on different
machines, and Figure 7 illustrates the time trend. The time in the table shows a clear
advantage in using the proposed approach.
It is important to note that the significant disparity in time between VSSD and HSSD is due to the capability of VSSD to reuse the same count values once they are computed. As a result, VSSD has essentially constant time when computing total variations even as the dataset sizes vary. On
the other hand, HSSD must scan the datasets each time the total variations are computed.
Thus, the time to compute a total variation is linear to the cardinality of the datasets.
Table 5. The average time to compute the total variations under different machines.
Dataset Sizes     Average Time to Compute the Total Variations (Seconds)
                          HSSD                        VSSD
                  AMD       P4      SGI Altix         AMD
1,048,576 5.30 6.14 6.79 0.0004
2,097,152 10.58 12.27 13.84 0.0004
4,194,304 18.40 24.73 27.64 0.0004
8,388,608 36.85 50.15 55.10 0.0004
16,777,216 79.86 98.62 109.76 0.0004
25,160,256 132.17 155.06 164.95 0.0004
Figure 7. Time trend for computing the total variations (VSSD on AMD; HSSD on AMD, P4, and SGI Altix).
Our second observation compares the time to load the datasets into memory. We found that when the datasets are organized in the P-tree vertical structure, loading is more efficient than when the datasets are organized in a horizontal structure; see Table 6. The reason is that datasets organized in the P-tree vertical structure are stored in binary and hence can be loaded efficiently, whereas datasets organized in a horizontal structure are not in binary format and take more time to load into memory.
Table 6. Loading time comparison.
Dataset Sizes     Average Loading Time (Seconds)
                  Loading P-trees and Count Values     Loading Horizontal Datasets
                  AMD      P4      SGI Altix           AMD      P4      SGI Altix
1,048,576 0.11 0.04 0.04 31.65 24.12 25.92
2,097,152 0.25 0.09 0.06 63.22 48.21 51.96
4,194,304 0.47 0.16 0.12 118.61 97.87 103.98
8,388,608 0.95 0.35 0.26 243.84 202.69 208.61
16,777,216 1.87 0.67 0.55 489.59 389.96 415.43
25,160,256 3.33 0.95 0.82 784.57 588.27 625.33
4.4. Conclusion
In this chapter, we have introduced a vertical approach for computing the total variation and evaluated its performance. The results show that VSSD is fast and scalable for computing the total variation on very large datasets. The independence of the COUNT operations from the input vector makes the computation of total variation using the vertical approach extremely fast. The proposed approach is scalable due to the use of the P-tree vertical structure, which organizes the data vertically and processes it horizontally through logical AND and OR operations.
CHAPTER 5. SMART-TV: AN EFFICIENT AND SCALABLE
NEAREST NEIGHBOR BASED CLASSIFIER2
5.1. Introduction
Classification on large datasets has become one of the most important research
priorities in data mining due to the large volume of data currently available. Classification
involves predicting the class label of newly encountered objects using feature attributes of a
set of pre-classified objects.
The k-nearest neighbor (KNN) classifier is the most commonly used neighborhood-based classifier due to its simplicity, robustness, and good performance. Given a training set, the KNN classifier does not build a model in advance like decision tree induction [24], neural networks [34], and Support Vector Machines [8, 9, 10]; instead, it defers all effort to the classification phase, when a new instance arrives. The classification decision is then made locally based on the features of the new instance. The KNN classifier searches for the most similar objects in the training set and assigns a class label to the new instance based on the plurality of categories of the k nearest neighbors. The similarity or closeness between the training objects and the new instance is determined using a distance measure, e.g., the Euclidean distance. Studies have shown that the KNN classifier performs well on various datasets. However, when the training set is very large, i.e., millions of objects, the classification time grows linearly with the training set size.
In this chapter, we propose an efficient and scalable nearest neighbor classification
algorithm, called SMART-TV. The proposed algorithm finds the candidates of neighbors
2 This chapter is a modified version of a published paper which appeared in the Proceedings of the 21st ACM Symposium on Applied Computing (Data Mining Track) (SAC-06), pp. 536-540, Dijon, France, April 23-27, 2006, with a slightly different title.
by forming a total variation contour around the unclassified object. The objects within the
contour are then considered as the superset of nearest neighbors. This set of neighbors is
identified efficiently using the P-tree range query algorithm without having to scan the
total variation values of the training objects. The proposed algorithm further prunes the
neighbor set using a novel pruning technique, called dimensional projections. After
pruning, the k-nearest neighbors are searched from the pruned set, and then we let them
vote to determine the class label of the unclassified object.
In the preprocessing phase, the total variation function is applied to each training
object, and derived P-trees of these functional values are created. The derived P-trees are
used to efficiently determine the superset of neighbors in the contour. We empirically show
that the proposed algorithm is efficient and scalable to large datasets. In particular,
datasets of sizes up to ninety-six million objects are used to evaluate the run time and
scalability of the proposed algorithm.
The remainder of the chapter is organized as follows: in Section 5.2, we discuss the
graph of the total variations. In Section 5.3, we delineate the proposed algorithm in detail,
followed by two illustrative examples of the pruning technique in Section 5.4. We briefly
discuss the weighting functions used for voting in Section 5.5. We report the performance
analysis in Section 5.6, and finally, we offer concluding remarks in Section 5.7.
5.2. Hyper Parabolic Graph of Total Variations
Let R(A1,…,Ad, C) be a training space, and X(A1,…,Ad) = R[A1,…,Ad] be the features
subspace, and TV(X, a) be the total variation of X about a. The total variation graph is a
hyper-parabolic surface that is always minimized at the mean (μ). The following proof will
show that the total variation graph is always minimized at the mean.

Let

    f(a) = TV(X, a) = Σ_{x∈X} Σ_{i=1..d} (x_i − a_i)²

The above equation clearly shows that f(a) is parabolic in each dimension a_i. Now, we
examine the first partial derivative of f(a), ∂f/∂a_i, to determine the minimum value of
f(a) by fixing the dimension:

    ∂f/∂a_i = −2 Σ_{x∈X} (x_i − a_i)

Let N be the total number of objects in X. The summation can be simplified as
Σ_{x∈X} x_i = N μ_i, thus

    ∂f/∂a_i = −2 (N μ_i − N a_i) = −2N (μ_i − a_i)

From the above observation, it is clear that ∂f/∂a_i = 0 when a_i = μ_i. Therefore, f(a) is
always minimized at the mean in all dimensions.
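The minimization argument above can be checked numerically. The following sketch (illustrative only; the data and perturbation size are arbitrary) perturbs the mean in each dimension and confirms that the total variation can only increase:

```python
import random

random.seed(7)
X = [[random.uniform(0.0, 10.0) for _ in range(3)] for _ in range(50)]
mean = [sum(x[i] for x in X) / len(X) for i in range(3)]

def total_variation(X, a):
    """f(a) = sum over x in X of sum over i of (x_i - a_i)^2."""
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for x in X)

tv_at_mean = total_variation(X, mean)
# Moving away from the mean in any single dimension increases f(a)
# by exactly N * delta^2, so it can never decrease.
for i in range(3):
    for delta in (-0.5, 0.5):
        shifted = list(mean)
        shifted[i] += delta
        assert total_variation(X, shifted) > tv_at_mean
```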
Figure 8 illustrates the total variation graph of equally distributed data objects in a
2-dimensional space. From the graph, it is clear that the minimum value is at the mean.
Figure 8. Graph of TV(X, a).
Now let f(a) = TV(X, a) − TV(X, μ). Recall that the total variation is defined to be:

    TV(X, a) = Σ_{x∈X} Σ_{i=1..d} (x_i − a_i)²

Since TV(X, a) = TV(X, μ) + N Σ_{i=1..d} (a_i − μ_i)², we obtain the following equation:

    f(a) = TV(X, a) − TV(X, μ) = N Σ_{i=1..d} (a_i − μ_i)²
Figure 9 shows the graph of f(a) = TV(X, a) − TV(X, μ). The shape of the graph is
exactly the same as the shape of TV(X, a) illustrated in Figure 8. The only
difference is that when a = μ, the value of the function is f(a) = 0.
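The decomposition used above, TV(X, a) = TV(X, μ) + N Σ_{i=1..d} (a_i − μ_i)², can likewise be verified numerically (an illustrative sketch with arbitrary data, not the dissertation's code):

```python
import random

random.seed(11)
X = [[random.uniform(0.0, 10.0) for _ in range(2)] for _ in range(40)]
N = len(X)
mu = [sum(x[i] for x in X) / N for i in range(2)]

def tv(X, a):
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for x in X)

a = [7.0, 3.0]
lhs = tv(X, a) - tv(X, mu)                             # f(a)
rhs = N * sum((ai - mi) ** 2 for ai, mi in zip(a, mu))  # N * |a - mu|^2
assert abs(lhs - rhs) < 1e-6
assert lhs > 0  # f vanishes only at a = mu
```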
Figure 9. Graph of f(a) = TV(X, a) − TV(X, μ).
5.3. The Proposed Algorithm
The proposed algorithm finds the candidates of neighbors by creating a total
variation contour around the unclassified object. The objects within the contour are
considered as the superset of nearest neighbors (candidate set). These neighbors are then
pruned before the k-nearest neighbors are searched from the set. After pruning, the
k-nearest neighbors vote to determine the class label of the unclassified object.
Let R(A1,…,Ad, C) be a training space, X(A1,…,Ad) = R[A1,…,Ad] be the feature
subspace, TV(X, a) be the total variation of the feature subspace X about a, and f(a) be
the function defined as f(a) = TV(X, a) − TV(X, μ). In the preprocessing phase, the
function is applied to each training object, and derived P-trees of those functional
values are created to incorporate a fast and efficient way to determine the candidates of
neighbors. Since in large training sets the values of f(a) can be very large, and
representing these large values in binary would require an unnecessarily large number of
bits, we define g(a) = ln(f(a)) to reduce the bit width.

We observe the gradient of g(a) for a ≠ μ by fixing one dimension at a time. We find
that the gradient vanishes only at a = μ, and that the gradient length depends only on the
length of the vector (a − μ). This indicates that the isobars of g are hyper-circles
centered at μ. Note that, to avoid a singularity at a = μ, we add a constant 1 to the
function f(a), such that g(a) = ln(f(a) + 1) = ln(TV(X, a) − TV(X, μ) + 1). The graph of
g(a) is shown in Figure 10.
The proposed algorithm consists of two phases: preprocessing and classifying. The
preprocessing phase is performed only once, while the classifying phase is repeated for
every unclassified object.
Figure 10. Graph of g(a) = ln(TV(X, a) − TV(X, μ) + 1).
5.3.1. Preprocessing Phase
In the preprocessing phase, we compute g(x) for every x ∈ X and create derived P-trees of
the functional values g(x). The derived P-trees are stored together with the P-trees of the
dataset. The complexity of this computation is O(n) since the computation is applied to all
objects in X. Furthermore, because the vector mean μ is used in the function, the vector μ
must be computed first.
We compute the vector mean efficiently using the P-tree vertical structure. The
aggregate function COUNT is used to count the number of 1s in each vertical bit pattern.
The following formula shows how to compute the element of the vector mean at dimension i
vertically:

    μ_i = (1/N) Σ_{j=0..b−1} 2^j · COUNT(P_{i,j})

where N is the number of objects and P_{i,j} is the P-tree of the j-th bit of attribute i.
The complexity of computing the vector mean is O(db), where d is the number of
dimensions and b is the bit width.
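The vertical mean computation can be sketched as follows, with integer bitmasks standing in for P-trees (an assumed representation for illustration, not the dissertation's P-tree API):

```python
def bit_slices(column, width):
    """Vertical decomposition of one attribute: slice j has bit r set
    iff bit j of column[r] is 1."""
    slices = [0] * width
    for row, v in enumerate(column):
        for j in range(width):
            if (v >> j) & 1:
                slices[j] |= 1 << row
    return slices

def vertical_mean(column, width):
    """mu_i = (1/N) * sum_j 2^j * COUNT(P_{i,j})."""
    n = len(column)
    counts = (bin(p).count("1") for p in bit_slices(column, width))
    return sum((1 << j) * c for j, c in enumerate(counts)) / n

assert vertical_mean([12, 7, 9, 4], 4) == 8.0  # (12 + 7 + 9 + 4) / 4
```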
5.3.2. Classifying Phase
In the classifying phase, the steps are repeated for each unclassified object. We
summarize the steps in the classifying phase as follows:
1. Determine the vector (a − μ), where a is the new object and μ is the vector mean of the
features space X.
2. Given an epsilon of the contour (e > 0), determine two vectors located on the lower and
upper sides of a by moving e units inward toward μ along the vector (a − μ) and moving
e units outward away from μ. Let b and c be the two vectors on the lower and upper
sides of a, respectively; then b and c can be determined using the following equations:

       b = a − e · (a − μ)/|a − μ|
       c = a + e · (a − μ)/|a − μ|
3. Calculate g(b) and g(c) such that g(b) ≤ g(a) ≤ g(c), and determine the interval [g(b),
g(c)] that creates a contour over the functional line. The contour mask of the interval is
created efficiently using the P-tree range query algorithm without having to scan the
functional values one by one. The mask is a bit pattern containing 1s and 0s, where bit
1 indicates that the object is in the contour while 0 indicates otherwise. The objects
within the pre-image of the contour in the original feature space are considered as the
superset of neighbors (the e-neighborhood of a, or Nbrhd(a, e)).
4. Prune the neighborhood using the dimensional projections.
5. Find the k-nearest neighbors from the pruned set by measuring the Euclidean distance
d(a, x) = √(Σ_{i=1..d} (a_i − x_i)²) for every x ∈ pruned Nbrhd(a, e).
6. Let the k-nearest neighbors vote using a weighted vote to determine the class label of
the unclassified object.
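Steps 1-3 of the classifying phase can be sketched in Python under stated assumptions: plain lists stand in for P-trees, and the contour membership test is done by direct comparison here, where the dissertation uses the P-tree range query. The name `contour_candidates` is chosen for illustration, not taken from the dissertation:

```python
import math

def tv(X, a):
    return sum(sum((xi - ai) ** 2 for xi, ai in zip(x, a)) for x in X)

def g(X, mu, a):
    # g(a) = ln(TV(X, a) - TV(X, mu) + 1)
    return math.log(tv(X, a) - tv(X, mu) + 1.0)

def contour_candidates(X, a, e):
    n, d = len(X), len(a)
    mu = [sum(x[i] for x in X) / n for i in range(d)]
    diff = [ai - mi for ai, mi in zip(a, mu)]
    norm = math.sqrt(sum(v * v for v in diff))
    unit = [v / norm for v in diff]
    b = [ai - e * ui for ai, ui in zip(a, unit)]   # e units toward mu
    c = [ai + e * ui for ai, ui in zip(a, unit)]   # e units away from mu
    lo, hi = g(X, mu, b), g(X, mu, c)
    return [x for x in X if lo <= g(X, mu, x) <= hi]

X = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]]
cands = contour_candidates(X, [4.5, 4.5], 1.5)
# All four corners are equidistant from the mean (2, 2), so the annular
# contour admits every one of them -- including (0, 0) on the far side of
# the mean, which is exactly why the pruning step is needed.
assert len(cands) == 4
```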
5.3.3. Detailed Description of the Proposed Algorithm
Consider objects in a 2-dimensional space. Initially, the algorithm determines the
vector (a − μ). Subsequently, the two vectors b and c located on the lower and upper sides
of a are determined. The interval [g(b), g(c)] on the functional line will
form a contour, and the objects within the pre-image of the contour in the original feature
space are considered as the superset of neighbors (Figure 11).
Figure 11. The pre-image of the contour of interval [g(b), g(c)] creates a Nbrhd(a, e).
The mask of the superset of neighbors (the candidate set) is created efficiently using
the P-tree range query algorithm without the need to scan the functional values one by one.
We summarize the P-tree range query algorithm in Figure 12, and the algorithm to create a
contour mask in Figure 13.
ALGORITHM: RangeQuery(v)
INPUT: Pb-1, ..., P1, P0
OUTPUT: P-tree mask PT of (value > v)

LET v = vb-1 ... v1 v0
k = 0
while (vk)
    k = k + 1
if (k < b) PT = Pk
for i = k+1 to b-1
    if (vi) PT = PT & Pi
    else    PT = PT | Pi
endfor
RETURN PT
Figure 12. P-tree range query algorithm.
ALGORITHM: ContourMask(lower, upper)
INPUT: Derived P-trees Pb-1, ..., P1, P0
OUTPUT: P-tree mask of the contour

PU = RangeQuery(upper)   // mask of values > upper
PL = RangeQuery(lower)   // mask of values > lower
RETURN PL & PU'          // PU' is the complement of PU: values in (lower, upper]
Figure 13. Algorithm to create a contour mask.
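The two algorithms above can be rendered in Python with bit columns held as plain integers (bit r of slice Pj is bit j of record r). This is an illustration of the technique, not the dissertation's P-tree API; the returned interval is (lower, upper]:

```python
def bit_slices(values, width):
    slices = [0] * width
    for row, v in enumerate(values):
        for j in range(width):
            if (v >> j) & 1:
                slices[j] |= 1 << row
    return slices

def range_query(slices, v):
    """Mask of records whose value is strictly greater than v."""
    b = len(slices)
    k = 0
    while k < b and (v >> k) & 1:  # skip the low-order 1 bits of v
        k += 1
    if k == b:                     # v is all ones: nothing can exceed it
        return 0
    pt = slices[k]
    for i in range(k + 1, b):
        pt = pt & slices[i] if (v >> i) & 1 else pt | slices[i]
    return pt

def contour_mask(slices, lower, upper, n_records):
    """Mask of records whose value lies in (lower, upper]."""
    full = (1 << n_records) - 1
    return range_query(slices, lower) & ~range_query(slices, upper) & full

vals = [3, 9, 6, 12, 7]
m = contour_mask(bit_slices(vals, 4), 5, 9, len(vals))
assert [i for i, v in enumerate(vals) if (m >> i) & 1] == [1, 2, 4]  # 9, 6, 7
```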
Since the total variation contour is annular around the mean, the candidate set may
contain neighbors that are actually far from the unclassified object a, e.g. located within
the contour but on the opposite side of a. Therefore, a pruning technique is needed to
eliminate the superfluous neighbors that may be present in the candidate set.
In the proposed algorithm, a pruning technique that uses dimensional projections is
introduced. For each dimension, a dimensional projection contour is created around the
vector element of the unclassified object in that dimension. The size of the contour is
specified by moving e units away from the element of vector a in that dimension on both
sides, as illustrated in Figure 14. The same epsilon previously used to determine vectors b
and c is used again in this case.
Figure 14. An illustration of the dimensional projection contour.
The dimensional projection requires no additional derived P-trees since the training
set is already represented in P-trees vertical structure. The P-trees in each dimension can be
used directly at no extra cost, and the objects within the contour can be identified
efficiently using the same contour mask algorithm summarized in Figure 13.
A parameter, called MS (manageable size of the candidate set), is required for
pruning. This parameter specifies the upper bound of neighbors in the candidate set so that,
when the manageable size of neighbors is reached, the pruning process will be terminated,
and the number of neighbors in the candidate set is considered small enough to be scanned
to search for the k-nearest neighbors.
The pruning technique consists of two major steps. First, it obtains the count of
neighbors in the pre-image of the total variation contour (candidate set) relative to a
particular dimension. The rationale is to maintain the neighbors that are predominant (close
to the unclassified object) in most dimensions so that, when the Euclidian distance is
measured (step 5 of the classifying phase), they are the true closest neighbors. The process
of obtaining the count starts from the first dimension. The dimensional projection contour
around the unclassified object, a, is formed, and the contour mask is created. Again, the
mask is simply a bit pattern containing 1s and 0s, where bit 1 indicates that the object
belongs to the candidate set when projected on that dimension while 0 indicates otherwise.
The contour mask is then AND-ed with the mask of the pre-image of the total variation
contour, and the total number of 1s is counted. Note that no neighbors are pruned at this
point; only the count of 1s is obtained. The process continues for all dimensions, and at the
end of the process, the counts are sorted in descending order.
The second step of the pruning is to intersect each dimensional projection contour
with the candidate set. The intersections proceed in descending order of the counts: the
dimension with the highest count is intersected first, followed by the dimension with the
second-highest count, and so forth. In each intersection, the number of neighbors in the
candidate set is updated. From the implementation perspective, this intersection is simply a
logical AND operation between the mask of the total variation contour and the mask of the
dimensional projection contour. The second step of the pruning technique continues until a
manageable size of neighbors is reached or all dimensional projection contours have been
intersected with the candidate set. Figure 15 summarizes the pseudo code of the pruning
algorithm, and Figure 16 illustrates the intersection between the pre-image of the total
variation contour and the dimensional projection contours.
ALGORITHM: pruning()
INPUT: Pb-1, ..., P1, P0, PC, a, e, MS
OUTPUT: PC - the mask of the pruned candidate set
// a is the unclassified object
// MS is the manageable size
// PC is the mask of the candidate set
// PX is the mask of the dimensional projection

i = 0
while (i < TOTAL_DIMENSION) do
    PX = ContourMask(ai - e, ai + e)
    tc = COUNT(PC & PX)
    if (tc != 0)
        countArray.add(tc, PX)
    endif
    i = i + 1
endwhile

sort countArray (descending on the count)

i = 0
while (i < LENGTH(countArray)) do
    PX = countArray.second()   // get dim proj mask
    tc = COUNT(PC & PX)
    if (tc != 0)
        PC = PC & PX
        if (tc < MS)
            break
        endif
    endif
    i = i + 1
endwhile

RETURN PC
Figure 15. Pruning algorithm.
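The pruning pseudocode can be condensed into the following Python sketch, with integer bitmasks standing in for P-tree masks and the per-dimension projection masks assumed to be precomputed (an illustration, not the dissertation's implementation):

```python
def popcount(m):
    return bin(m).count("1")

def prune(candidate_mask, dim_masks, ms):
    # Step 1: count the overlap of each dimensional projection with the
    # candidate set; nothing is pruned yet.
    counts = [(popcount(candidate_mask & px), px) for px in dim_masks]
    counts = [(tc, px) for tc, px in counts if tc != 0]
    counts.sort(key=lambda t: t[0], reverse=True)
    # Step 2: intersect in descending order of count until the candidate
    # set reaches a manageable size or every dimension has been used.
    pc = candidate_mask
    for _, px in counts:
        tc = popcount(pc & px)
        if tc != 0:
            pc &= px
            if tc < ms:
                break
    return pc

# Records 0-5 are candidates; two projection masks cover {0..3} and {0,1}.
assert prune(0b111111, [0b001111, 0b000011], ms=3) == 0b000011
```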
Figure 16. Pruning the neighbor set using dimensional projections.
5.4. Illustrative Examples of the Pruning Technique
The following example will illustrate how the neighbors in the candidate set are
pruned. Assume that there are two classes, P and Q, which contain 15 and 10 points,
respectively. The distribution of the points is depicted in Figure 17. The unclassified
object is denoted by the character a, and the two vectors b and c along the vector (a − μ)
that will form the total variation contour are also shown in the figure.
Assume that the points in the pre-image of the total variation contour (candidate set)
are {p2, p4, p6, p7, p8, p9, p11, p14, q2, q4, q7, q9}. In the first step of the pruning, the count of
neighbors in the total variation contour relative to each dimension is obtained. In this
example, the count of neighbors in the candidate set when projected on X dimension is 5,
i.e. {p2, p6, p7, p8, p14}, and the count of neighbors in the candidate set when projected on Y
dimension is 4, i.e. {p8, p14, q4, q9}. In the second step of the pruning, the candidate
set is intersected with the dimensional projection contour of the X dimension because the
number of neighbors on this dimension is predominant in the candidate set. The pruned
neighbor set will be as follows:
Candidate set = {p2, p4, p6, p7, p8, p9, p11, p14, q2, q4, q7, q9} ∩ {p2, p6, p7, p8, p14}
              = {p2, p6, p7, p8, p14}
Figure 17. Pruning example 1.
If, in this example, the manageable size of neighbors were set to 5, the pruning would
terminate because the manageable size of neighbors is reached, and the final pruned
neighbor set would be {p2, p6, p7, p8, p14}. However, if the manageable size of neighbors
were set to 3, the pruning continues: the candidate set is further intersected with the
dimensional projection contour of the Y dimension. In that case, the pruned neighbor set is
as follows:
Candidate set = {p2, p6, p7, p8, p14} ∩ {p8, p14, q4, q9}
              = {p8, p14}
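The set arithmetic of the example above can be checked with Python sets (illustrative only; the algorithm performs these intersections as logical ANDs on P-tree masks):

```python
candidates = {"p2", "p4", "p6", "p7", "p8", "p9", "p11", "p14",
              "q2", "q4", "q7", "q9"}
proj_x = {"p2", "p6", "p7", "p8", "p14"}   # X-projection contour, count 5
proj_y = {"p8", "p14", "q4", "q9"}         # Y-projection contour, count 4

after_x = candidates & proj_x              # X has the higher count (5 > 4)
assert after_x == {"p2", "p6", "p7", "p8", "p14"}
after_y = after_x & proj_y                 # only if MS is not yet reached
assert after_y == {"p8", "p14"}
```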
Let us consider another example as illustrated in Figure 18. In this example, the
unclassified object, a, is located close to the points in class Q. If the same e-contour is used,
then the initial points in the candidate set will be the same as in the previous example.
Figure 18. Pruning example 2.
In the first step of the pruning, the count of neighbors in the total variation contour
relative to each dimension is obtained. In this example, the count of neighbors in the
candidate set when projected on X dimension is 4, i.e., {q2, q4, q7, q9}, and the count of
neighbors in the candidate set when projected on Y dimension is 5, i.e., {q2, q4, q7, q9, p14}.
Since dimension Y has the predominant count of neighbors in the candidate set, in the
second step of the pruning, its dimensional projection contour is intersected with the
candidate set. After the intersection, the pruned neighbor set is as follows:
Candidate set = {p2, p4, p6, p7, p8, p9, p11, p14, q2, q4, q7, q9} ∩ {p14, q2, q4, q7, q9}
              = {p14, q2, q4, q7, q9}
Again, if the manageable size of neighbors were set to 5, the pruning would terminate
because the manageable size of neighbors is reached, and the final pruned neighbor set
would be {p14, q2, q4, q7, q9}. However, if the manageable size of neighbors were set to 3,
the pruning continues: the candidate set is further intersected with the dimensional
projection contour of the X dimension. The final pruned neighbor set is as follows:
Candidate set = {p14, q2, q4, q7, q9} ∩ {q2, q4, q7, q9} = {q2, q4, q7, q9}
In this case, although the manageable size of neighbors is not reached, the pruning will also
terminate because all dimensions have been projected.
5.5. Weighting Function
In nearest neighbor classification algorithms, the neighbors that are closer to the
unclassified object should count for more in the vote than the far neighbors. Each neighbor
should cast a vote with a certain weight depending on the distance of the neighbor to the
new sample. Different weighting functions have been introduced in the literature. According
to Atkeson [35], the requirements on a weighting function are that its maximum value should
be at zero distance and that the weight should decrease gradually as the distance
increases. Figures 19-21 illustrate some of the weighting functions adapted from [35].
Figure 19. Weighting function w = 1/(1 + d).
Figure 20. Weighting function w = 1 − d.
Figure 21. Weighting function w = exp(−d²).
In the proposed algorithm, we use w = exp(−d²) as the weighting function to influence
the vote of the k-nearest neighbors. This weighting function is a Gaussian function that
gives a smooth drop-off based on the distance of the neighbors to the unclassified object.
The closer the neighbor, the higher the weight, and vice versa.
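Step 6 of the classifying phase, the Gaussian-weighted vote, can be sketched as follows (the label set and the name `weighted_vote` are illustrative, not from the dissertation):

```python
import math
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (distance, class_label) for the k nearest.
    Each neighbor votes with weight w = exp(-d^2)."""
    votes = defaultdict(float)
    for d, label in neighbors:
        votes[label] += math.exp(-d * d)
    return max(votes, key=votes.get)

# Two close class-A neighbors outweigh three distant class-B neighbors.
knn = [(0.1, "A"), (0.2, "A"), (1.5, "B"), (1.6, "B"), (1.7, "B")]
assert weighted_vote(knn) == "A"
```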
5.6. Performance Analysis
We report the performance analysis in this section. The analysis was performed on
an Intel Pentium 4 CPU 2.6 GHz machine with 3.8GB RAM, running Red Hat Linux
version 2.4.20-8smp. We compared the proposed algorithm with the classical KNN
algorithm, the P-tree based KNN (P-KNN) using HOBBIT similarity, and the P-KNN
using EIN-ring neighborhood. The KNN algorithm was locally implemented, and it
sequentially scans the training space to find the k-nearest neighbors. All algorithms were
implemented in the C++ programming language. For the proposed algorithm, the P-tree API
was also incorporated in the implementation.
In this performance analysis, two main aspects were analyzed: 1) The running time
(scalability) of the algorithms, and 2) the classification accuracy. While some algorithms
sacrifice the accuracy for speed, or vice versa, in this performance evaluation, we will also
demonstrate that the proposed algorithm is not only fast and scalable, but also has good
classification accuracy. The accuracy is measured using the F score, a common measure of
classification accuracy [1]. The F score is defined as follows:

    F = 2PR / (P + R)

where P is the precision and R is the recall. The precision measures the ratio of correct
assignment of a class and the total number of objects assigned to that class, whereas recall
measures the ratio of correct assignment of a class and the actual number of objects in that
class. The F score further takes the ratio of these two measurements and has a score in the
range of 0 to 1. The higher the score, the better the classification accuracy is.
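Computed from raw counts, the F score looks as follows (an illustrative sketch; tp, fp, and fn denote true positives, false positives, and false negatives for one class):

```python
def f_score(tp, fp, fn):
    precision = tp / (tp + fp)   # correct assignments / all assignments
    recall = tp / (tp + fn)      # correct assignments / actual members
    return 2 * precision * recall / (precision + recall)

# 8 of 10 predicted positives are correct (P = 0.8); those 8 cover
# 8 of 16 actual positives (R = 0.5), so F = 2*0.8*0.5/1.3 = 8/13.
assert abs(f_score(tp=8, fp=2, fn=8) - 8 / 13) < 1e-12
```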
5.6.1. Datasets
We evaluated the algorithms using several datasets. Some of the datasets were taken
from the Repository of Machine Learning Databases at the University of California, Irvine
(UCI) [36]. The datasets in this repository are regarded as the benchmark datasets to
evaluate machine learning and data-mining algorithms. We also incorporated other real-life
datasets, such as the Remotely Sensed Imagery (RSI), OPTICS, and Iris datasets. The
description of each dataset is as follows:
RSI dataset: This dataset is a set of aerial photographs from the Best Management
Plot (BMP) of Oakes Irrigation Test Area (OITA) near Oakes, North Dakota, taken in
1998. The images contain three bands: red, green, and blue. Each band has values in the
range of 0 to 255, which can be represented in binary using 8 bits. The corresponding
synchronized data for soil moisture, soil nitrate, and crop yield were also used, and the crop
yield was selected as the class attribute. Combining all the bands and synchronized data, a
dataset with 6 dimensions (5 feature attributes and 1 class attribute) was obtained.
To simulate different classes, the crop yield was divided into four different
categories: low yield having intensity between 0 and 63, medium low yield having intensity
between 64 and 127, medium high yield having intensity between 128 and 191, and high
yield having intensity between 192 and 255. Three synthetic datasets were generated to
study the scalability and running time of the proposed algorithm. The cardinality of these
datasets varies from 32 to 96 million.
KDDCUP 1999 dataset: This dataset is the network intrusion dataset used in
KDDCUP 1999 [37]. The dataset contains more than 4.8 million samples from the TCP
dump. Each sample identifies a type of network intrusion. We selected six types of
intrusion, Normal, IP Sweep, Neptune, Port Sweep, Satan, and Smurf, each of which
contains at least 10,000 samples. The distribution of data in each class is tabulated in Table
7. A total of 32 numerical attributes were found after discarding the categorical attributes.
We randomly selected 120 samples, 20 from each class, for the testing sets.
Wisconsin Diagnostic Breast Cancer (WDBC) [36]: This dataset contains 569
diagnosed breast cancer patients with 30 real-valued features. The dataset was donated by
Nick Street in November 1995. The task is to predict two types of diagnoses as either
Benign (B) or Malignant (M). The distribution of data is 357 Benign and 212 Malignant.
Table 7. Class distribution of KDDCUP dataset.
Class Number of Objects
Normal 972,780
IP Sweep 12,481
Neptune 1,072,017
Port Sweep 10,413
Satan 15,892
Smurf 2,807,886
OPTICS dataset [38]: OPTICS dataset was originally used for clustering problems.
It has eight different clusters, and two of them are embedded clusters. The dataset contains
8,000 points in 2-dimensional space. We carefully added a class label to each data point
based on the original clusters and labeled as CL-1, CL-2, CL-3, CL-4, CL-5, CL-6, CL-7,
and CL-8. We randomly selected 80 points, 10 points for each class, for the testing sets.
Iris dataset [39]: The Iris plants dataset was created by R.A. Fisher. The dataset is
very popular in the machine learning community. The task is to classify Iris plants into one
of three Iris plant varieties: Iris setosa, Iris versicolor, and Iris virginica. The
dataset contains 150 instances (50 instances in each class) represented in a 4-dimensional
space (sepal length, sepal width, petal length, and petal width). Iris setosa is linearly
separable from the other two classes. We randomly selected 30 samples for the testing sets.
We normalized all datasets to prevent attributes with initially large ranges from
outweighing attributes with smaller ranges [40]. According to Han [1], data normalization
is often used for methods involving distance measurements. Normalization scales the values
of an attribute to a small range so that each attribute has equal emphasis and the same
range. In this work, we used the min-max normalization technique, after which the values
lie in the range of 0 to 1. The min-max normalization is defined as follows:

    v' = (v − min_A) / (max_A − min_A)

where min_A and max_A are the minimum and maximum values of attribute A.
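Min-max normalization can be sketched per attribute with plain lists (illustrative only; the dissertation applies it to each attribute before classification):

```python
def min_max_normalize(column):
    """Scale an attribute's values into [0, 1] via (v - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

col = [10.0, 20.0, 15.0, 30.0]
assert min_max_normalize(col) == [0.0, 0.5, 0.25, 1.0]
```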
5.6.2. Parameterization
The proposed algorithm requires three parameters. We will discuss each parameter
in this section.
1. Epsilon (e) is a positive value (e > 0) that specifies the width of the total variation
and dimensional projection contours. The epsilon should be specified carefully: if it is
too big, the number of neighbors included in the total variation contour can be very
large. A large epsilon also means a wide dimensional projection contour, which can cause
objects that are far away from the unclassified object to be included in the contour. On
the other hand, if the epsilon is too small, the number of neighbors in the total
variation contour can be very few. Thus, it is suggested that several epsilon values be
tried during a tuning phase until the best classification result is achieved.
2. The manageable size of neighbors (MS) is the parameter used for pruning. This
parameter is one of the termination conditions of the pruning step. Note that the
pruning terminates when the manageable size of neighbors is reached or all dimensions
have been examined. It is suggested that the value for this parameter not be too large
because the k-nearest neighbors are searched from the pruned neighbor set by scanning
them one by one. A value in the range of 200 to 2,000 can typically be used for this
parameter, since scanning that many neighbors on a state-of-the-art PC is very fast.
3. Number of nearest neighbors (k) specifies the number of nearest neighbors that cast a
vote to determine the class label of the unclassified object.
5.6.3. Classification Accuracy Comparison
We examined the classification accuracy using KDDCUP 1999, Wisconsin
Diagnostics Breast Cancer (WDBC), OPTICS, and Iris datasets. We used a 5-fold cross-
validation evaluation model for all datasets. We randomly divided the datasets into
disjoint training and testing subsets, 5 different times. The algorithms were tested using
each disjoint subset, and the performance results were averaged over all evaluations. We
compared the SMART-TV algorithm with P-KNN using HOBBIT, P-KNN using Equal Interval
Neighborhood (EIN-ring), and KNN with linear search.
Tables 8-10 summarize the classification accuracy on the KDDCUP dataset for
k = 3, k = 5, and k = 7. We discovered that all algorithms produced good classification
accuracy. In terms of speed, SMART-TV is faster than the other algorithms. SMART-TV
takes about 9.93 seconds on average to classify, while P-KNN with HOBBIT and KNN
take approximately 30.79 and 271.50 seconds, respectively. We used e = 0.005 and
MS = 1000 for SMART-TV. It is important to note that P-KNN with EIN-ring was not able
to run successfully on this dataset due to a memory allocation error.
Tables 11-13 show the classification accuracy on the WDBC dataset for k = 3,
k = 5, and k = 7. The WDBC dataset has 30 real-valued features. The task is to predict the
diagnosis of breast cancer. The results show that SMART-TV, P-KNN using EIN-ring,
and KNN perform better than P-KNN using HOBBIT. We used e = 0.2 and MS = 50 for
SMART-TV.
Table 8. Classification accuracy on the KDDCUP dataset for k = 3.
                    Classification Accuracy
Class        SMART-TV   P-KNN HOBBIT   KNN
Normal 0.83 0.91 0.87
IP Sweep 0.95 0.97 1.00
Neptune 1.00 0.97 0.98
Port Sweep 0.97 0.89 0.95
Satan 0.83 0.82 0.80
Smurf 0.97 0.91 1.00
Table 9. Classification accuracy on the KDDCUP dataset for k = 5.
                    Classification Accuracy
Class        SMART-TV   P-KNN HOBBIT   KNN
Normal 0.82 0.91 0.87
IP Sweep 0.95 0.98 1.00
Neptune 1.00 0.98 0.97
Port Sweep 0.97 0.89 0.95
Satan 0.80 0.82 0.80
Smurf 0.97 0.91 1.00
Table 10. Classification accuracy on the KDDCUP dataset for k = 7.
                    Classification Accuracy
Class        SMART-TV   P-KNN HOBBIT   KNN
Normal 0.82 0.91 0.87
IP Sweep 0.95 0.97 1.00
Neptune 1.00 0.97 0.98
Port Sweep 0.97 0.89 0.97
Satan 0.80 0.82 0.82
Smurf 0.97 0.91 1.00
Table 11. Classification accuracy on the WDBC dataset for k = 3.
                         Accuracy
Class        SMART-TV   P-KNN HOBBIT   P-KNN EIN-Ring   KNN
Benign 0.96 0.70 0.96 0.98
Malignant 0.96 0.23 0.96 0.98
Table 12. Classification accuracy on the WDBC dataset for k = 5.
                         Accuracy
Class        SMART-TV   P-KNN HOBBIT   P-KNN EIN-Ring   KNN
Benign 0.95 0.68 0.96 0.97
Malignant 0.95 0.10 0.96 0.97
Table 13. Classification accuracy on the WDBC dataset for k = 7.
                         Accuracy
Class        SMART-TV   P-KNN HOBBIT   P-KNN EIN-Ring   KNN
Benign 0.95 0.68 0.98 0.97
Malignant 0.95 0.10 0.98 0.97
Tables 14-16 show the classification accuracy on the OPTICS dataset for k = 3,
k = 5, and k = 7. We used e = 0.1 and MS = 200 for SMART-TV. The results show that
P-KNN with HOBBIT is slightly more accurate than the other algorithms on classes CL-5
and CL-6. These two classes are the embedded classes, i.e., classes located inside another
class but with a different density. Some of the testing instances from these two classes were
equally well.
Tables 17-19 summarize the classification accuracy on the Iris dataset. Iris setosa is the
only class in the dataset that is linearly separable from the other classes. SMART-TV,
P-KNN using EIN-ring, and KNN classified this class accurately without any error. When
P-KNN uses HOBBIT as the similarity metric, it misclassified some of the instances. For
SMART-TV, we used e = 0.2 and MS = 20.
Table 14. Classification accuracy comparison on the OPTICS dataset for k = 3.
                    Classification Accuracy
Class        SMART-TV   P-KNN HOBBIT   P-KNN EIN-Ring   KNN
CL-1 1.00 1.00 1.00 1.00
CL-2 1.00 1.00 1.00 1.00
CL-3 1.00 1.00 1.00 1.00
CL-4 1.00 1.00 1.00 1.00
CL-5 0.94 0.96 0.93 0.94
CL-6 0.94 0.96 0.93 0.94
CL-7 1.00 1.00 1.00 1.00
CL-8 1.00 1.00 1.00 1.00
Table 15. Classification accuracy comparison on the OPTICS dataset for k = 5.
                    Classification Accuracy
Class        SMART-TV   P-KNN HOBBIT   P-KNN EIN-Ring   KNN
CL-1 1.00 1.00 1.00 1.00
CL-2 1.00 1.00 1.00 1.00
CL-3 1.00 1.00 1.00 1.00
CL-4 1.00 1.00 1.00 1.00
CL-5 0.95 0.96 0.94 0.95
CL-6 0.95 0.96 0.94 0.95
CL-7 1.00 1.00 1.00 1.00
CL-8 1.00 1.00 1.00 1.00
Table 16. Classification accuracy comparison on the OPTICS dataset for k = 7.
                    Classification Accuracy
Class        SMART-TV   P-KNN HOBBIT   P-KNN EIN-Ring   KNN
CL-1 1.00 1.00 1.00 1.00
CL-2 1.00 1.00 1.00 1.00
CL-3 1.00 1.00 1.00 1.00
CL-4 1.00 1.00 1.00 1.00
CL-5 0.95 0.95 0.94 0.95
CL-6 0.95 0.95 0.94 0.95
CL-7 1.00 1.00 1.00 1.00
CL-8 1.00 1.00 1.00 1.00
Table 17. Classification accuracy on the Iris dataset for k = 3.
                    Classification Accuracy
Class            SMART-TV   P-KNN HOBBIT   P-KNN EIN-Ring   KNN
Iris setosa 1.00 0.93 1.00 1.00
Iris versicolor 0.93 0.88 0.96 0.93
Iris virginica 0.93 0.91 0.96 0.93
Table 18. Classification accuracy on the Iris dataset for k = 5.
                    Classification Accuracy
Class            SMART-TV   P-KNN HOBBIT   P-KNN EIN-Ring   KNN
Iris setosa 1.00 0.90 1.00 1.00
Iris versicolor 0.95 0.84 0.94 0.94
Iris virginica 0.95 0.92 0.94 0.94
Table 19. Classification accuracy on the Iris dataset for k = 7.
Class | SMART-TV | P-KNN HOBBIT | P-KNN EIN-Ring | KNN
Iris setosa     | 1.00 | 0.84 | 1.00 | 1.00
Iris versicolor | 0.96 | 0.71 | 0.94 | 0.94
Iris virginica  | 0.96 | 0.92 | 0.94 | 0.94
We summarize the average classification accuracy on all datasets in Tables 20-22.
From the tables, it can be seen that the classification accuracy of SMART-TV is very
comparable to that of the KNN algorithm for all datasets. The performance of P-KNN
using HOBBIT degrades significantly when classifying high-dimensional data, as shown
on the WDBC dataset.
Table 20. Average classification accuracy for k = 3.
Dataset | SMART-TV | P-KNN HOBBIT | P-KNN EIN-Ring | KNN
KDDCUP | 0.93 | 0.91 | -    | 0.93
OPTICS | 0.99 | 0.99 | 0.99 | 0.99
IRIS   | 0.95 | 0.91 | 0.97 | 0.95
WDBC   | 0.96 | 0.47 | 0.96 | 0.98
Table 21. Average classification accuracy for k = 5.
Dataset | SMART-TV | P-KNN HOBBIT | P-KNN EIN-Ring | KNN
KDDCUP | 0.92 | 0.92 | -    | 0.93
OPTICS | 0.99 | 0.99 | 0.99 | 0.99
IRIS   | 0.97 | 0.89 | 0.96 | 0.96
WDBC   | 0.95 | 0.39 | 0.96 | 0.97
Table 22. Average classification accuracy for k = 7.
Dataset | SMART-TV | P-KNN HOBBIT | P-KNN EIN-Ring | KNN
KDDCUP | 0.92 | 0.91 | -    | 0.94
OPTICS | 0.99 | 0.99 | 0.99 | 0.99
IRIS   | 0.97 | 0.82 | 0.96 | 0.96
WDBC   | 0.95 | 0.39 | 0.98 | 0.97
5.6.4. Classification Time Comparison
We compared the performance in terms of speed using the RSI datasets. As mentioned
previously, the cardinality of these datasets varies from 32 to 96 million. In this
experiment, we compared the run time of SMART-TV against SMART-TV with scan, P-
KNN with HOBBIT metric, and a brute-force KNN that sequentially searches for the k-
nearest neighbors from the entire training set. Note that SMART-TV with scan is a slightly
different version of SMART-TV. SMART-TV with scan does not have derived P-trees of
the total variation values of the training objects and determines the superset of neighbors in
the total variation contour by scanning the values of the training objects one by one. While
scanning, the relative indexes of the objects within the contour are stored in an array. We
applied the same pruning technique for SMART-TV with scan because after the scan is
completed, a candidate mask can be created using the current version of P-Tree API by
passing the array that holds the indexes of objects within the total variation contour as the
parameter. However, as we will see in the experimental result, even though the scan is
conducted on a single dimension, i.e. the total variation values of the training objects,
SMART-TV with scan takes more time when compared to the SMART-TV with P-tree that
uses P-tree range query algorithm to determine the objects within the contour.
Figure 22 shows the run time and scalability comparison of the algorithms. The
complete run time of each algorithm is summarized in Table 23. We learned that SMART-
TV with P-tree and P-KNN are very comparable in terms of speed. Both algorithms are
faster than the other algorithms. For example, P-KNN takes 12.37 seconds on average to
classify a sample using the dataset of 96 million objects, while SMART-TV with P-tree takes
17.42 seconds. For the same dataset, SMART-TV with scan takes about 106.76 seconds to
classify and KNN takes 891.58 seconds. KNN with sequential scan is nearly two orders
of magnitude slower than the SMART-TV with P-tree and P-KNN algorithms. In addition,
KNN also requires more time and more resources to load the datasets into memory.
Since the datasets in horizontal structure could not be loaded entirely into memory at
once, for KNN the datasets were loaded partially. From this observation, it is
clear that when the cardinality increases, the classification time of KNN increases linearly.
Conversely, the algorithms that employ P-tree vertical data structure are much faster.
SMART-TV with scan is faster than KNN because it only scans a single dimension and
compares the functional value of the unclassified object with the functional values of the
training objects. In addition, the pruning technique is also incorporated into the algorithm. On
the other hand, KNN has to scan and compute the Euclidean distance at the same time.
However, SMART-TV with scan is slower than P-KNN and SMART-TV with P-tree.
[Figure: classification time (seconds per sample, 0 to 900) versus dataset cardinality (0 to 96 million) for SMART-TV with P-tree, SMART-TV with scan, P-KNN with HOBBIT, and KNN.]
Figure 22. Run time and scalability comparison on the RSI dataset.
Table 23. Run time and scalability comparison on the RSI dataset.
Dataset | SMART-TV with P-tree | SMART-TV with Scan | P-KNN HOBBIT | KNN (Classifying) | KNN (Loading)
32M | 5.77  | 109.31 | 4.49  | 296.90 | 553
64M | 11.66 | 218.11 | 8.28  | 593.71 | 1024
96M | 17.42 | 324.03 | 12.37 | 891.58 | 1536
(Time in seconds; k = 5, e = 0.5, and MS = 1000.)
Table 24 shows the preprocessing time taken by SMART-TV. We can see from the
table that most of the time was consumed in computing the functional values g(a) of the
training objects. However, this is a one-time process, and computing the functional value of
a single object takes on average only approximately 0.000098 seconds. We believe that this
preprocessing time can be amortized when the number of unclassified objects being
classified is very large.
Table 24. Preprocessing time of SMART-TV algorithm on the RSI dataset.
Dataset | Computing Vector Mean | Computing Functional Values
32M | 1.42 | 3,128.75
64M | 2.75 | 6,257.49
96M | 3.81 | 9,386.24
(Time in seconds.)
We summarize the classification time of each algorithm in Table 25 when classifying
the KDDCUP, OPTICS, IRIS, and WDBC datasets. For a large training set like the KDDCUP
dataset, KNN is more than an order of magnitude slower than the vertical nearest neighbor
algorithms (Figure 23). SMART-TV and P-KNN with HOBBIT are fast and very
comparable. P-KNN using EIN-ring takes more time to classify because it has to build the
ring around the unclassified object, which requires many logical AND operations. Moreover, for
low-dimensional datasets such as the RSI datasets, P-KNN using HOBBIT is faster than the
SMART-TV algorithm.
Table 25. Average classification time.
Dataset | SMART-TV with P-tree | SMART-TV with scan | P-KNN HOBBIT | P-KNN EIN-Ring | KNN
KDDCUP | 9.93  | 15.02 | 30.79 | -      | 271.5
OPTICS | 0.022 | 0.027 | 0.002 | 0.480  | 0.061
IRIS   | 0.003 | 0.004 | 0.007 | 2.540  | 0.002
WDBC   | 0.024 | 0.026 | 3.570 | 61.490 | 0.030
(Classification time in seconds.)
[Figure: classification time (seconds per sample, 0 to 300) on the KDDCUP dataset for SMART-TV with P-tree, SMART-TV with scan, P-KNN HOBBIT, and KNN.]
Figure 23. Classification time on the KDDCUP dataset.
5.7. Conclusion
In this chapter, we have proposed a new nearest neighbor based classification
algorithm that efficiently finds the candidates of neighbors by creating a total variation
contour around the unclassified object. The objects within the contour are considered as the
superset of nearest neighbors (candidate set) and are further pruned before the k-nearest
neighbors are searched from the set. After pruning, the k-nearest neighbors vote to
determine the class label of the unclassified object. We conclude from this work that a
scalable and highly accurate nearest neighbor classification algorithm has been
developed.
We have also introduced a pruning technique that uses dimensional projections.
We believe that this novel pruning technique can only be incorporated efficiently when the
training set is represented in the P-tree vertical data structure. When the training set is not
represented in P-tree vertical structure, one must scan each dimension over and over to
count the number of neighbors that are in the candidate set in each dimension. In large
datasets, such an approach is impractical and inefficient.
One observed limitation of the proposed pruning technique is that in high
dimensional datasets, the number of dimensions that needs to be examined will be large. In
such a case, the pruning will take more time.
We have conducted performance evaluation in terms of speed, scalability, and
classification accuracy. We found that the proposed algorithm is fast and scalable to very
large datasets. We conclude that in terms of speed and scalability, the proposed algorithm
is comparable to the other vertical nearest neighbor algorithms. In terms of classification
accuracy, the proposed algorithm is very comparable to that of the classical KNN classifier.
CHAPTER 6. THE APPLICATION OF THE PROPOSED
ALGORITHM IN IMAGE CLASSIFICATION
6.1. Introduction
The recent emergence of digital images makes the organization of images into
semantic categories for effective browsing and retrieval an interesting and challenging
problem. In small image repositories, manual image categorization can to some extent be
used to label images, but in large image repositories such an approach becomes impractical.
Different classification algorithms have been proposed in the literature to categorize
digital images, such as the Bayesian classifier [41], Support Vector Machine [42], Neural
Network [34], and k-nearest neighbor (KNN) classifiers [43]. KNN classifiers are
commonly used for classification due to their simplicity and good performance. One potential
drawback of KNN classifiers is that finding the k-nearest neighbors can be time-consuming
when image repositories are very large, i.e., contain millions of images.
Various techniques have been proposed to accelerate the k-nearest search in large
image repositories, including the use of indexes and tree structures such as k-d trees. The
algorithm proposed in [12] improves the k-nearest neighbor search for image retrieval. The
algorithm decomposes each training dimension and maintains it in a separate table. The
first several dimensions are scanned horizontally to find the partial distance between the
query image and the data in the repository. Good acceleration time was reported when the
algorithm was used to search for the nearest neighbors. However, the time for scanning the
dimensions will be significant when the database is very large.
3 This chapter is a modified version of a published paper which appeared in the Proceedings of the 1st IEEE International Workshop on Multimedia Databases and Data Management 2006 (IEEE MDDM 06), Atlanta, Georgia, USA, April 8, 2006, with a slightly different title.
Another line of research is to reduce image feature dimensions to alleviate the
intensive distance computation involved. A technique such as Principal Component Analysis (PCA)
is commonly used. Multi-resolution feature representation has also been investigated. As
opposed to the dimensionality reduction approach, in the multi-resolution approach the feature
vectors of images are represented in multiple resolutions. Similarity search starts at the
low-resolution level. If the distance is greater than the minimum distance bound at this
level, the candidates are removed without calculating the full-resolution distance. This
approach can reduce the computational complexity dramatically [44].
This chapter is intended to study the proposed algorithm, SMART-TV, when
applied to an image classification task. In image classification, the images are represented
in a high dimensional space, constructed from color distribution, image texture, image
structure (shape) or a combination [45]. In this work, we use color and texture features to
represent the images. The combination of these two features creates a 70-dimensional
feature vector for each image. P-trees are then generated from these feature vectors.
The empirical experiments on the Corel dataset [46] show that SMART-TV works
well on high-dimensional datasets. In addition, the classification accuracy is high and
very comparable to that of the KNN classifier.
6.2. Image Preprocessing
Color and texture features are explored in this work. These features are extracted
from the original pixel representation of the images and used as image representatives. We
used MATLAB software to extract both color and texture features.
For the color feature, we created a 54-dimensional global color histogram (6x3x3) in
HSV color space: the Hue component (the gradation of the color) is partitioned into 6 bins,
and the Saturation (grayness) and Value (brightness) components are divided into 3 bins
each. The combination of these bins produces a 54-dimensional global color histogram. No
universal ratio has been defined for this partitioning; the main consideration is always the
tradeoff between computational cost and performance. The HSV color model is preferable
over an alternative model, such as RGB, because it is a uniform color model with color
values normalized to the range of 0 to 1.
We extracted the texture feature of the images using Gabor filters. The MATLAB
codes for Gabor filtering and texture feature extraction were downloaded from
http://vision.ece.ucsb.edu/texture/software/ [47]. Two parameters are needed for generating
the filters: the scale and the orientation. The scale parameter captures the scale of the
texture, while the orientation parameter captures its direction in the image. We used 2
scales and 4 orientations (the default values) given in the codes. The
combination of 2 scales and 4 orientations produced 8 filters. We filtered each image using
those 8 filters, and took the mean and standard deviation of the pixels of the filtered image.
Because we have 8 filters, and for each filter, the mean and standard deviation of the
filtered image are computed, a total of 16 texture features were extracted from each image.
We adopted this feature-extraction approach entirely from [47].
Finally, the combination of the 54-dimension global color histogram and 16 texture
features produced 70-dimensional feature vectors for each image. We normalized these
features using Gaussian normalization so that each dimension has a range between 0 and 1
to put equal emphasis on each feature [45]. Then, P-tree vertical data representations are
generated from these normalized features.
The same preprocessing discussed in Section 5.3.1 is applied to the training image
set. In this work, the same logarithmic transformation of the total variation is used again; a
constant 1 is added inside the logarithm to avoid a singularity of the logarithmic
transformation when the total variation is zero.
First, the vector mean of the training set is determined. Then, the functional values
of each training image are computed, and derived P-trees of these functional values are
created. The same classifying algorithm defined in Section 5.3.2 is used for classification.
6.3. Experimental Results
In order to demonstrate the efficiency and effectiveness of the proposed algorithm
in image classification, we conducted several experiments. The experiments were
performed on an Intel Pentium 4 CPU 2.6 GHz machine with 3.8GB RAM running Red
Hat Linux version 2.4.20-8smp. We had exclusive access to this machine, so the issue that
there might be other users inconsistently slowing the computer down is irrelevant.
We compared the proposed algorithm with the same KNN classifier used in the
previous chapter. We used general-purpose Corel images
(http://wang.ist.psu.edu/docs/related/) [46] for the performance evaluations. The dataset has
10 categories, each of which contains 100 images. Figure 24 shows some of the images in
each class.
The classification accuracy comparison was performed on this dataset directly.
However, for scalability and run time comparison, we cannot use the dataset directly
because the number of images is very small. Thus, we randomly generated several larger
datasets from the original Corel dataset. The size of these synthetic datasets varies from
100,000 to 500,000 images.
[Figure: sample images from each of the ten classes: African People and Village, Beach, Building, Bus, Dinosaur, Elephant, Flower, Horse, Mountain & Glacier, and Food.]
Figure 24. The classes of the images.
6.3.1. An Example on the Corel Dataset
Figure 25 visualizes an example of how the SMART-TV algorithm classifies a new image
from the Corel dataset. In this example, we used k = 5, e = 0.023, and MS = 10. The
number of images selected as candidates of neighbors is 64. Since the manageable size
of neighbors is specified as 10 and the number of neighbors in the candidate set is greater
than MS, the candidate set is pruned. Only 9 images remain in the candidate set after
pruning.
[Figure: the new image, the images in the candidate set with their functional values (about 7.25 to 7.33) written under them, the pruned set, the k-nearest neighbors with their Euclidean distances to the new image, and the vote histogram predicting the category Dinosaur.]
Figure 25. Example using Corel dataset with pruning.
Let a be the new image, and let b and c be the two vectors located on the lower and upper
sides of a. The algorithm computes the functional values of b and c. For this discussion, let
us assume that the functional value of a is also computed, i.e., 7.2890, and the functional
values of b and c are 7.2507 and 7.3267, respectively. The functional value of a should lie
between the functional values of b and c, such that g(b) ≤ g(a) ≤ g(c). In this example, we
can see that the condition holds. Moreover, the images that are selected as candidates
should have functional values in the range [7.2507, 7.3267]. For clarity, we write the
functional values of each image in the candidate and pruned sets under the images. For
each image in the k-nearest neighbor set, we write the pairwise Euclidean distance between the
image and the new image.
After the interval is found, the algorithm determines the candidates of neighbors
using the P-tree range query algorithm summarized in Figure 12. After pruning, since k = 5,
the five nearest neighbors are searched from the pruned set. These k-nearest neighbors are
shown in the last column of the figure. Each of the neighbors is given a weighted vote
based on the Gaussian weighting function so that the closest neighbor will have the highest
weight and the weight decreases gradually as the distance increases.
6.3.2. Classification Accuracy
The classification accuracy is measured using the F-score, as discussed previously in
Chapter 5. We used a variant of five-fold cross-validation to test the accuracy. We
randomly produced five different disjoint training and testing sets. For each testing set, 50
images were randomly selected so that each class contains 5 images. The remaining 950
images were used as the training set. The accuracy results are averaged over all disjoint
subsets. Table 26 summarizes the classification accuracy using different k. We used e = 0.1
and MS = 300 for SMART-TV.
Table 26. Classification accuracy comparison using k = 3, k = 5, and k = 7.
Category | SMART-TV k=3 | SMART-TV k=5 | SMART-TV k=7 | KNN k=3 | KNN k=5 | KNN k=7
African People and Village | 0.72 | 0.80 | 0.73 | 0.84 | 0.81 | 0.81
Beach                | 0.70 | 0.70 | 0.71 | 0.75 | 0.85 | 0.80
Building             | 0.61 | 0.62 | 0.60 | 0.55 | 0.65 | 0.69
Buses                | 0.83 | 0.85 | 0.83 | 0.85 | 0.90 | 0.88
Dinosaur             | 0.94 | 0.96 | 0.96 | 0.94 | 0.94 | 0.94
Elephant             | 0.56 | 0.70 | 0.70 | 0.60 | 0.62 | 0.66
Flower               | 0.81 | 0.80 | 0.80 | 0.90 | 0.94 | 0.96
Horse                | 0.91 | 0.90 | 0.88 | 0.96 | 0.96 | 0.96
Mountain and Glacier | 0.65 | 0.63 | 0.70 | 0.68 | 0.73 | 0.66
Food                 | 0.81 | 0.85 | 0.85 | 0.75 | 0.76 | 0.75
Average Accuracy     | 0.75 | 0.78 | 0.78 | 0.78 | 0.82 | 0.81
The results show that both KNN and SMART-TV perform equally well for most of
the classes. Only for the Building, Elephant, and Mountain and Glacier classes is the accuracy
under 70%. From these results, the same accuracy trend can be seen clearly: when KNN
produces high accuracy for classes such as Dinosaur, Horse, and Buses, SMART-TV also
produces high accuracy. Thus, we conclude that the classification accuracy of SMART-TV is
very comparable to that of the KNN classifier, even in a high-dimensional dataset like the image
categorization problem.
6.3.3. Classification Time Comparison
The classification time comparison was performed using several synthetic datasets
with different cardinality. These datasets were randomly generated from the original Corel
images. The cardinality of these datasets varies from 100,000 to 500,000. We evaluated
the classification time using k = 3, k = 5, and k = 7. However, since the same time trends
were found, we only show the classification time for k = 5. This similar time trend
occurred because most of the time in nearest neighbor classifiers is consumed in searching
for the nearest neighbors. Hence, varying k slightly does not make much
difference in the overall classification time.
Figure 26 shows that SMART-TV is fast in categorizing images when compared to
KNN. For the largest dataset, containing 500,000 images and 70 feature attributes, SMART-
TV with P-trees takes approximately 1.813 seconds on average to classify, while KNN
takes about 25.874 seconds. SMART-TV with scan is slower than SMART-TV with
P-tree since it scans the functional values of the training images to find the candidates of
neighbors; it takes about 4.415 seconds on average to classify.
From this observation, we are convinced that the use of P-trees and the range query
algorithm to find the candidates of neighbors can speed up the classification time
significantly.
We summarize the preprocessing time taken by the SMART-TV algorithm on the
image sets containing 70 feature attributes in Table 27. Note that the preprocessing phase is
a one-time process. The vector mean can be computed very fast using the P-tree vertical
structure. For the largest image set, the time to compute all functional values of the training
images is 491.33 seconds, or about 8.2 minutes. The actual time to compute the functional
values of a single image is approximately 0.00098 seconds.
[Figure: classification time (seconds per sample, 0 to 30) versus image set cardinality (100,000 to 500,000) for SMART-TV with P-tree, SMART-TV with scan, and KNN.]
Figure 26. Classification time, k =5, e = 0.01, and MS=1000.
Table 27. Preprocessing time of SMART-TV algorithm on the Corel dataset.
Dataset (Total Images) | Computing Vector Mean | Computing Functional Values
100,000 | 0.02 | 98.00
200,000 | 0.04 | 197.18
300,000 | 0.07 | 295.93
400,000 | 0.10 | 302.94
500,000 | 0.12 | 491.33
(Time in seconds.)
6.4. Conclusion
In this chapter, we have demonstrated through some experimentation that SMART-
TV is an efficient and effective nearest neighbor based classification algorithm that can be
used for image classification. While some algorithms suffer when classifying high-dimensional
datasets, in this work we showed that SMART-TV performs well on such datasets. In
addition, the performance evaluation on general-purpose Corel images shows that for large
image repositories, SMART-TV offers a substantial speed improvement over the naïve KNN
classifier that uses a brute-force approach to find the nearest neighbors. The classification
accuracy of SMART-TV for image categorization is also very comparable to that of the
KNN classifier.
Image retrieval is an interesting and challenging problem. One of the future
directions of this work is to apply the same approach used in SMART-TV algorithm for
image retrieval. The ability of SMART-TV algorithm to filter the candidates of neighbors
opens a window of opportunities to apply the same idea in image retrieval. We have two
reasons for this.
1. The images that are not really relevant to the query image can be eliminated efficiently
by forming the total variation contour.
2. The pruning technique then prunes the candidates so that the ranking algorithm will be
performed on a small number of images. For the image classification problem, we have
demonstrated that our approach works well in finding the right candidates. Hence, we
believe that it will also work for image retrieval.
CHAPTER 7. INTEGRATING THE PROPOSED METHOD
INTO DATAMIMETM
7.1. Introduction
DataMIMETM is a vertical data-mining prototype based on P-tree technology [48]. It
is a client-server based system that provides different data-mining functionalities, such as
association rule mining, classification, and clustering. The system can be accessed from the
following URL: http://midas.cs.ndsu.nodak.edu/~datasurg/datamime. The architecture of
DataMIMETM was designed to provide flexibility for new vertical data-mining
algorithms to be added. The data-mining applications, data capturing, and data integration
into P-tree vertical format are executed on the server-side, while the interaction with users is
facilitated through the system's graphical user interface.
On the server-side, new algorithms are added into the Data Mining Algorithm
(DMA) layer [49]. The Data Capture and Integration (DCI) layer is responsible for data
capturing and integration. On the client-side, DMA and DCI layers are also available. The
client-side DMA gathers the required information for mining, sends it to the server-side
DMA, and waits for a response. The client-side DCI collects datasets and metadata, and
sends them to the server-side DCI.
SMART-TV has been integrated into DataMIMETM as a proof-of-concept of a new
classification algorithm that uses P-tree technology. As a classification algorithm, SMART-
TV is grouped together with the currently developed classification algorithms, such as
P-KNN, PINE, P-Bayesian, and P-SVM. Similar to the other classification algorithm user
interfaces, the SMART-TV user interface allows users to specify the required parameters
and to predict a class label of a single sample or a bulk of unclassified samples.
7.2. Server-Side Components
In order to integrate into DataMIMETM, all new classification algorithms must
implement the methods defined in the PredictionModel class [49]. SMART-TV
implemented this class as well. The PredictionModel is an interface containing five
important methods. The first method is the predict method. In this method, specific
implementation of how an algorithm predicts a class label for a given sample is written.
Different algorithms can have different implementations, but the parameters passed through
this method should be the same. The second method is vote_histogram. This method
collects all class label items and their corresponding vote values, which are later used in the
client-side classification interface for vote histogram visualization. The setPTreeSet
method is the simplest method; it only sets a given P-tree set as the training set. The
last two methods, setClassLabel and getClassLabel, are the methods
responsible for specifying the class label attribute and returning the class label attribute,
respectively.
The new module is integrated into the server-side through a Predictor class. All the
mining keys (parameters) required are defined in the MiningKeys class and used in the
Predictor class. For the SMART-TV module, the following mining keys were defined in the
MiningKeys class:

    static string EPS_VALUE;
    static string MS_VALUE;

We did not define the K_VALUE mining key because this key has already been defined
previously by the other classification algorithms. Figure 27 shows part of the SMART-TV
code segments in the Predictor class:

    Predictor::Predictor(Properties *pr, const DataPath& dp)
    {
        pSet.load(dp.ptree_dir() + id);
        PTreeInfo pInfo = pSet.getPTreeInfo();
        class_index = pInfo.getAttributeIndex(class_label);
        ...
        else if (alg == "SMART-TV") {
            string ks = request->getProperty(MiningKeys::K_VALUE);
            string ms = request->getProperty(MiningKeys::MS_VALUE);
            string ep = request->getProperty(MiningKeys::EPS_VALUE);
            int k = atoi(ks.c_str());
            int mansize = atoi(ms.c_str());
            double eps = atof(ep.c_str());
            pModel = new SmartTVModel(pSet, k, eps, mansize,
                                      dp.ptree_dir() + id);
            pModel->setClassLabel(class_label);
        }
        ...
    }

Figure 27. Code segments of SMART-TV in the Predictor class.
7.3. Client-Side Components
The communication between the server-side and the client-side is established by
passing the mining keys defined in the algorithm. The mining keys specified on the
server-side are also defined in the MiningKeys class on the client-side, and they should
appear with the same name.
7.4. Graphical User Interface
We grouped the SMART-TV algorithm together with the other classification
algorithms currently present in the system. Figure 28 shows a snap shot of SMART-TV
graphical user interface in DataMIMETM. The graphical user interface of SMART-TV is
mostly similar to those of the other classification algorithms. The only difference is in the input
parameters.
Figure 28. Graphical user interface for mining with SMART-TV algorithm.
As shown in the figure, the parameters for the SMART-TV algorithm are the
number of k-nearest neighbors, the epsilon of the contour, and the manageable size
required for pruning. The default values of each parameter are given as a guideline for
the users.
Figure 29 shows the graphical user interface of the classification results. The results
are presented in a table format and can be saved into a file by right-clicking on the table.
Figure 29. Graphical user interface showing the classification results.
Figure 30 shows the graphical user interface of the vote histogram. The vote
histogram is displayed for each selected sample in the table. The histogram changes
automatically when the users scroll down the table and select the next sample. The
Performance tab in the panel summarizes the average classification time per sample, and
the Vote Table tab shows the vote values of each class for each corresponding sample.
Figure 30. Graphical user interface showing the vote histogram.
Figure 31 shows the graphical user interface of the performance results of the 10%
holdout validation. In this validation model, 10% of the training samples are randomly
selected as the testing set; the rest of the samples form the training set. Note
that SMART-TV has to compute the functional values of each training object first and then
classifies each of the samples in the testing set. The accuracy of the validation is
displayed in the performance part.
Figure 31. Graphical user interface showing the performance of a validation.
CHAPTER 8. CONCLUSION AND FUTURE WORK
8.1. Conclusion
This dissertation focuses on the scalability of classification algorithms. The work is
motivated by the fact that the state-of-the-art nearest neighbor based classification
algorithms are not scalable to very large datasets.
This dissertation is mainly based on several research projects and papers published
in [2, 3, 4, 5, 6]. In one of the projects, we proposed a vertical approach to compute the set
squared distance, which measures the total variation of a set of objects about a given object in
large datasets. We discovered that the COUNT operations in the vertical set squared distance
are independent of the input value. Thus, executing the COUNT operations in advance and
retaining the count values is the best approach to expedite the computation of total
variation. The empirical results have shown that the vertical approach to compute total
variation is extremely fast and scalable to very large datasets, as opposed to the horizontal
approach, regardless of the specification of the machines.
We have extended the use of vertical total variation to classification and proposed a
novel nearest-neighbor-based classification algorithm called SMART-TV. The proposed
algorithm efficiently filters the candidate nearest neighbors by forming a total variation
contour around the unclassified object. The objects within the contour are considered a
superset of the nearest neighbors and are identified efficiently using the P-tree range query
algorithm, without scanning the total variation values of the training objects one by one.
Because the candidate set may contain neighbors that are not truly close to the
unclassified object, we proposed a pruning technique that uses dimensional projections.
The pruning technique uses the basic P-trees of each dimension directly. After pruning, the
k-nearest neighbors are selected from the pruned neighbor set. One observed limitation of
the proposed pruning technique is that in high-dimensional datasets the number of
dimensions that must be examined becomes large; in such cases, the overall
classification time may increase slightly.
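A minimal sketch of the contour filtering step, under the simplifying assumption that the precomputed total variation values are kept sorted: a binary search over the sorted values stands in here for the P-tree range query, and the function name `contourCandidates` is illustrative, not part of the implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Return the indices of all training objects whose total variation value
// falls inside the contour [lo, hi]. sortedTV holds (tv, row index) pairs
// sorted by tv; std::lower_bound finds the contour's lower edge, so no
// object outside the range is ever examined one by one.
std::vector<std::size_t> contourCandidates(
        const std::vector<std::pair<double, std::size_t> >& sortedTV,
        double lo, double hi) {
    std::vector<std::size_t> out;
    std::vector<std::pair<double, std::size_t> >::const_iterator it =
        std::lower_bound(sortedTV.begin(), sortedTV.end(),
                         std::make_pair(lo, std::size_t(0)));
    for (; it != sortedTV.end() && it->first <= hi; ++it)
        out.push_back(it->second);
    return out;
}
```

The returned indices are the superset of nearest neighbors that the dimensional-projection pruning would then refine.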
We have conducted extensive performance evaluations in terms of speed,
scalability, and classification accuracy. The results were analyzed thoroughly and can be
summarized as follows:
1. In terms of speed and scalability, the proposed algorithm is very comparable to the
other vertical nearest-neighbor algorithms and outperforms the classification
algorithms that use a scanning approach.
2. In terms of classification accuracy, the proposed algorithm is very comparable to the
KNN classification algorithm. We also found that the proposed algorithm classifies
well on datasets with an unbalanced number of objects per class, such as the
KDDCUP dataset, and on high-dimensional datasets, such as the Corel image and
WDBC datasets.
We have also studied and tested the proposed algorithm on an image classification
problem. The empirical results on general-purpose Corel images show that the proposed
algorithm is fast and scalable for classifying images compared to the KNN classifier. In
these experiments, we again found that the classification accuracy of the proposed
algorithm is very comparable to that of the KNN classifier.
In summary, this dissertation addresses the scalability issues in classification. We
conclude that a scalable and highly accurate nearest-neighbor classification algorithm has
been developed. The proposed algorithm employs the P-tree vertical data representation, a
vertical representation that has been experimentally shown to address the curse
of scalability and to facilitate efficient data mining over large datasets. We are convinced
that the proposed algorithm can be used in many different classification problems
involving large datasets.
8.2. Future Work
As for the vertical total variation algorithm, one of our future directions is to
develop an efficient way to update the COUNT values without having to compute all of
them when some values in the dataset are updated, e.g., some values in particular
dimensions. In the current version of the algorithm, when some values in the set are
changed, regardless of whether the changes occurred in one dimension or many, all count
values must be recomputed. In the cases where the dimensions of the changes can be well
identified, the algorithm should only update the count values in those dimensions. The
other count values should remain unchanged. In large datasets, this strategy can save a lot
of time.
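The proposed per-dimension update can be sketched as follows, again assuming the retained counts are per-dimension sums and sums of squares; the names `DimAggregates` and `update` are hypothetical and only illustrate the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative per-dimension aggregates; only the dimension whose value
// changed is touched, all other dimensions keep their retained counts.
struct DimAggregates {
    std::vector<double> s1, s2;   // per-dimension sum and sum of squares

    // Apply a single-cell change (value oldV -> newV in dimension j).
    // Cost is O(1) instead of a full recomputation over the dataset.
    void update(std::size_t j, double oldV, double newV) {
        s1[j] += newV - oldV;
        s2[j] += newV * newV - oldV * oldV;
    }
};
```

Multiple changes in the same dimension simply apply `update` repeatedly; dimensions that did not change are never visited, which is the saving argued for above.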
One future direction for the proposed algorithm is to devise a strategy for
automatically setting the epsilon parameter based on the inherent features of the training
set and the unclassified object. In the current implementation, a single global epsilon is
used for all unclassified objects. Although the experiments demonstrated that good
classification accuracy can be achieved with a single global epsilon, it would be even
better if epsilon could be adjusted based on the inherent features of each unclassified
object, so that the number of neighbors filtered into the candidate set is balanced across
unclassified objects.
In the current implementation, the non-closed k-nearest neighbor set is used to
determine the class label of the unclassified object. The non-closed k-nearest neighbor set
is managed in a heap structure, similar to the KNN algorithm used for comparison. The
experimental results show that the accuracy is about the same as that of the classical KNN
algorithm; it varies slightly depending on which of the tied kth nearest neighbors is
picked. For future work, we would like to observe whether using a closed k-nearest
neighbor set in the SMART-TV algorithm gives better classification accuracy.
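The heap-managed (non-closed) k-nearest neighbor set can be sketched with a max-heap that retains only the k smallest distances seen so far; ties at the kth position are broken by arrival order, which is the source of the slight accuracy variation. The function name `kSmallestDistances` is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Keep the k smallest distances using a max-heap: the root is the worst
// neighbor currently kept, so any closer candidate evicts it in O(log k).
std::vector<double> kSmallestDistances(const std::vector<double>& dists,
                                       std::size_t k) {
    std::priority_queue<double> heap;          // max-heap of kept distances
    for (std::size_t i = 0; i < dists.size(); ++i) {
        if (heap.size() < k)
            heap.push(dists[i]);
        else if (dists[i] < heap.top()) {      // closer than worst kept neighbor
            heap.pop();
            heap.push(dists[i]);
        }                                       // ties (==) keep the earlier arrival
    }
    std::vector<double> out;                   // drained in descending order
    while (!heap.empty()) {
        out.push_back(heap.top());
        heap.pop();
    }
    return out;
}
```

A closed k-nearest neighbor set would instead keep every candidate whose distance equals the kth smallest, growing past k on ties.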
Another promising direction for future work is to evaluate the proposed algorithm
in other popular domains, such as bioinformatics and unstructured data (text). Text is
known to have very high dimensionality because each word (term) in a document is
treated as one dimension. It would be interesting and challenging to analyze how the
concepts of the proposed algorithm carry over to those domains. Another possibility is to
adopt the same concept for retrieval problems. The ability of the proposed algorithm to
filter candidate neighbors makes it possible to quickly determine a superset of the objects
most relevant to the query object; objects that are not sufficiently relevant can be
eliminated efficiently, and the ranking algorithm is then performed only on the pruned
relevant set. We have demonstrated that our approach works well at finding the right
candidate neighbors in classification problems; therefore, we are convinced that the same
idea will also work for retrieval problems.
We also see opportunities for vertical total variation in clustering and outlier
detection analysis. The combination of total variation and dimensional projection
contours could perhaps be used to discover special types of clusters in the space, such as
projective and oblique clusters.
REFERENCES
[1] J. Han and M. Kamber, “Data Mining Concepts and Techniques,” 2nd edition, Morgan
Kaufmann Publishers, San Francisco, CA, 2006.
[2] T. Abidin, A. Perera, M. Serazi, and W. Perrizo, “A Vertical Approach to Computing
Set Squared Distance,” International Journal of Computers and their Applications
(IJCA), vol. 13, no. 2, pp. 94-102, June 6, 2006.
[3] T. Abidin and W. Perrizo, “SMART-TV: A Fast and Scalable Nearest Neighbor Based
Classifier for Data Mining,” Proceedings of the 21st ACM Symposium on Applied
Computing (SAC-06), pp. 536-540, Dijon, France, April 23-27, 2006.
[4] T. Abidin, A. Dong, H. Li, and W. Perrizo, “Efficient Image Classification on
Vertically Decomposed Data,” Proceedings of the 1st IEEE International Workshop
on Multimedia Databases and Data Management (MDDM-06), Atlanta, Georgia,
April 8, 2006.
[5] T. Abidin, A. Perera, M. Serazi, and W. Perrizo, “Vertical Set Squared Distance: A
Fast and Scalable Technique to Compute Total Variation in Large Datasets,”
Proceedings of the 20th ISCA International Conference on Computers and Their
Applications (CATA-05), pp. 60-65, New Orleans, Louisiana, March 16-18, 2005.
[6] T. Abidin and W. Perrizo, “An Alternative Arrangement of Symmetric Datasets for
Vertical Clustering Algorithms,” Proceedings of the 21st ISCA International
Conference on Computers and their Applications (CATA-06), Seattle, Washington,
March 23-25, 2006.
[7] Sorcerer Expedition, http://www.sorcerer2expedition.org/version1/HTML/main.htm,
February 6, 2006.
[8] V. Vapnik, “The Nature of Statistical Learning Theory,” Springer-Verlag Publisher,
New York, NY, 1995.
[9] H. Byun and S.W. Lee, “A Survey on Pattern Recognition Applications of Support
Vector Machines,” International Journal of Pattern Recognition and Artificial
Intelligence, 17(3), pp. 459-486, 2003.
[10] O.L. Mangasarian and D.R. Musicant, “Lagrangian Support Vector Machines,” Journal
of Machine Learning Research, vol. 1, pp. 161-177, 2001.
[11] T.M. Cover and P.E. Hart, “Nearest Neighbor Pattern Classification,” IEEE
Transactions on Information Theory, IT 13, pp. 21-27, 1967.
[12] A.P. Vries, N. Mamoulis, N. Nes, and M. Kersten, “Efficient k-NN Search on
Vertically Decomposed Data,” Proceedings of the ACM SIGMOD, pp. 322-333,
2002.
[13] ANN: A Library for Approximate Nearest Neighbor Searching,
http://www.cs.umd.edu/~mount/ANN/, January 2006.
[14] J. Han, J. Pei, and Y. Yin, “Mining Frequent Patterns without Candidate Generation,”
Proceedings of the ACM International Conference on Management of Data
(SIGMOD), Dallas, TX, 2000.
[15] J. Gray, “The Next Database Revolution,” Proceedings of the 10th ACM SIGMOD, pp.
1-4, Paris, 2004.
[16] R. Jin and G. Agrawal, “A Middleware for Developing Parallel Data Mining
Implementations,” Proceedings of the 1st SIAM Conference in Data Mining, April
2001.
[17] M. Serazi, “A Super-Max Data Mining Benchmark by Vertically Structuring Data,”
Ph.D. Thesis, North Dakota State University, Fargo, ND, 2005.
[18] W. Perrizo, “Peano Count Tree Technology Lab Notes,” Technical Report
NDSU-CS-TR-01-1, North Dakota State University, Computer Science Department,
http://www.cs.ndsu.nodak.edu/~perrizo/classes/785/pct.html, 2001.
[19] Data Mining Tutorials, http://www.eruditionhome.com/datamining/overview.html,
March 1, 2006.
[20] R.T. Ng and J. Han, “CLARANS: A Method for Clustering Objects for Spatial Data
Mining,” IEEE Transactions on Knowledge and Data Engineering, vol. 14(5), pp.
1003-1016, September/October 2002.
[21] M. Kantardzic, “Data Mining: Concepts, Models, Methods, and Algorithms,” IEEE
Press, John Wiley and Sons, Inc., New Jersey, 2003.
[22] J. R. Quinlan, “Induction of Decision Trees,” Machine Learning, vol. 1, pp. 81-106,
1986.
[23] S. Mitra and T. Acharya, “Data Mining: Multimedia, Soft Computing, and
Bioinformatics,” John Wiley and Sons, Inc., New Jersey, 2003.
[24] K. Alsabti, S. Ranka and V. Singh, “CLOUDS: A Decision Tree Classifier for Large
Datasets,” Proceedings of the ACM SIGKDD, pp. 2-8, 1998.
[25] D. Hand, H. Mannila, and P. Smyth, “Principles of Data Mining,” MIT Press,
Massachusetts, 2001.
[26] M. Khan, Q. Ding, and W. Perrizo, “K-nearest Neighbor Classification on Spatial Data
Stream Using P-trees,” Proceedings of the Pacific-Asia Conference on Knowledge
Discovery and Data Mining, pp. 517-528, Taipei, Taiwan, May 2002.
[27] W. Perrizo, Q. Ding, A. Denton, K. Scott, Q. Ding, and M. Khan, “PINE – Podium
Incremental Neighbor Evaluator for Classifying Spatial Data,” Proceedings of the
ACM Symposium on Applied Computing, pp. 503-508, Melbourne, FL, August
2003.
[28] A. Perera, T. Abidin, M. Serazi, G. Hamer, and W. Perrizo, “Vertical Set Squared
Distance Based Clustering without Prior Knowledge of K,” Proceedings of the 14th
International Conference on Intelligent and Adaptive Systems and Software
Engineering (IASSE-05), pp. 72-77, Toronto, Canada, July 20-22, 2005.
[29] R. Syamala, T. Abidin, and W. Perrizo, “Clustering Microarray Data Based on Density
and Shared Nearest Neighbor Measure,” Proceedings of the 21st ISCA International
Conference on Computers and their Applications (CATA-06), Seattle, Washington,
March 23-25, 2006.
[30] I. Rahal, D. Ren, and W. Perrizo, “A Scalable Vertical Model for Mining Association
Rules,” Journal of Information & Knowledge Management (JIKM), vol.3, no. 4,
pp. 317-329, 2004.
[31] D. Ren, B. Wang, and W. Perrizo, “RDF: A Density-Based Outlier Detection Method
using Vertical Data Representation,” Proceedings of the 4th IEEE International
Conference on Data Mining (ICDM-04), pp. 503-506, November 1-4, 2004.
[32] W. Perrizo, M. Serazi, A. Perera, Q. Ding, and V. Malakhov, “P-Tree API Reference
Manual,” Technical Report NDSU-CSOR-TR-04-1, North Dakota State University
Fargo, ND, 2004.
[33] Q. Ding, M. Khan, A. Roy and W. Perrizo, “The P-tree Algebra,” Proceedings of
ACM Symposium on Applied Computing, pp. 426-431, Madrid, Spain, March 2002.
[34] I. Claude, R. Winzenrieth, P. Pouletaut, and J. Boulanger, “Contour Features for
Colposcopic Image Classification by Artificial Neural Networks,” Proceedings of the
16th International Conference on Pattern Recognition (ICPR'02), vol. 1, pp. 10771,
2002.
[35] C. G. Atkeson, A. W., Moore, and S. Schaal, “Locally Weighted Learning,” Artificial
Intelligence Review, vol. 11, no. 1-5, pp. 11-73, 1997.
[36] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz, “UCI Repository of
Machine Learning Databases,” http://www.ics.uci.edu/~mlearn/MLRepository.html,
Irvine, CA, University of California, Department of Information and Computer
Science, 1998.
[37] S. Hettich and S. Bay, “The UCI KDD Archive http://kdd.ics.uci.edu,” University of
California, Irvine, CA, Department of Information and Computer Science, 1999.
[38] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “OPTICS: Ordering Points
to Identify the Clustering Structure,” Proceedings of the ACM SIGMOD, pp. 49-60,
1999.
[39] Iris Dataset, http://www.ailab.si/orange/doc/datasets/iris.htm, July 3, 2005.
[40] R.J. Roiger and M.W. Geatz, “Data Mining: A Tutorial-Based Primer,” Addison
Wesley, NY, 2003.
[41] A. Vailaya, M. Figueiredo, A. K. Jain, and H. Zhang, “Image Classification for
Content-Based Indexing,” IEEE Transaction on Image Processing, vol. 10, no. 1, pp.
117-139, 2001.
[42] O. Chapelle, P. Haffner, and V. Vapnik, “SVM for Histogram Based Image
Classification”, IEEE Transactions on Neural Networks, 10(5), pp. 1055-1064, 1999.
[43] D. Masip and J. Vitrià, “Feature Extraction for Nearest Neighbor Classification:
Application to Gender Recognition,” International Journal of Intelligent Systems, vol.
20 (5), pp. 561-576, 2005.
[44] J. Hafner, H. S. Sawhney, W. Equitz, M. Flickner, and W. Niblack, “Efficient Color
Histogram Indexing for Quadratic Form Distance Functions,” IEEE Transactions
Pattern Analysis Machine Intelligence, vol. 17, pp. 729-736, 1995.
[45] Q. Iqbal and J. Aggarwal, “Combining Structure, Color and Texture for Image
Retrieval: A Performance Evaluation,” Proceedings of the 16th International
Conference on Pattern Recognition, vol. 2, pp. 438-443, Quebec City, Canada, 2002.
[46] J. Z. Wang, J. Li, and G. Wiederhold, “SIMPLIcity: Semantics-Sensitive Integrated
Matching for Picture Libraries,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 23 (9), pp. 947-963, 2001.
[47] B. S. Manjunath and W. Y. Ma, “Texture Features for Browsing and Retrieval of
Image Data,” IEEE Transaction on Pattern Analysis and Machine Intelligence, vol.
18(8), pp. 837-842, 1996.
[48] M. Serazi, A. Perera, Q. Ding, V. Malakhov, I. Rahal, F. Pan, D. Ren, W. Wu, and W.
Perrizo, “DataMIMETM,” Proceedings of the ACM International Conference on
Management of Data (SIGMOD), pp. 923-924, Paris, France, June 2004.
[49] W. Perrizo, M. Serazi, A. Perera, Q. Ding, and V. Malakhov, “DataMIMETM
Developer Manual,” Technical Report NDSU-CSOR-TR-04-2, North Dakota State
University Fargo, ND, 2004.
APPENDIX
A.1. SmartTVApp Class
/***************************************
 * Program: SmartTVApp.cpp
 * Author : Taufik Abidin
 *          DataSURG Research Group at CS NDSU
 ***************************************/
#include "SmartTV.h"
#include "Util.h"
#include <time.h>
#include <iostream>
#include <PTreeInfo.h>
#include <PTreeSet.h>
#include <MetaFileParser.h>
#include <DataFeeder.h>
#include <RelationalDataFeeder.h>
#include <TiffDataFeeder.h>
#include <BasicPt.h>
#include <string_token_iterator.h>
using namespace std;
int main(int argc, char **argv){
  string dataFolder   = argv[1];
  string testingData  = dataFolder + argv[2];
  string metafile     = dataFolder + argv[3];
  string ptree_set_id = argv[4];
  string ks    = argv[5];
  string ep    = argv[6];
  string msize = argv[7];
  string mode  = argv[8];
  int k       = atoi(ks.c_str());
  int mansize = atoi(msize.c_str());
  double eps  = atof(ep.c_str());
  DIR *dir = opendir(ptree_set_id.c_str());
  clock_t starttime, endtime;
  try {
    PTreeSet ps;
    double loadingPtree = 0;
    cout<<"\nStart SMART-TV classifier 2.0 using PTree..."<<endl;
    if(!dir){
      starttime = clock();
      MetaFileParser mParser(metafile);
      mParser.setDataRoot(dataFolder);
      DataFeeder *rFeeder;
      string ext = mParser.getFileExtension();
      if(ext=="tiff" || ext=="tif")
        rFeeder = new TiffDataFeeder(mParser);
      else if(ext=="data")
        rFeeder = new RelationalDataFeeder(mParser);
      ps.feed(rFeeder);
      ps.store(ptree_set_id);
      endtime = clock();
      double gentime = ((double) abs(endtime - starttime)) / CLOCKS_PER_SEC;
      cout<<"Generating Ptrees... "<<gentime<<" seconds"<<endl;
      delete rFeeder;
    }
    else {
      cout<<"Loading Ptree set..."<<endl;
      starttime = clock();
      ps.load(ptree_set_id);
      endtime = clock();
      loadingPtree = ((double) abs(endtime-starttime)) / CLOCKS_PER_SEC;
      cout<<"Done... "<<loadingPtree<<" seconds"<<endl;
    }
    PTreeInfo pi = ps.getPTreeInfo();
    cout<<"Number of Ptrees: "<<pi.num_ptrees()<<endl;
    cout<<"Number of dimension: "<<pi.degree()<<endl;
    cout<<"Number of cardinality: "<<pi.cardinality()<<endl;

    SmartTV smarttv(ps,k,eps,mansize);
    smarttv.setPTreeSetID(ptree_set_id);
    smarttv.setClassLabel("class_label");
    starttime = clock();
    smarttv.getTVs();
    endtime = clock();
    double loadingTime = ((double) abs(endtime - starttime)) / CLOCKS_PER_SEC;

    if(mode=="-l"){ // learning using testing set
      vector<string> classDom = smarttv.getClassDomain();
      double TP[classDom.size()];
      double FP[classDom.size()];
      double PX[classDom.size()];
      for(int i=0;i<classDom.size();i++){
        TP[i] = 0; FP[i] = 0; PX[i] = 0;
      }
      int N = 0;
      double totalTime = 0.0;
      ifstream teststream(testingData.c_str());
      while(!teststream.eof()){
        string line = "";
        string actualClass;
        string predictedClass = "";
        getline(teststream, line);
        if(line=="") continue;
        string_token_iterator tok(line, ", "), end;
        vector<string> v(tok, end);
        N++;
        Tuple a = pi.to_tuple(v);
        int actualClassIndex = pi.degree()-1;
        int type = (a.get(actualClassIndex))->type();
        starttime = clock();
        Item *predictedItem = smarttv.predict(a);
        if(type==Type::UNSIGNED_INT){
          UsignIntItem *va = dynamic_cast<UsignIntItem*>(a.get(actualClassIndex));
          actualClass = to_string(va->value());
          UsignIntItem *pc = dynamic_cast<UsignIntItem*>(predictedItem);
          predictedClass = to_string(pc->value());
        }
        else if(type==Type::SING_CAT){
          SingCatItem *va = dynamic_cast<SingCatItem*>(a.get(actualClassIndex));
          actualClass = va->value();
          SingCatItem *pc = dynamic_cast<SingCatItem*>(predictedItem);
          predictedClass = to_string(pc->value());
        }
        endtime = clock();
        double predictedTime = ((double) (endtime - starttime)) / CLOCKS_PER_SEC;
        totalTime += predictedTime;
        int pos = -1;
        int getPos = 0;
        int updatedPX = 0;
        for(int i=0;i<classDom.size();i++){
          if((!updatedPX)&&(classDom.at(i)==actualClass)){
            PX[i]++;
            updatedPX = 1;
          }
          if((!getPos)&&(classDom.at(i)==predictedClass)){
            pos = i;
            getPos = 1;
          }
        }
        if(actualClass == predictedClass)
          TP[pos]++;
        else
          FP[pos]++;
      }
      cout<<"\nk: "<<k<<endl;
      cout<<"eps: "<<eps<<endl;
      cout<<"ManSize: "<<mansize<<endl;
      cout<<"Testset: "<<testingData<<endl;
      cout<<"Total new samples: "<<N<<endl;
      cout<<"Total prediction time: "<<totalTime<<" seconds"<<endl;
      cout<<"Time/sample: "<<totalTime/N<<" seconds"<<endl;
      cout<<"Time for loading PTrees, RCs, and TVs: "
          <<loadingPtree + loadingTime<<" seconds"<<endl;
      for(int i=0;i<classDom.size();i++){
        cout<<"\nClass: "<<classDom.at(i)<<endl;
        cout<<"|X| = "<<PX[i]<<endl;
        cout<<"TP = "<<TP[i]<<endl;
        cout<<"FP = "<<FP[i]<<endl;
        double R = TP[i]/PX[i];
        double P = TP[i]/(TP[i]+FP[i]);
        cout<<"Recall = "<<R<<endl;
        cout<<"Precision = "<<P<<endl;
        cout<<"F = "<<(2*P*R)/(P+R)<<endl;
      }
      teststream.close();
    }
    else{
      int N = 0;
      double totalTime = 0.0;
      ifstream teststream(testingData.c_str());
      string result = argv[2];
      ofstream outpstream((result + ".result").c_str());
      while(!teststream.eof()){
        string line = "";
        string predictedClass = "";
        getline(teststream, line);
        if(line=="") continue;
        string_token_iterator tok(line, ", "), end;
        vector<string> v(tok, end);
        N++;
        Tuple a = pi.to_tuple(v);
        int actualClassIndex = pi.degree()-1;
        int type = (a.get(actualClassIndex))->type();
        starttime = clock();
        Item *predictedItem = smarttv.predict(a);
        if(type==Type::UNSIGNED_INT){
          UsignIntItem *pc = dynamic_cast<UsignIntItem*>(predictedItem);
          predictedClass = to_string(pc->value());
        }
        else if(type==Type::SING_CAT){
          SingCatItem *pc = dynamic_cast<SingCatItem*>(predictedItem);
          predictedClass = to_string(pc->value());
        }
        outpstream<<a<<" "<<predictedClass<<endl;
        endtime = clock();
        double predictedTime = ((double) (endtime - starttime)) / CLOCKS_PER_SEC;
        totalTime += predictedTime;
      }
      outpstream<<"\nk: "<<k<<endl;
      outpstream<<"eps: "<<eps<<endl;
      outpstream<<"ManSize: "<<mansize<<endl;
      outpstream<<"Total new samples: "<<N<<endl;
      outpstream<<"Prediction time: "<<totalTime<<" seconds"<<endl;
      outpstream<<"Time/sample: "<<totalTime/N<<" seconds"<<endl;
      outpstream<<"Time for loading PTrees, RCs, and TVs: "
                <<loadingPtree + loadingTime<<" seconds"<<endl;
      cout<<"Done..."<<endl;
      teststream.close();
      outpstream.close();
    }
  }
  catch(const exception& ex){
    cout<<ex.what()<<endl;
  }
  return (EXIT_SUCCESS);
}
A.2. SmartTV Header Class
/***************************************
 * Program: SmartTV.h
 * Author : Taufik Abidin
 *          DataSURG Research Group at CS NDSU
 ***************************************/
#ifndef _SMART_TV_H_
#define _SMART_TV_H_

#include <vector>
#include <PTreeSet.h>
#include <Tuple.h>
#include <BasicPt.h>
#include <boost/dynamic_bitset.hpp>
#include "PredictionModel.h"

typedef boost::dynamic_bitset<> boost_bitset;
typedef struct heap {
  size_t idx;
  double val;
} Heap;

class SmartTV : public PredictionModel{
 public:
  SmartTV(PTreeSet& ps, int ks, double eps, int mansize);

  // must be implemented when inheriting from PredictionModel;
  // PredictionModel is the model interface for DataMIME
  virtual Item* predict(const Tuple& t) throw (FailException);
  virtual vector<pair<Item*, double> > vote_histogram(const Tuple& t)
      throw (FailException);
  virtual void setPTreeSet(const PTreeSet& pset);
  virtual void setClassLabel(const string& cl) throw (UnknownAttribute);
  virtual string getClassLabel() const;
  vector<string> getClassDomain();
  void setPTreeSetID(const string& ptree_set_id) {ptreeSetID = ptree_set_id;}
  void getTVs() throw (FailException);

 private:
  int k;
  const int MS;
  double epsilon;
  double lenAminusMean;
  int classType;
  int numdimension;
  string classLabel;
  size_t classIndex;
  string ptreeSetID;
  vector<double> mean;
  vector<size_type> partition;
  vector<pair<Item*, double> > votes;
  PTreeSet& ps;
  PTreeSet pstv;
  PTreeInfo pi;
  PTreeInfo pitv;

  Item* winner();
  double getMean(int i);
  double f(const vector<double>& a);
  void vote(const Heap nearestNeighbors[], int k) throw (FailException);
  vector<double> tupleToVector(const Tuple& t);
  double L2(const vector<double>& x, const vector<double>& a);
  void createHeap(Heap *heap, size_t i, size_t newidx, double newval);
  void adjustHeap(Heap *heap, size_t pos, size_t heapsize);
  BasicPt getTVContourMask(const double& fb, const double& fc);
  BasicPt getXiCandidateMask(const int& i, const double& val,
                             const double& eps);
  void generatePTreeTV() throw (FailException);
  void createTVMetadata(const double& max) throw (FailException);
  vector<double> vectorDifferent(const vector<double>& x,
                                 const vector<double>& y);
  vector<vector<double> > getRingVectors(double& eps,
                                         const vector<double>& aMinusMean,
                                         const vector<double>& a);
};
#endif
A.3. SmartTV Class
/***************************************
 * Program: SmartTV.cpp
 * Author : Taufik Abidin
 *          DataSURG Research Group at CS NDSU
 ***************************************/
#include "SmartTV.h"
#include "Util.h"
#include "SingCatAttributeInfo.h"
#include "UsignIntAttributeInfo.h"
#include "SignIntAttributeInfo.h"
#include "UsignDoubleAttributeInfo.h"
#include "SignDoubleAttributeInfo.h"
#include <PTreeInfo.h>
#include <boost/dynamic_bitset.hpp>
#include <iostream>
#include <fstream>
#include <time.h>
#include <algorithm>

/**
 * Constructor
 */
SmartTV::SmartTV(PTreeSet& ps, int ks, double eps, int mansize):
    ps(ps), k(ks), MS(mansize){
  pi = ps.getPTreeInfo();
  numdimension = pi.degree()-1;
  size_type row = 0;
  partition.resize(pi.cardinality());
  for(vector<size_type>::iterator it = partition.begin();
      it!=partition.end(); it++, row++)
    *it = row;

  epsilon = (eps < 0 ? 0.01 : eps);
  clock_t starttime, endtime;
  starttime = clock();
  for(int i=0; i<numdimension; i++)   // get vector mean
    mean.push_back(getMean(i));
  endtime = clock();
  cout<<"Done computing mean..., "
      <<((double)(endtime-starttime))/CLOCKS_PER_SEC<<" secs"<<endl;
}
/**
 * Set PTree set
 */
void SmartTV::setPTreeSet(const PTreeSet& pset){
  ps = pset;
}
/**
 * Get class domains
 */
vector<string> SmartTV::getClassDomain(){
  vector<string> classDomain;
  if(classType==Type::SING_CAT){
    SingCatAttributeInfo *sInfo =
        dynamic_cast<SingCatAttributeInfo*>(&(pi.getAttributeInfo(classLabel)));
    classDomain = sInfo->getDomain();
  }
  else
    cout<<"not a supported class label in getClassDomain"<<endl;
  return classDomain;
}
/**
 * Get the total variations of the training objects
 */
void SmartTV::getTVs() throw (FailException){
  try{
    clock_t starttime, endtime;
    string tv_ptree_set_id = ptreeSetID + "/ptrees_hdtv";
    DIR *dir = opendir(tv_ptree_set_id.c_str());
    if(!dir){
      cout<<"Computing HDTVs..."<<endl;
      string tvs = ptreeSetID + "/hdtvs.data";
      ofstream oftvs(tvs.c_str());
      double max = 0.0;
      starttime = clock();
      for(vector<size_type>::iterator it = partition.begin();
          it!=partition.end(); it++){
        Tuple x = ps.getTuple(*it);
        double fx = f(tupleToVector(x));
        oftvs<<fx<<endl;
        if(max < fx)
          max = fx;
      }
      endtime = clock();
      oftvs.close();
      cout<<"Done..., "<<((double)(endtime-starttime))/CLOCKS_PER_SEC
          <<" seconds"<<endl;
      createTVMetadata(max);
      generatePTreeTV();
    }
    else{
      starttime = clock();
      pstv.load(tv_ptree_set_id);
      endtime = clock();
      cout<<"Done..., "<<((double)(endtime-starttime))/CLOCKS_PER_SEC
          <<" seconds"<<endl;
    }
    pitv = pstv.getPTreeInfo();
  }
  catch(const exception& e){
    throw FailException(e.what());
  }
}
/**
 * Create metadata for TV PTrees
 */
void SmartTV::createTVMetadata(const double& max) throw (FailException){
  try{
    string metafile = ptreeSetID + "/metahdtv.xml";
    ofstream tvmeta(metafile.c_str());
    tvmeta<<"<?xml version=\"1.0\" encoding=\"UTF-8\"?>"<<endl;
    tvmeta<<"<datasetinfo>"<<endl;
    tvmeta<<"  <description>"<<endl;
    tvmeta<<"    <title>"<<endl;
    tvmeta<<"      <line>Metadata TVs</line>"<<endl;
    tvmeta<<"    </title>"<<endl;
    tvmeta<<"  </description>"<<endl;
    tvmeta<<"  <cardinality>"<<pi.cardinality()<<"</cardinality>"<<endl;
    tvmeta<<"  <delimiter>comma</delimiter>"<<endl;
    tvmeta<<"  <data_file name=\"hdtvs.data\">"<<endl;
    tvmeta<<"    <attribute>"<<endl;
    tvmeta<<"      <name>tv</name>"<<endl;
    tvmeta<<"      <type>double</type>"<<endl;
    tvmeta<<"      <domain>"<<endl;
    tvmeta<<"        <lower>0</lower>"<<endl;
    tvmeta<<"        <upper>"<<max<<"</upper>"<<endl;
    tvmeta<<"      </domain>"<<endl;
    tvmeta<<"      <precision>3</precision>"<<endl;
    tvmeta<<"    </attribute>"<<endl;
    tvmeta<<"  </data_file>"<<endl;
    tvmeta<<"</datasetinfo>"<<endl;
  }
  catch(const exception& e){
    throw FailException(e.what());
  }
}
/**
 * Generate TV PTrees
 */
void SmartTV::generatePTreeTV() throw (FailException){
  try{
    clock_t starttime, endtime;
    string tv_ptree_set_id = ptreeSetID + "/ptrees_hdtv";
    string metafile = ptreeSetID + "/metahdtv.xml";
    DIR *dir = opendir(tv_ptree_set_id.c_str());
    if(dir){
      cout<<"Directory already exists..."<<endl;
      closedir(dir);
      throw IdExistsException("Fail to create dir...");
    }
    else{
      starttime = clock();
      MetaFileParser mParser(metafile);
      mParser.setDataRoot(ptreeSetID + "/");
      DataFeeder *rFeeder;
      rFeeder = new RelationalDataFeeder(mParser);
      cout<<"Feed TV Ptree set..."<<endl;
      pstv.feed(rFeeder);
      cout<<"Storing TV Ptree set..."<<endl;
      pstv.store(tv_ptree_set_id);
      endtime = clock();
      double gentime = ((double) abs(endtime - starttime)) / CLOCKS_PER_SEC;
      cout<<"Generating TV Ptrees, done... "<<gentime<<" seconds"<<endl;
      delete rFeeder;
    }
  }
  catch(const exception& e){
    throw FailException(e.what());
  }
}
/**
 * f(a), the total variation function
 */
double SmartTV::f(const vector<double>& a){
  double len = 0;
  for(int i=0; i<numdimension; i++)
    len += pow((a[i] - mean[i]),2);
  return log(pi.cardinality() * len + 1);
}
/**
 * Compute the vector difference
 */
vector<double> SmartTV::vectorDifferent(const vector<double>& x,
                                        const vector<double>& y){
  double sum = 0;
  vector<double> z(numdimension);
  for(int i=0; i<numdimension; i++){
    z[i] = x[i]-y[i];
    sum += z[i]*z[i];
  }
  lenAminusMean = sqrt(sum);
  return z;
}
/**
 * Get vectors at the lower ring and upper ring of a
 */
vector<vector<double> > SmartTV::getRingVectors(double& eps,
    const vector<double>& aMinusMean, const vector<double>& a){
  double upper;
  vector<double> b(numdimension);
  vector<double> c(numdimension);
  vector<vector<double> > z(2,b);
  if(((pi.getAttributeInfo(0)).type()==Type::UNSIGNED_DOUBLE)||
     ((pi.getAttributeInfo(0)).type()==Type::SIGNED_DOUBLE)){
    DoubleAttributeInfo* doubleAttInfo =
        dynamic_cast<DoubleAttributeInfo*>(&(pi.getAttributeInfo(0)));
    upper = doubleAttInfo->upper();
  }
  else{
    IntAttributeInfo* intAttInfo =
        dynamic_cast<IntAttributeInfo*>(&(pi.getAttributeInfo(0)));
    upper = intAttInfo->upper();
  }
  double min = (1 - eps/lenAminusMean);
  double plus = (1 + eps/lenAminusMean);
  for(int i=0; i<numdimension; i++){
    b[i] = min * aMinusMean[i] + mean[i];
    c[i] = plus * aMinusMean[i] + mean[i];
  }
  z[0] = b;
  z[1] = c;
  return z;
}
/**
 * SMART-TV prediction
 */
Item *SmartTV::predict(const Tuple& a) throw (FailException) {
  try {
    int pos = 0;
    int heapSize;
    int paramK = k;
    votes.clear();
    int candidates = 0;
    double eps = epsilon;
    vector<size_type> contourPoints;
    vector<vector<double> > ringVectors;
    vector<double> newSample = tupleToVector(a);
    vector<double> aMinusMean = vectorDifferent(newSample, mean);
    eps = (eps > lenAminusMean ? lenAminusMean : eps);
    // Pruning by means of Dimensional Projections
    while(!candidates){
      ringVectors = getRingVectors(eps, aMinusMean, newSample);
      double fb = f(ringVectors[0]);
      double fc = f(ringVectors[1]);
      BasicPt pn = getTVContourMask(fb, fc);
      contourPoints = pstv.getAllTupleIndices(pn);
      int candSize = contourPoints.size();
      cout<<"TV candidates: "<<candSize<<endl;
      if(candSize>MS){
        pos = 0;
        vector<pair<unsigned,int> > maxCand;
        pn = pn(ps.createDerivedPTree(contourPoints));
        while(pos<numdimension){
          BasicPt pp = getXiCandidateMask(pos, newSample[pos], eps);
          unsigned count = ps.count(pn & pp);
          if(count!=0)
            maxCand.push_back(pair<unsigned,int>(count,pos));
          pos++;
        }
        if(maxCand.size()>0){
          sort(maxCand.begin(),maxCand.end());
          vector<pair<unsigned, int> >::reverse_iterator it = maxCand.rbegin();
          for(; it!=maxCand.rend(); it++){
            pos = (*it).second;
            BasicPt pp(getXiCandidateMask(pos, newSample[pos], eps));
            unsigned count = ps.count(pn & pp);
            if(count!=0){
              pn = pn & pp;
              if(count<MS) break;
            }
          }
        }
        contourPoints = ps.getAllTupleIndices(pn);
        cout<<"Pruned: "<<contourPoints.size()<<endl;
      }
      candidates = contourPoints.size();
      if(!candidates) eps *= 2;
    }
    if(paramK==0)
      heapSize = contourPoints.size();
    else{
      if(contourPoints.size() < paramK)
        paramK = contourPoints.size();
      heapSize = paramK;
    }
    // measure the distance
    Heap nearestNeighbors[heapSize];
    for(int i=0;i<contourPoints.size();i++){
      Tuple x = ps.getTuple(contourPoints[i]);
      double distance = L2(tupleToVector(x), newSample);
      if(i<heapSize)
        createHeap(nearestNeighbors,i,contourPoints[i],distance);
      else{
        if(distance < nearestNeighbors[0].val){
          nearestNeighbors[0].val = nearestNeighbors[heapSize-1].val;
          nearestNeighbors[0].idx = nearestNeighbors[heapSize-1].idx;
          nearestNeighbors[heapSize-1].val = distance;
          nearestNeighbors[heapSize-1].idx = contourPoints[i];
          for(int t=(heapSize/2)-1;t>=0;t--)
            adjustHeap(nearestNeighbors,t,heapSize);
        }
      }
    }
    epsilon = eps;
    vote(nearestNeighbors,heapSize);
  }
  catch(const exception& e){
    cout<<e.what()<<endl;
  }
  return winner();
}
/**
 * Get TV contour mask given fb (lower than fa) and fc (upper than fa)
 */
BasicPt SmartTV::getTVContourMask(const double& fb, const double& fc){
  BasicPt pu(pitv);
  BasicPt pl(pitv);
  double lowerBound = fb;
  double upperBound = fc;
  // make sure that the contour range is not out of bound
  DoubleAttributeInfo* doubleAttInfo =
      dynamic_cast<DoubleAttributeInfo*>(&(pitv.getAttributeInfo(0)));
  if(upperBound > doubleAttInfo->upper())
    upperBound = doubleAttInfo->upper();
  if(lowerBound < doubleAttInfo->lower())
    lowerBound = doubleAttInfo->lower();
  if(lowerBound > doubleAttInfo->upper())
    lowerBound = doubleAttInfo->upper();
  // get the mask of all values that are less than upperBound
  boost_bitset bits = doubleAttInfo->encode(new UsignDoubleItem(upperBound));
  int nonZeroBit = 0;
  if(bits.any()){
    while(!bits[nonZeroBit]) nonZeroBit++;
  }
  bool first = true;
  for(int j=nonZeroBit; j<pitv[0].binaryLength(); j++){
    if(first){
      pu = pu(0,j);
      first = false;
      continue;
    }
    if(bits[j]==true)
      pu = pu & pu(0,j);
    else
      pu = pu | pu(0,j);
  }
  // get the mask of all values that are greater than lowerBound
  bits = doubleAttInfo->encode(new UsignDoubleItem(lowerBound));
  nonZeroBit = 0;
  if(bits.any()){
    while(!bits[nonZeroBit]) nonZeroBit++;
  }
  first = true;
  for(int j=nonZeroBit; j<pitv[0].binaryLength(); j++){
    if(first){
      pl = pl(0,j);
      first = false;
      continue;
    }
    if(bits[j]==true)
      pl = pl & pl(0,j);
    else
      pl = pl | pl(0,j);
  }
  return pl & !pu;
}
/**
 * Get neighborhood mask of the dimensional projection
 */
BasicPt SmartTV::getXiCandidateMask(const int& i, const double& val,
                                    const double& eps){
  BasicPt pu(pi);
  BasicPt pl(pi);
  boost_bitset bits;
  IntAttributeInfo* intAttInfo;
  DoubleAttributeInfo* doubleAttInfo;
  double upperBound;
  double lowerBound;
  int intUpperBound;
  int intLowerBound;

  // make sure that the contour range is not out of bound
  if(((pi.getAttributeInfo(i)).type() == Type::UNSIGNED_DOUBLE) ||
     ((pi.getAttributeInfo(i)).type() == Type::SIGNED_DOUBLE)){
    doubleAttInfo =
      dynamic_cast<DoubleAttributeInfo*>(&(pi.getAttributeInfo(i)));
    upperBound = val + eps;
    lowerBound = val - eps;
    if(upperBound > doubleAttInfo->upper())
      upperBound = doubleAttInfo->upper();
    if(lowerBound < doubleAttInfo->lower())
      lowerBound = doubleAttInfo->lower();
    if(lowerBound > doubleAttInfo->upper())
      lowerBound = doubleAttInfo->upper();
  }
  else{
    intAttInfo =
      dynamic_cast<IntAttributeInfo*>(&(pi.getAttributeInfo(i)));
    intUpperBound = (int)(val + eps);
    intLowerBound = (int)(val - eps);
    if(intUpperBound > intAttInfo->upper())
      intUpperBound = intAttInfo->upper();
    if(intLowerBound < intAttInfo->lower())
      intLowerBound = intAttInfo->lower();
    if(intLowerBound > intAttInfo->upper())
      intLowerBound = intAttInfo->upper();
  }

  int nonZeroBit = 0;
  if(((pi.getAttributeInfo(i)).type() == Type::UNSIGNED_DOUBLE) ||
     ((pi.getAttributeInfo(i)).type() == Type::SIGNED_DOUBLE))
    bits = doubleAttInfo->encode(new UsignDoubleItem(upperBound));
  else
    bits = intAttInfo->encode(new UsignIntItem(intUpperBound));
  if(bits.any()){
    while(!bits[nonZeroBit]) nonZeroBit++;
  }
  bool first = true;
  for(int j = nonZeroBit; j < pi[i].binaryLength(); j++){
    if(first){
      pu = pu(i,j);
      first = false;
      continue;
    }
    if(bits[j] == true)
      pu = pu & pu(i,j);
    else
      pu = pu | pu(i,j);
  }

  nonZeroBit = 0;
  if(((pi.getAttributeInfo(i)).type() == Type::UNSIGNED_DOUBLE) ||
     ((pi.getAttributeInfo(i)).type() == Type::SIGNED_DOUBLE))
    bits = doubleAttInfo->encode(new UsignDoubleItem(lowerBound));
  else
    bits = intAttInfo->encode(new UsignIntItem(intLowerBound));
  if(bits.any()){
    while(!bits[nonZeroBit]) nonZeroBit++;
  }
  first = true;
  for(int j = nonZeroBit; j < pi[i].binaryLength(); j++){
    if(first){
      pl = pl(i,j);
      first = false;
      continue;
    }
    if(bits[j] == true)
      pl = pl & pl(i,j);
    else
      pl = pl | pl(i,j);
  }
  return pl & !pu;
}
/**
 * Vote for winner
 */
void SmartTV::vote(const Heap nearestNeighbors[], int k) throw (FailException){
  try{
    if(classType == Type::SING_CAT){
      SingCatAttributeInfo *sInfo =
        dynamic_cast<SingCatAttributeInfo*>(&(pi.getAttributeInfo(classLabel)));
      vector<string> classDom = sInfo->getDomain();
      double vote[classDom.size()];
      for(int i = 0; i < classDom.size(); i++)
        vote[i] = 0.0;
      for(int i = 0; i < k; i++){
        Tuple closestNeighbor = ps.getTuple(nearestNeighbors[i].idx);
        SingCatItem *vclosestNeighbor =
          dynamic_cast<SingCatItem*>(closestNeighbor.get(classIndex));
        for(int j = 0; j < classDom.size(); j++){
          if(classDom[j] == vclosestNeighbor->value()){
            vote[j] += exp(-1 * (nearestNeighbors[i].val) *
                                (nearestNeighbors[i].val));
            break;
          }
        }
      }
      Item *classDomItem;
      for(int i = 0; i < classDom.size(); i++){
        classDomItem = new SingCatItem(classDom[i]);
        votes.push_back(pair<Item *, double>(classDomItem, vote[i]));
      }
    }
    else
      throw FailException("not supported type in voting");
  }
  catch(const exception& e){
    throw FailException(e.what());
  }
}
/**
 * Get the winner
 */
Item* SmartTV::winner(){
  pair<Item*, double> max;
  vector<pair<Item*, double> >::const_iterator it = votes.begin();
  for( ; it != votes.end(); ++it){
    if((*it).second > max.second)
      max = *it;
  }
  return max.first;
}
/**
 * Calculate the vote histogram and return it
 */
vector<pair<Item*, double> > SmartTV::vote_histogram(const Tuple& t)
  throw (FailException){
  try{
    predict(t);
  }
  catch(const exception& e){
    throw FailException(e.what());
  }
  return votes;
}
/**
 * Convert a tuple to a vector
 */
vector<double> SmartTV::tupleToVector(const Tuple& t){
  vector<boost_bitset> bt = pi.encodeS(t);
  vector<double> tupleVector(pi.degree()-1); // exclude class label
  for(int i = 0; i < numdimension; i++){
    double tval = 0;
    for(int j = pi[i].binaryLength()-1; j >= 0; j--)
      tval += pow(2.0,j) * bt[i][j];
    if(((pi.getAttributeInfo(i)).type() == Type::UNSIGNED_DOUBLE) ||
       ((pi.getAttributeInfo(i)).type() == Type::SIGNED_DOUBLE)){
      DoubleAttributeInfo *doubleAttInfo =
        dynamic_cast<DoubleAttributeInfo*>(&(pi.getAttributeInfo(i)));
      for(int k = 0; k < doubleAttInfo->precision(); k++)
        tval *= 0.1;
    }
    tupleVector[i] = tval;
  }
  return tupleVector;
}
/**
 * Set a class label
 */
void SmartTV::setClassLabel(const string& cl) throw (UnknownAttribute){
  classLabel = cl;
  classIndex = pi.getAttributeIndex(cl);
  classType = (pi.getAttributeInfo(cl)).type();
}
/**
 * Get a class label
 */
string SmartTV::getClassLabel() const{
  return classLabel;
}
/**
 * Compute Euclidean distance L2(x,a) of two vectors: x and a
 */
double SmartTV::L2(const vector<double>& x, const vector<double>& a){
  double sum = 0.0;
  for(int i = 0; i < numdimension; i++)
    sum += pow((x[i] - a[i]), 2);
  return sqrt(sum);
}
/**
 * Create a maximum heap: the maximum value is in the root of the heap
 * and the heap contains the smallest values.
 */
void SmartTV::createHeap(Heap *heap, size_t i, size_t newidx, double newval){
  heap[i].val = newval;
  heap[i].idx = newidx;
  while((i > 0) && (heap[(i-1)/2].val < newval)){
    heap[i].val = heap[(i-1)/2].val;
    heap[i].idx = heap[(i-1)/2].idx;
    i = (i-1)/2;
  }
  heap[i].val = newval;
  heap[i].idx = newidx;
}
/**
 * Adjust the maximum heap
 */
void SmartTV::adjustHeap(Heap *heap, size_t pos, size_t heapsize){
  double val = heap[pos].val;
  size_t idx = heap[pos].idx;
  int i = 2*(pos+1)-1;
  while(i <= heapsize-1){
    if((i < heapsize-1) && (heap[i].val < heap[i+1].val)){
      i++;
    }
    if(val >= heap[i].val) break;
    heap[(i-1)/2].val = heap[i].val;
    heap[(i-1)/2].idx = heap[i].idx;
    i = 2*(i+1)-1;
  }
  heap[(i-1)/2].val = val;
  heap[(i-1)/2].idx = idx;
}
/**
 * Compute vector mean
 */
double SmartTV::getMean(int i){
  BasicPt p(pi);
  double sum = 0.0;
  int attType = (pi.getAttributeInfo(i)).type();
  if((attType == Type::SIGNED_DOUBLE) || (attType == Type::SIGNED_INT)){
    for(int j = pi[i].binaryLength()-2; j >= 0; j--){
      sum += pow(2.0,j) * ps.count(p(i,j) & !p(i,pi[i].binaryLength()-1))
           - pow(2.0,j) * ps.count(p(i,j) & p(i,pi[i].binaryLength()-1));
    }
  }
  else{
    for(int j = pi[i].binaryLength()-1; j >= 0; j--)
      sum += pow(2.0,j) * ps.count(p(i,j));
  }
  if((attType == Type::SIGNED_DOUBLE) || (attType == Type::UNSIGNED_DOUBLE)){
    DoubleAttributeInfo *doubleAttInfo =
      dynamic_cast<DoubleAttributeInfo*>(&(pi.getAttributeInfo(i)));
    for(int t = 0; t < doubleAttInfo->precision(); t++)
      sum = sum * 0.1;
  }
  return sum / pi.cardinality();
}
A.4. Makefile
#
# Makefile
# Author: Taufik Abidin
# DataSURG Research Group at CS NDSU
#

PTREE_LIB  = $(PTREE_HOME)/lib/ptreeAPI2.a
PTREE_INC  = $(PTREE_HOME)/include
BOOST_HOME = $(HOME)/usr/boost_1_29_0
XML_FLAGS  = `xml2-config --cflags`
XML_LIBS   = `xml2-config --libs`

CC      = c++
C_FLAGS = -g -c

SMART_TV = SmartTV.o

all: SmartTVApp

#
# SMART-TV object
#
SmartTV.o: SmartTV.cpp SmartTV.h
	$(CC) -pg $(C_FLAGS) $*.cpp -o $@ -I$(PTREE_INC) -I$(BOOST_HOME) $(XML_FLAGS)

#
# SMART-TV Application
#
SmartTVApp: clean $(SMART_TV)
	$(CC) -g -o $@ $@.cpp $(SMART_TV) $(PTREE_LIB) -I$(PTREE_INC) -I$(BOOST_HOME) $(XML_FLAGS) $(XML_LIBS)

#
# Clean
#
clean:
	@rm -f SmartTV.o SmartTVApp.o SmartTVApp
	@rm -f *~