
JOURNAL OF COMPUTER SCIENCE AND ENGINEERING, ISSN 2043-9091, VOLUME 14, ISSUE 1, JULY 2012

A New Decision Tree Learning Approach for Novel Class Detection in Concept Drifting Data Stream Classification

Amit Biswas, Dewan Md. Farid and Chowdhury Mofizur Rahman

Abstract—Novel class detection in concept drifting data stream classification addresses learning problems in which the data distribution changes over time, as in weather prediction, economic modeling, astronomical analysis, and intrusion detection. A novel class arrives in a concept-drifting data stream when new data introduce new concept classes or remove old ones. Existing data mining classifiers cannot detect and classify novel class instances until the classifiers are trained with labeled instances of the novel class. In this paper, we propose a new approach for detecting novel classes in concept drifting data stream classification using a decision tree classifier that can determine whether a new data instance belongs to a novel class. The proposed approach builds a decision tree from training data points and continuously updates the tree with recent data points so that it represents the most recent concept in the data stream. Experimental analysis on benchmark datasets from the UCI machine learning repository shows that the proposed approach can detect novel classes in concept drifting data stream classification problems.

Index Terms—Concept Drifting, Data Stream Classification, Decision Tree, Novel Class.


1 INTRODUCTION

Data stream classification is the process of extracting knowledge and information from continuous data instances. A data stream is an ordered sequence of data points that includes attribute values and class values. The goal of a data mining classifier is to predict the class value of a new or unseen instance whose attribute values are known but whose class value is unknown. Existing data mining classifiers (or classification models) are trained on instances of a dataset with a fixed number of class values, but in real-world data stream classification problems a new data instance with a new class value may appear, and the classification model misclassifies the new instance. Most existing data mining classifiers cannot detect and classify novel class instances until the classifiers are trained with labeled instances of the novel class. In real-life data stream mining problems the data distributions change over time, as in weather prediction, astronomical analysis, and intrusion detection.

Novel class detection in concept drifting data stream mining causes problems because classification models become less accurate as time passes. Concept drift means that the statistical properties of the target class, which the data mining classifiers are trying to classify, change over time in unforeseen ways. Novel class detection in concept drifting data stream classification refers to a change in the data stream when the underlying concept of the data changes over time. Recently, research on novel class detection in concept drifting data stream classification has received much attention from intelligent computational researchers [1], [2], [3]. A data mining classifier should update continuously so that it reflects the most recent concept in the data stream. Data stream classifiers are divided into two categories: single model and ensemble model. A single model incrementally updates a single classifier and effectively responds to concept drift [9], [13]. On the other hand, an ensemble model uses a combination of classifiers, combining a series of classifiers with the aim of creating an improved composite model, and also handles concept drift efficiently [1], [5], [10], [12].

In this paper, we provide a solution for handling the novel class detection problem using a decision tree. Our approach builds a decision tree from the data stream and continuously updates it with new data points so that the latest tree represents the most recent concept in the data stream. We calculate a threshold value for each leaf node based on the ratio of the percentage of data points between that leaf node and the training dataset, and we also cluster the data points of the training dataset based on the similarity of attribute values. If the number of data points classified by a leaf node of the tree exceeds the previously calculated threshold value, a novel class may have arrived. We then compare the new data point with existing data points based on the similarity of attribute values. If the attribute values of the new data point differ from the existing data points and the new data point does not belong to any cluster, a novel class is confirmed. We then add the new data point to the training dataset and rebuild the decision tree.

————————————————
• Amit Biswas is with the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.
• Dewan Md. Farid is with the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.
• Chowdhury Mofizur Rahman is with the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.


© 2012 JCSE www.Journalcse.co.uk


We organize this paper as follows. Section 2 discusses related work. Section 3 provides an overview of learning algorithms. Our approach is introduced in section 4. Section 5 discusses the datasets and experimental analysis. Finally, conclusions and future work are drawn in section 6.

2 RELATED WORK

Novelty detection and data stream classification, where data distributions inherently change over time, have received much attention from intelligent computational researchers because of many practical real-world applications, such as spam filtering, climate change analysis, and intrusion detection. In 2011, Masud et al. proposed a novelty detection and data stream classification technique, which integrates a novel class detection mechanism into traditional mining classifiers, enabling automatic detection of novel classes before the true labels of the novel class instances arrive [1]. In order to determine whether an instance belongs to a novel class, the classification model sometimes needs to wait for more test instances to discover similarities among those instances. In the same year, R. Elwell and R. Polikar introduced an ensemble-of-classifiers-based approach named Learn++.NSE for incremental learning of concept drift, characterized by nonstationary environments [2]. Learn++.NSE trains one new classifier for each batch of data it receives and combines these classifiers using dynamically weighted majority voting. The novelty of the approach is in determining the voting weights based on each classifier's time-adjusted accuracy on current and past environments.

In 2007, Kolter and Maloof proposed an ensemble approach for concept drifting data stream classification that dynamically creates and removes weighted experts in response to changes in performance using dynamic weighted majority (DWM) [5]. It trains online learners in the ensemble and adds or removes experts based on the global performance of the ensemble. In 2006, Gaber and Yu [8] proposed a novel class detection approach termed STREAM-DETECT to identify changes in data streams, which detects such changes by measuring online clustering result deviation over time. In 2005, Yang et al. [9] proposed an approach that incorporates proactive and reactive predictions. In a proactive mode, it anticipates what the new concept will be if a future concept change takes place and prepares prediction strategies in advance. If the anticipation turns out to be correct, a proper prediction model can be launched instantly upon the concept change. If not, it promptly resorts to a reactive mode, adapting a prediction model to the new data. Widmer and Kubat presented a single classifier named FLORA, which uses a sliding window to choose a block of new instances to train a new classifier [14]. FLORA has a built-in forgetting mechanism with the implicit assumption that instances falling outside the window are no longer relevant and the information carried by them can be forgotten.

3 LEARNING ALGORITHMS

Data mining is the process of finding hidden information and patterns in a huge database. Data mining algorithms have two major functions: classification and clustering. Classification maps data into predefined groups or classes. It is often referred to as supervised learning because the classes are determined before examining the data. Classification creates a function from training data. On the other hand, clustering is similar to classification except that the groups are not predefined but rather defined by the data alone. It is alternatively referred to as unsupervised learning.

3.1 Decision Tree Learning
Decision tree (DT) learning is a very popular mining tool for classification and prediction. It is easy to implement and requires little prior knowledge. A DT can be built from a large dataset with many attributes. In a DT, the successive division of the set of training instances proceeds until each subset consists of instances of a single class. There are three main components in a DT: nodes, leaves, and edges. Each node is labeled with an attribute by which the data is to be partitioned. Each node has a number of edges, which are labeled according to the possible values of the attribute. An edge connects either two nodes or a node and a leaf. Leaves are labeled with a decision value for categorization of the data. To make a decision using a DT, start at the root node and follow the tree down the branches until a leaf node representing the class is reached. Each DT represents a rule set, which categorizes data according to the attributes of the dataset.
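To make the traversal concrete, the following minimal Java sketch models a DT node over nominal attributes; the class and member names (DTNode, splitAttribute, classify) are our own illustrative choices rather than part of any particular implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of a decision tree node over nominal attributes. */
class DTNode {
    String splitAttribute;                           // attribute tested here; null at a leaf
    Map<String, DTNode> children = new HashMap<>();  // edge label (attribute value) -> subtree
    String classLabel;                               // decision value, set only at leaves

    boolean isLeaf() { return splitAttribute == null; }

    /** Walk from this node down the matching edges until a leaf's class label is reached. */
    String classify(Map<String, String> instance) {
        if (isLeaf()) return classLabel;
        DTNode child = children.get(instance.get(splitAttribute));
        return (child == null) ? "unknown" : child.classify(instance);  // no edge for this value
    }
}
```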

The ID3 (Iterative Dichotomiser) technique builds a DT using information theory [16]. The basic strategy used by ID3 is to choose the splitting attribute with the highest information gain. The amount of information associated with an attribute value is related to its probability of occurrence. The concept used to quantify information is called entropy, which measures the amount of randomness in a dataset. When all data in a set belong to a single class there is no uncertainty, and the entropy is zero. The objective of decision tree classification is to iteratively partition the given dataset into subsets where all elements in each final subset belong to the same class. The entropy calculation is shown in equation (1). Given probabilities $p_1, p_2, \ldots, p_s$, where $\sum_{i=1}^{s} p_i = 1$,

$$H(p_1, p_2, \ldots, p_s) = \sum_{i=1}^{s} p_i \log\left(\frac{1}{p_i}\right) \qquad (1)$$

Given a dataset D, H(D) measures the amount of disorder in the dataset. When D is split into s new subsets $S = \{D_1, D_2, \ldots, D_s\}$, we can again look at the entropy of those subsets. A subset of the dataset is completely ordered if all examples in it belong to the same class. ID3 chooses the splitting attribute with the highest gain, which it calculates by equation (2):

$$Gain(D, S) = H(D) - \sum_{i=1}^{s} P(D_i)\, H(D_i) \qquad (2)$$
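The two calculations reduce to a few lines of code. A minimal Java sketch follows, computing entropy and gain from raw class counts; it assumes log base 2, a common (though not the only) choice of logarithm.

```java
/** Sketch of the ID3 splitting criteria from equations (1) and (2). */
class Id3Criteria {
    /** H(p1,...,ps) = sum of pi * log(1/pi), computed from raw class counts. */
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;                        // a zero count contributes nothing
            double p = (double) c / total;
            h += p * (Math.log(1.0 / p) / Math.log(2.0));
        }
        return h;
    }

    /** Gain(D,S) = H(D) - sum of P(Di) * H(Di), where each row of subsetCounts is one Di. */
    static double gain(int[] parentCounts, int[][] subsetCounts) {
        int total = 0;
        for (int c : parentCounts) total += c;
        double expected = 0.0;
        for (int[] di : subsetCounts) {
            int size = 0;
            for (int c : di) size += c;
            expected += ((double) size / total) * entropy(di);
        }
        return entropy(parentCounts) - expected;
    }
}
```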

C4.5 is a successor of ID3 that splits using GainRatio [15].



For splitting purposes, C4.5 uses the attribute with the largest GainRatio, which ensures a larger-than-average information gain:

$$GainRatio(D, S) = \frac{Gain(D, S)}{H\left(\frac{|D_1|}{|D|}, \ldots, \frac{|D_s|}{|D|}\right)} \qquad (3)$$

The C5.0 algorithm improves the performance of building trees using boosting, which is an approach to combining different classifiers, although boosting does not always help when the training data contains a lot of noise. When C5.0 performs a classification, each classifier is assigned a vote, voting is performed, and the example is assigned to the class with the most votes. CART (Classification and Regression Trees) generates a binary tree for decision making [17]. CART handles missing data and contains a pruning strategy. The SPRINT (Scalable Parallelizable Induction of Decision Tree) algorithm uses an impurity function called the gini index to find the best split [18]. Equation (4) defines the gini index for a dataset D:

$$gini(D) = 1 - \sum_j p_j^2 \qquad (4)$$

where $p_j$ is the relative frequency of class $C_j$ in D. The goodness of a split of D into subsets $D_1$ and $D_2$ is defined by equation (5):

$$gini_{split}(D) = \frac{n_1}{n}\, gini(D_1) + \frac{n_2}{n}\, gini(D_2) \qquad (5)$$

The split with the best gini value is chosen. A number of research projects on optimal feature selection and classification have been carried out that adopt a hybrid strategy involving evolutionary algorithms and inductive decision tree learning [19], [20], [21], [22], [23].

3.2 Clustering
Clustering can be considered the most important unsupervised learning problem and has been used in many real-world application domains, including biology, medicine, anthropology, and marketing. It is the process of organizing objects into groups whose members are similar in some way. A data point within one cluster is more similar to data points within that cluster than to data points outside it. A cluster is therefore a collection of objects which are "similar" to each other and "dissimilar" to the objects belonging to other clusters. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. Given a dataset $D = \{t_1, t_2, \ldots, t_n\}$ of data points, a similarity measure $sim(t_i, t_l)$ defined between any two data points $t_i, t_l \in D$, and an integer value k, the clustering problem is to define a mapping $f: D \rightarrow \{1, \ldots, k\}$ where each $t_i$ is assigned to one cluster $K_j$, $1 \le j \le k$. Clustering algorithms can be categorized based on their cluster model, such as k-means clustering, distribution-based clustering, and density-based clustering.
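As a small illustration of the mapping f, the Java sketch below assigns each point to the most similar of k cluster representatives. The attribute-match similarity is an illustrative stand-in, since no particular sim(ti, tl) is fixed here.

```java
/** Sketch of assigning points to one of k clusters under a nominal-attribute similarity. */
class NominalClustering {
    /** sim(ti, tl) taken as the fraction of attribute positions with matching values. */
    static double sim(String[] a, String[] b) {
        int matches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i].equals(b[i])) matches++;
        }
        return (double) matches / a.length;
    }

    /** The mapping f: D -> {1,...,k}, realized as the index of the most similar representative. */
    static int assign(String[] point, String[][] representatives) {
        int best = 0;
        double bestSim = -1.0;
        for (int j = 0; j < representatives.length; j++) {
            double s = sim(point, representatives[j]);
            if (s > bestSim) { bestSim = s; best = j; }
        }
        return best;
    }
}
```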

4 PROPOSED APPROACH

The data stream is a continuous sequence of data points $\{x_1, x_2, \ldots, x_{now}\}$, where $x_1$ is the very first data point in the stream and $x_{now}$ is the latest data point, which has just arrived. Each data point $x_i$ is an n-dimensional feature vector consisting of attributes $A = \{A_1, A_2, \ldots, A_n\}$ together with a class label from $C = \{C_1, C_2, \ldots, C_m\}$, and each attribute $A_i$ takes one of a number of attribute values $\{A_{i1}, A_{i2}, \ldots, A_{ip}\}$. Algorithm 1 outlines our approach. We build a decision tree from the training data points, calculate a threshold value for each leaf node based on the ratio of the percentage of data points between that leaf node and the training data points, and cluster the training data points based on the similarity of attribute values. When classifying the continuous data stream in real time, if the number of data points classified by a leaf node of the tree exceeds the previously calculated threshold value, a novel class may have arrived. We then compare the new data point with existing data points based on the similarity of attribute values. If the attribute values of the new data point differ from the existing data points and the new data point does not belong to any cluster, a novel class is confirmed. We then add the new data point to the training dataset and rebuild the decision tree. The decision tree classifier continuously updates so that it represents the most recent concept in the data stream.

Algorithm 1: Novel Class Detection using Decision Tree

1. Find the best splitting attribute, the one with the highest information gain value, in the training dataset.
2. Create a node and label it with the splitting attribute. [The first node is the root node, T, of the decision tree.]
3. For each branch of the node, partition the data points and grow sub training datasets Di by applying the splitting predicate to the training dataset D.
4. For each sub training dataset Di: if the data points in Di are all of the same class value Ci, label the leaf node with Ci; else repeat steps 1 to 4 until each final subset belongs to the same class value or a leaf node is created.
5. When the decision tree construction is complete, calculate the threshold value for each leaf node in the tree based on the ratio of the percentage of data points between that leaf node and the data points in the training dataset.
6. Cluster the training data points based on the similarity of attribute values.
7. When classifying the continuous data stream in real time, if the number of data points classified by a leaf node of the decision tree exceeds the threshold value calculated before, a novel class may have arrived.
8. If the attribute values of the new data point differ from the existing data points of that leaf node, and the new data point does not belong to any existing cluster, a novel class is confirmed.
9. If a novel class is detected, add the new data point to the existing training data points to generate a new training dataset, Dnew.
10. Rebuild the decision tree using the new/updated training dataset, Dnew.
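A compressed Java sketch of the run-time checks in steps 7 and 8 follows. The per-leaf bookkeeping, the 0.5 similarity cutoff, and the string leaf identifiers are simplified stand-ins for the procedure above, not a full implementation of it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of steps 7-8: flag a possible novel class when a leaf classifies a larger
 *  share of the stream than its training-time threshold allows, then confirm by
 *  dissimilarity to the leaf's training points and to every existing cluster. */
class NovelClassDetector {
    final Map<String, Double> leafThreshold = new HashMap<>(); // per-leaf share from step 5
    final Map<String, Integer> leafHits = new HashMap<>();     // stream points routed per leaf
    int streamCount = 0;

    boolean isNovel(String leafId, String[] point,
                    List<String[]> leafTrainingPoints, String[][] clusterReps) {
        streamCount++;
        int hits = leafHits.merge(leafId, 1, Integer::sum);
        double share = (double) hits / streamCount;
        if (share <= leafThreshold.getOrDefault(leafId, 1.0))
            return false;                            // step 7: leaf still within its threshold
        for (String[] old : leafTrainingPoints)      // step 8: dissimilar to existing points?
            if (sim(point, old) > 0.5) return false; // 0.5 is an illustrative cutoff
        for (String[] rep : clusterReps)             // ...and outside every existing cluster?
            if (sim(point, rep) > 0.5) return false;
        return true;  // novel class confirmed: caller adds the point and rebuilds the tree
    }

    static double sim(String[] a, String[] b) {      // fraction of matching attribute values
        int m = 0;
        for (int i = 0; i < a.length; i++) if (a[i].equals(b[i])) m++;
        return (double) m / a.length;
    }
}
```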



5 EXPERIMENTAL ANALYSIS

In this section, we describe the datasets and the experimental results.

5.1 Datasets
Data stream mining is the process of analyzing online data to discover patterns; it uses sophisticated mathematical algorithms to segment the continuous data and evaluate the probability of future events. A set of data items is called a dataset, which is the basic unit of data mining and machine learning research. A dataset is roughly equivalent to a two-dimensional spreadsheet or database table. Table 1 describes the datasets from the UCI machine learning repository that are used in the experimental analysis [26].

1. Iris Plants Database: This is one of the best known datasets in the pattern recognition literature. It contains 3 class values (Iris Setosa, Iris Versicolor, and Iris Virginica), where each class refers to a type of iris plant. There are 150 instances (50 in each of the three classes) and 4 attributes in this dataset. One class is linearly separable from the other two.

2. Image Segmentation Data: The goal of this dataset is to provide an empirical basis for research on image segmentation and boundary detection. There are 1500 data instances in this dataset with 19 attributes, all of which are real-valued. There are 7 class attribute values: brickface, sky, foliage, cement, window, path, and grass.

3. Large Soybean Database: There are 35 attributes in this dataset, all nominal. There are 683 data instances and 19 class values in this dataset.

4. Fitting Contact Lenses Database: This is a very small dataset with only 24 data instances, 4 attributes, and 3 class attribute values (soft, hard, and none). All attribute values are nominal. The instances are complete and noise free, and 9 rules cover the training set.

5. NSL-KDD Dataset: The Knowledge Discovery and Data Mining 1999 (KDD99) competition data contains simulated intrusions in a military network environment. It is often used as a benchmark to evaluate the handling of concept drift. The NSL-KDD dataset is a new version of the KDD99 dataset, which solved some of the inherent problems of the KDD99 dataset [25], although it still suffers from some of the problems discussed by McHugh [24]. The main advantage of the NSL-KDD dataset is that the numbers of training and testing data points are reasonable, so it becomes affordable to run the experiments on the complete set of training and testing data without the need to randomly select a small portion of the dataset. Each record in the NSL-KDD dataset consists of 41 attributes and 1 class attribute. The NSL-KDD dataset does not include redundant and duplicate examples in the training dataset.

TABLE 1
Data Set Descriptions

Dataset                         | No. of Attributes | Attribute Types | No. of Instances | No. of Classes
Iris Plants Database            | 4                 | Real            | 150              | 3
Image Segmentation Data         | 19                | Real            | 1500             | 7
Large Soybean Database          | 35                | Nominal         | 683              | 19
Fitting Contact Lenses Database | 4                 | Nominal         | 24               | 3
NSL-KDD Dataset                 | 41                | Real & Nominal  | 25192            | 23

5.2 Results
We implemented our algorithm in Java. The code for the decision tree has been adapted from the Weka machine learning open source repository (http://www.cs.waikato.ac.nz/ml/weka). Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. The experiments were run on an Intel Core 2 Duo 2.0 GHz processor (2 MB cache, 800 MHz FSB) with 1 GB of RAM. There are various approaches to determining the performance of data stream classifiers. The performance can most simply be measured by counting the proportion of correctly classified instances in an unseen test dataset. Table 2 summarizes the symbols and terms used in equations (6) to (8).

TABLE 2
Used Symbols and Terms

Symbol | Term
N      | Total instances in the data stream
Nc     | Total novel class instances in the data stream
Fp     | Total existing class instances misclassified as novel class
Fn     | Total novel class instances misclassified as existing class
Fe     | Total existing class instances misclassified
Mnew   | % of novel class instances misclassified as existing class
Fnew   | % of existing class instances falsely identified as novel class
ERR    | Total misclassification error

$$M_{new} = \frac{F_n \times 100}{N_c} \qquad (6)$$

$$F_{new} = \frac{F_p \times 100}{N - N_c} \qquad (7)$$

$$ERR = \frac{(F_p + F_n + F_e) \times 100}{N} \qquad (8)$$

Equations (6), (7), and (8) are used to evaluate our approach.
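Expressed as code, the three measures reduce to a few counters; a minimal Java sketch with field names mirroring Table 2:

```java
/** The evaluation measures of equations (6)-(8); field names mirror Table 2. */
class StreamMetrics {
    long n;   // N:  total instances in the data stream
    long nc;  // Nc: total novel class instances in the stream
    long fp;  // Fp: existing class instances misclassified as novel
    long fn;  // Fn: novel class instances misclassified as existing
    long fe;  // Fe: existing class instances otherwise misclassified

    double mNew() { return fn * 100.0 / nc; }             // equation (6)
    double fNew() { return fp * 100.0 / (n - nc); }       // equation (7)
    double err()  { return (fp + fn + fe) * 100.0 / n; }  // equation (8)
}
```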

Tables 3 and 4 tabulate the results of the performance comparison between our approach and the traditional decision tree classifier.

TABLE 3
Performance of Proposed Approach

Dataset                         | ERR  | Mnew | Fnew
Iris Plants Database            | 4    | 4    | 3
Image Segmentation Data         | 2.9  | 1.3  | 2.9
Large Soybean Database          | 9.2  | 2.8  | 1.9
Fitting Contact Lenses Database | 16.6 | 0    | 5.2
NSL-KDD Dataset                 | 4.0  | 8.4  | 1.2

TABLE 4
Performance of Traditional Decision Tree

Dataset                         | ERR  | Mnew | Fnew
Iris Plants Database            | 5.3  | 4    | 5
Image Segmentation Data         | 5.2  | 3.7  | 5.2
Large Soybean Database          | 10.8 | 6.5  | 2.8
Fitting Contact Lenses Database | 50   | 100  | 5.2
NSL-KDD Dataset                 | 5.3  | 10.0 | 1.5
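For reference, the sketch below shows a minimal way to build and cross-validate a C4.5-style tree with the Weka API, of the kind our decision tree code was adapted from; the ARFF file name is illustrative.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaselineTree {
    public static void main(String[] args) throws Exception {
        // Load an ARFF dataset (file name is illustrative) and mark the class attribute.
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();            // Weka's C4.5-style decision tree learner
        tree.buildClassifier(data);

        // 10-fold cross-validation as one simple way to measure accuracy.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```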

6 CONCLUSION

In this paper, we introduce a decision tree classifier based approach for novel class detection in concept drifting data stream classification, which builds a decision tree from the data stream. The decision tree continuously updates with new data points so that the most recent tree represents the most recent concept in the data stream. The main purpose of this paper is to improve the performance of the decision tree classifier in concept drifting data stream classification problems. The decision tree classifier is a very popular supervised learning algorithm with several advantages: it is easy to implement and requires little prior knowledge. We tested the performance of the proposed approach on several benchmark datasets, which showed that the proposed approach efficiently detects novel classes and improves classification accuracy. Future work will focus on addressing this problem under dynamic attribute sets.

7 APPENDIX: AN ILLUSTRATIVE EXAMPLE

In the large soybean database from the UCI machine learning repository [26], there are 35 attributes in total and all attribute values are nominal. There are 683 data points in this dataset, categorized into 19 class attribute values. We split the dataset into 3 sub-datasets: sub-dataset A contains 356 instances with 10 class attribute values, sub-dataset B contains 107 instances with 5 class attribute values, and sub-dataset C contains 220 instances with 4 class attribute values. We built a decision tree, DTA, using sub-dataset A, which is shown in figure 1.

Fig. 1. Decision Tree DTA using Sub-Dataset A. [Tree diagram omitted: only the node and leaf labels survived text extraction.]


We then classified the 356 instances of sub-dataset A by applying the decision tree DTA, which correctly classified 323 instances and misclassified 33 instances. After that, we classified the 107 instances of sub-dataset B (which contains 5 novel classes) by applying the decision tree DTA, which detected that a novel class had arrived. For example, the leaf node leafspot-size = lt-1/8 and seed = norm: bacterial-blight satisfied 20 instances from sub-dataset A and 10 instances from sub-dataset B. The other attribute values of the 10 instances from sub-dataset B are quite dissimilar to those of the 20 instances from sub-dataset A, which confirms that a novel class arrived. We then merged sub-dataset A and sub-dataset B to generate a new dataset XA+B and rebuilt the decision tree, DTX, which is shown in figure 2. Similarly, we merged dataset XA+B with sub-dataset C (which contains 220 instances with 4 novel classes) to generate a new dataset XA+B+C. Finally, we again rebuilt the decision tree, DTY, which classified the 683 instances of dataset XA+B+C with 91.5081% accuracy and the 220 instances of sub-dataset C with 98.6364% accuracy. Decision tree DTY is shown in figure 3.

Fig. 2. Decision Tree DTX using Dataset XA+B. [Tree diagram omitted: only the node and leaf labels survived text extraction.]

Fig. 3. Decision Tree DTY using Dataset XA+B+C. [Tree diagram omitted: only the node and leaf labels survived text extraction.]

ACKNOWLEDGMENT

This research work was supported by the Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.

REFERENCES

[1] M. M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham, "Classification and Novel Class Detection in Concept Drifting Data Streams under Time Constraints," IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 6, pp. 859-874, June 2011.
[2] R. Elwell and R. Polikar, "Incremental Learning of Concept Drift in Nonstationary Environments," IEEE Transactions on Neural Networks, Vol. 22, No. 10, pp. 1517-1531, October 2011.
[3] A. Zhou, F. Cao, W. Qian, and C. Jin, "Tracking Clusters in Evolving Data Streams over Sliding Windows," Knowledge and Information Systems, Vol. 15, No. 2, pp. 181-214, May 2008.
[4] E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama, "Cluster-Based Novel Concept Detection in Data Streams Applied to Intrusion Detection in Computer Networks," Proc. 2008 ACM Symp. Applied Computing, pp. 976-980, 2008.
[5] J. Z. Kolter and M. A. Maloof, "Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts," Journal of Machine Learning Research, Vol. 8, pp. 2755-2790, 2007.
[6] B. R. Dai, J. W. Huang, M. Y. Yeh, and M. S. Chen, "Adaptive Clustering for Multiple Evolving Streams," IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 9, pp. 1166-1180, September 2006.
[7] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A Framework for On-Demand Classification of Evolving Data Streams," IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 5, pp. 577-589, May 2006.



[8] M. M. Gaber and P. S. Yu, "Detection and Classification of Changes in Evolving Data Streams," Int'l Journal of Information Technology & Decision Making, Vol. 5, No. 4, pp. 659-670, 2006.
[9] Y. Yang, X. Wu, and X. Zhu, "Combining Proactive and Reactive Predictions for Data Streams," Proc. ACM SIGKDD, pp. 710-715, 2005.
[10] W. Fan, "Mining Concept Drifting Data Streams using Ensemble Classifiers," Proc. 10th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, pp. 128-137, 2004.
[11] M. Markou and S. Singh, "Novelty Detection: A Review, Part 2: Neural Network Based Approaches," Signal Processing, Vol. 83, Issue 12, pp. 2499-2521, December 2003.
[12] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining Concept Drifting Data Streams using Ensemble Classifiers," IBM T. J. Watson Research, Hawthorne, NY 10532, Association for Computing Machinery, Aug. 24, 2003.
[13] G. Hulten, L. Spencer, and P. Domingos, "Mining Time-Changing Data Streams," Proc. 7th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, pp. 97-106, 2001.
[14] G. Widmer and M. Kubat, "Learning in the Presence of Concept Drift and Hidden Contexts," Machine Learning, Vol. 23, No. 1, pp. 69-101, April 1996.
[15] J. R. Quinlan, "C4.5: Programs for Machine Learning," Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[16] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, Vol. 1, pp. 81-106, 1986.
[17] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, "Classification and Regression Trees," Statistics/Probability Series, Wadsworth, Belmont, 1984.
[18] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A Scalable Parallel Classifier for Data Mining," Morgan Kaufmann, pp. 544-555, 1996.
[19] P. Turney, "Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm," Journal of Artificial Intelligence Research, pp. 369-409, 1995.
[20] J. Bala, J. Huang, H. Vafaie, K. DeJong, and H. Wechsler, "Hybrid Learning using Genetic Algorithms and Decision Trees for Pattern Classification," Proc. 14th Int'l Conf. on Artificial Intelligence, Montreal, pp. 1-6, 19-25 August 1995.
[21] C. G. Salcedo, S. Chen, D. Whitley, and S. Smith, "Fast and Accurate Feature Selection using Hybrid Genetic Strategies," Proc. Genetic and Evolutionary Computation Conference, pp. 1-8, 1999.
[22] S. R. Safavian and D. Landgrebe, "A Survey of Decision Tree Classifier Methodology," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 21, No. 3, pp. 660-674, 1991.



[23] W. Y. Loh and X. Shih, "Split Selection Methods for Classification Trees," Statistica Sinica, Vol. 7, pp. 815-840, 1997.
[24] J. McHugh, "Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection Evaluations as Performed by Lincoln Laboratory," ACM Transactions on Information and System Security, Vol. 3, No. 4, pp. 262-294, 2000.
[25] The KDD Archive, KDD99 cup dataset, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 1999.
[26] A. Frank and A. Asuncion, "UCI Machine Learning Repository," University of California, Irvine, School of Information and Computer Sciences, 2010, http://archive.ics.uci.edu/ml

Amit Biswas is currently completing a Master of Science in Computer Science and Engineering at United International University, Bangladesh. He obtained a Bachelor of Computer Application (BCA) from Bangalore University, India in 2004. He is an IT professional working as a Team Leader of the software department in a reputed IT company named BASE Limited. He has also worked for the Access to Information Programme (A2I), Prime Minister's Office, supported by UNDP Bangladesh. He has extensive experience and knowledge of software development and databases. Some of his developed software is successfully used by PLAN Bangladesh, CARE Bangladesh, Bangladesh Small and Cottage Industries Corporation (BSCIC), Habib Bank Limited, Dutch Bangla Bank, Rahimafrooz, etc.

Dr. Dewan Md. Farid received a B.Sc. in Computer Science and Engineering from Asian University of Bangladesh in 2003, an M.Sc. in Computer Science and Engineering from United International University, Bangladesh in 2004, and a Ph.D. in Computer Science and Engineering from Jahangirnagar University, Bangladesh in 2012. He is a part-time faculty member in the Department of Computer Science and Engineering at United International University, Bangladesh and at Daffodil International University, Bangladesh. He has published 1 book chapter, 8 journal papers, and 10 conference papers on machine learning, data mining, and intrusion detection. He has presented his papers at international conferences in Malaysia, Portugal, Italy, and France. Dr. Farid is a member of the IEEE and the IEEE Computer Society. He worked as a visiting researcher at the ERIC Laboratory, University Lumière Lyon 2, France from 01-09-2009 to 30-06-2010. He received Senior Fellowships I and II, awarded by the National Science & Information and Communication Technology (NSICT), Ministry of Science & Information and Communication Technology, Government of Bangladesh, in 2008 and 2011 respectively.

Professor Dr. Chowdhury Mofizur Rahman received his B.Sc. (EEE) and M.Sc. (CSE) from Bangladesh University of Engineering and Technology (BUET) in 1989 and 1992 respectively. He earned his Ph.D. from the Tokyo Institute of Technology in 1996 under the auspices of a Japanese Government scholarship. Prof. Chowdhury is presently working as the Pro Vice Chancellor and acting treasurer of United International University (UIU), Dhaka, Bangladesh. He is also one of the founder trustees of UIU. Before joining UIU he worked as the head of the Computer Science & Engineering department of Bangladesh University of Engineering & Technology, the number one technical public university in Bangladesh. His research area covers data mining, machine learning, AI, and pattern recognition. He is active in research and has published around 100 technical papers in international journals and conferences. He was the editor of the IEB journal and worked as the moderator of NCC accredited centers in Bangladesh. He has served as organizing chair and as a program committee member for a number of international conferences held in Bangladesh and abroad. At present he is the coordinator from Bangladesh for the EU-sponsored eLINK project. Prof. Chowdhury has been working as an external expert member for the computer science departments of a number of renowned public and private universities in Bangladesh. He is actively contributing towards the national goal of converting the country towards Digital Bangladesh.