YaDT (Yet another Decision Tree builder)
Ah Young [email protected] Communication Lab.
Dept. Electronic Computer Engineering. University Of Seoul. Visual Communication Lab.
1. Introduction
• YaDT is a from-scratch, main-memory implementation of a C4.5-like
decision tree algorithm.
• The algorithms were extended in the order ID3 (entropy) → C4.5 (information gain) → C5.0.
• Unfortunately, C4.5 (and EC4.5) are implemented in old-style K&R C,
which makes the sources hard to understand, profile and extend.
• Experimental results are reported comparing YaDT with Weka, dti
and (E)C4.5.
1. Introduction - C4.5
• Issues addressed by C4.5:
① Handling continuous (numeric) attributes
② Excluding irrelevant attributes
③ How deeply to grow the decision tree
④ Handling missing attribute values
⑤ Handling attributes with different costs
⑥ Improving computational efficiency
2. Meta data representation
• Each attribute has one of the following attribute types:
discrete, continuous, weights or class.
• The values of an attribute in a case belong to one of the following
data types: integer, float, double, string (plus the special value '?' or NULL).
2. Meta data representation
• In summary, the YaDT meta data describing the training set TS can be
structured as a table with three columns: attribute name, data type and
attribute type.
• Such a table can be provided as a database table, or as a text
file.
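As a rough illustration of the text-file form of such a table, the sketch below parses one attribute per line into (attribute name, data type, attribute type). The comma-separated layout, the `parse_metadata` helper and the PlayTennis attribute names are assumptions made for this example, not YaDT's actual file format.

```python
from typing import List, NamedTuple

# Allowed values, as listed on the slides.
DATA_TYPES = {"integer", "float", "double", "string"}
ATTRIBUTE_TYPES = {"discrete", "continuous", "weights", "class"}


class AttributeMeta(NamedTuple):
    name: str
    data_type: str
    attr_type: str


def parse_metadata(text: str) -> List[AttributeMeta]:
    """Parse 'name,data type,attribute type' lines (hypothetical layout)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, data_type, attr_type = (field.strip() for field in line.split(","))
        if data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {data_type!r}")
        if attr_type not in ATTRIBUTE_TYPES:
            raise ValueError(f"unknown attribute type: {attr_type!r}")
        rows.append(AttributeMeta(name, data_type, attr_type))
    return rows


# Hypothetical meta data for a PlayTennis-style training set.
PLAY_TENNIS_META = """\
outlook,string,discrete
temperature,integer,continuous
humidity,integer,continuous
windy,string,discrete
play,string,class
"""

for attribute in parse_metadata(PLAY_TENNIS_META):
    print(attribute.name, attribute.data_type, attribute.attr_type)
```

Keeping the data type and the attribute type as separate columns lets an integer column be treated either as continuous or as discrete, depending on the meta data alone.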
3. Data representation
• Example: the training data for PlayTennis may include the following
case:
• C4.5 models an attribute value by a union structure to distinguish
discrete from continuous attributes.
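The union idea can be mimicked in Python with `ctypes`. This is only a sketch: the field names, and the choice of a 16-bit integer for discrete values and a 32-bit float for continuous ones, are assumptions for illustration and are not taken from the C4.5 sources.

```python
import ctypes


class AttValue(ctypes.Union):
    # Both fields share the same storage. Which one is valid for a given
    # attribute is decided by the meta data, not by the value itself.
    _fields_ = [
        ("discrete", ctypes.c_short),    # index into the attribute's value list
        ("continuous", ctypes.c_float),  # numeric value
    ]


def read_value(value: AttValue, attr_type: str):
    """Interpret the shared storage according to the attribute type."""
    if attr_type == "continuous":
        return value.continuous
    return value.discrete


outlook = AttValue()
outlook.discrete = 2          # e.g. the third value of a discrete attribute
humidity = AttValue()
humidity.continuous = 85.0    # a numeric measurement

print(read_value(outlook, "discrete"), read_value(humidity, "continuous"))
print(ctypes.sizeof(AttValue))  # size of the largest member
```

The union keeps every attribute value the same size, at the price of needing the attribute type from the meta data to read it back correctly.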
4.1 YaDT optimizations
• All the strategies implement several optimizations, mainly related to
the efficient computation of information gain.
① The first strategy computes the local threshold using the algorithm
of C4.5, which sorts the cases by means of the quicksort method.
② The second strategy also uses the algorithm of C4.5, but adopts a
counting sort method.
⇒ The strategy to adopt is selected according to an analytic comparison
of their efficiency.
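The two strategies differ only in how the cases are ordered before the threshold scan. The sketch below illustrates that split under stated assumptions: all function names are hypothetical, Python's built-in `sorted` (Timsort) stands in for quicksort, the counting sort assumes small-range integer values, and the threshold is simply the lower of two adjacent distinct values. It is not YaDT's actual code.

```python
import math
from collections import Counter


def entropy(class_counts):
    """Shannon entropy (in bits) of a class-count table."""
    total = sum(class_counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in class_counts.values() if n > 0)


def sort_comparison(values, labels):
    # Strategy 1 stand-in: a general comparison sort, O(n log n).
    return sorted(zip(values, labels), key=lambda pair: pair[0])


def sort_counting(values, labels):
    # Strategy 2 stand-in: counting (bucket) sort, O(n + range),
    # usable when values are integers in a small range.
    lo, hi = min(values), max(values)
    buckets = [[] for _ in range(hi - lo + 1)]
    for value, label in zip(values, labels):
        buckets[value - lo].append((value, label))
    return [pair for bucket in buckets for pair in bucket]


def best_threshold(sorted_pairs):
    """Scan sorted (value, label) pairs; return (gain, threshold) for 'value <= threshold'."""
    total = len(sorted_pairs)
    right = Counter(label for _, label in sorted_pairs)
    left = Counter()
    base = entropy(right)
    best = (0.0, None)
    for i in range(total - 1):
        value, label = sorted_pairs[i]
        left[label] += 1      # move case i from the right side to the left
        right[label] -= 1
        if value == sorted_pairs[i + 1][0]:
            continue          # a split is only possible between distinct values
        n_left = i + 1
        gain = (base
                - (n_left / total) * entropy(left)
                - ((total - n_left) / total) * entropy(right))
        if gain > best[0]:
            best = (gain, value)
    return best


values = [3, 1, 2, 3, 1]
labels = ["yes", "no", "no", "yes", "no"]
print(best_threshold(sort_comparison(values, labels)))
print(best_threshold(sort_counting(values, labels)))
```

Because the scan itself is linear, the sorting step dominates the cost, which is why choosing between a comparison sort and a counting sort per attribute can pay off.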
4.1 YaDT optimizations
• After splitting a node, a (weighted) subset of the cases is "pushed
down" to each child node. (pushed down = LIFO)
• YaDT builds a weighted array for each node.
• The depth-first strategy is slightly faster, since the following
optimization can be implemented.
• The breadth-first strategy uses less memory: it only needs to maintain
arrays of weights and case indexes for a total of at most 2∙|TS| elements.
This is the strategy adopted by YaDT.
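The memory bound can be seen in a small sketch of breadth-first growth. It assumes there are no missing values, so every case belongs to at most one node waiting to be expanded; the `grow_breadth_first` function and the `split` callback are hypothetical names, and the splitting rule is a placeholder rather than an information-gain test.

```python
from collections import deque


def grow_breadth_first(n_cases, split, max_depth=3):
    """Grow a tree level by level over case indexes 0..n_cases-1.

    `split(indexes, depth)` is a hypothetical callback returning one index
    list per child, or [] to make the node a leaf. Returns the largest
    number of array elements (indexes + weights) held by waiting nodes.
    """
    root = (list(range(n_cases)), [1.0] * n_cases, 0)  # indexes, weights, depth
    frontier = deque([root])
    peak = 0
    while frontier:
        # Elements held by all nodes that are still waiting to be expanded.
        held = sum(len(idx) + len(wts) for idx, wts, _ in frontier)
        peak = max(peak, held)
        indexes, weights, depth = frontier.popleft()  # FIFO order = breadth-first
        if depth >= max_depth:
            continue  # depth limit reached: the node becomes a leaf
        weight_of = dict(zip(indexes, weights))
        for child in split(indexes, depth):
            # Push each child's cases and their weights down to the next level.
            frontier.append((child, [weight_of[i] for i in child], depth + 1))
    return peak


def halve(indexes, depth):
    """Placeholder split: cut the index list in two halves."""
    if len(indexes) < 2:
        return []
    mid = len(indexes) // 2
    return [indexes[:mid], indexes[mid:]]


print(grow_breadth_first(8, halve))
```

Since the children of a node partition its cases and the parent's arrays are dropped once it is expanded, the waiting nodes never hold more than one index and one weight per case.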
4.2 Some experiments on efficiency
• TS name : the name of the training set
• |TS| : the number of cases
• NC : the number of class values
5. YaDT version 1.2.5
6. Conclusion
• a structured object-oriented programming implementation
• portable code over Windows (Visual Studio) and Linux (gcc)
• 32-bit and 64-bit executables
• a documented C++ library of classes
• compressed binary output/input of trees
• a command line tree builder and a Java GUI.