YaDT (Yet another Decision Tree builder)

Ah Young Shin
ayoung18@uos.ac.kr

Visual Communication Lab., Dept. Electronic Computer Engineering, University of Seoul

1. Introduction

• YaDT is a from-scratch, main-memory implementation of a C4.5-like decision tree algorithm.

• The algorithms were extended in the order ID3 (entropy) → C4.5 (information gain) → C5.0.

• Unfortunately, C4.5 (and EC4.5) are implemented in old-style K&R C, so the sources are hard to understand, profile, and extend.

• Experimental results are reported comparing YaDT with Weka, dti, and (E)C4.5.

1. Introduction - C4.5

• C4.5 addresses the following issues:

① Handling continuous (numeric) attributes

② Excluding irrelevant attributes

③ How deeply to grow the decision tree

④ Handling missing attribute values

⑤ Handling attributes with different costs

⑥ Improving computational efficiency

2. Meta data representation

• Each attribute has one of the following attribute types: discrete, continuous, weights, or class.

• The value of an attribute in a case belongs to a data type such as integer, float, double, or string (plus the special value ‘?’ or NULL for an unknown value).

• In summary, the meta data describing the training set TS in YaDT can be structured as a table with three columns: attribute name, data type, and attribute type.

• Such a table can be provided as a database table or as a text file.
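The three-column table can be pictured with a small Python sketch. The comma-separated layout, the names `AttributeMeta` and `parse_metadata`, and the PlayTennis rows are assumptions made for illustration; the slides do not show YaDT's actual file format.

```python
from dataclasses import dataclass

# Allowed values are taken from the slides; the file layout is an assumption.
DATA_TYPES = {"integer", "float", "double", "string"}
ATTR_TYPES = {"discrete", "continuous", "weights", "class"}


@dataclass(frozen=True)
class AttributeMeta:
    name: str        # attribute name
    data_type: str   # one of DATA_TYPES
    attr_type: str   # one of ATTR_TYPES


def parse_metadata(text: str) -> list[AttributeMeta]:
    """Parse one 'name,data_type,attr_type' row per non-empty line."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, data_type, attr_type = (field.strip() for field in line.split(","))
        if data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {data_type!r}")
        if attr_type not in ATTR_TYPES:
            raise ValueError(f"unknown attribute type: {attr_type!r}")
        rows.append(AttributeMeta(name, data_type, attr_type))
    return rows


# Hypothetical PlayTennis meta data, for illustration only.
PLAY_TENNIS_META = """\
outlook,string,discrete
temperature,integer,continuous
humidity,integer,continuous
windy,string,discrete
play,string,class
"""

meta = parse_metadata(PLAY_TENNIS_META)
```

Validating both type columns while loading means a misspelled type is reported before any tree building starts.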

3. Data representation

• Example: training data for PlayTennis may include the following case:

• C4.5 models an attribute value with a union structure to distinguish discrete from continuous attributes.
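The union idea can be illustrated roughly as follows. The sketch is written in Python with `ctypes` only so that it can be run; the member names `discr_val` and `cont_val`, the `short`/`float` widths, and the helper `read_value` are assumptions, not a copy of the C4.5 sources.

```python
import ctypes


class AttValue(ctypes.Union):
    # Both members share the same storage, as in a C union.
    _fields_ = [
        ("discr_val", ctypes.c_short),  # index of a discrete value
        ("cont_val", ctypes.c_float),   # value of a continuous attribute
    ]


def read_value(value: AttValue, attr_type: str):
    """Pick the union member according to the attribute type from the meta data."""
    return value.cont_val if attr_type == "continuous" else value.discr_val


# Hypothetical values for one case.
humidity = AttValue()
humidity.cont_val = 85.0   # continuous attribute
outlook = AttValue()
outlook.discr_val = 2      # discrete attribute, stored as a value index
```

The union itself does not record which member is valid, so the attribute type from the meta data has to decide how each stored value is read.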

4.1 YaDT optimizations

• All the strategies implement several optimizations, mainly related to the efficient computation of information gain.

① The first strategy computes the local threshold using the algorithm of C4.5, which in particular sorts cases by means of the quicksort method.

② The second strategy also uses the algorithm of C4.5, but adopts a counting sort method.

⇒ The strategy to adopt is selected according to an analytic comparison of their efficiency.
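The two strategies above differ only in how cases are ordered before the threshold scan, which the following Python sketch tries to make concrete. It is not YaDT code: the function names and the sample pairs are hypothetical, Python's built-in `sorted` stands in for quicksort, and the split is assumed to be `value <= threshold`.

```python
import math
from collections import Counter


def entropy(class_counts):
    """Shannon entropy (in bits) of a class-count mapping."""
    total = sum(class_counts.values())
    return -sum(
        (n / total) * math.log2(n / total) for n in class_counts.values() if n > 0
    )


def counting_sort(cases, lo, hi):
    """Sort (value, label) pairs with integer values in [lo, hi] in O(n + hi - lo)."""
    buckets = [[] for _ in range(hi - lo + 1)]
    for value, label in cases:
        buckets[value - lo].append((value, label))
    return [case for bucket in buckets for case in bucket]


def best_threshold(sorted_cases):
    """Scan sorted cases once; return the (threshold, gain) with maximal gain."""
    n = len(sorted_cases)
    right = Counter(label for _, label in sorted_cases)
    left = Counter()
    base = entropy(right)
    best = (None, 0.0)
    for i in range(n - 1):
        value, label = sorted_cases[i]
        left[label] += 1
        right[label] -= 1
        if value == sorted_cases[i + 1][0]:
            continue  # a threshold can only fall between distinct values
        n_left = i + 1
        gain = (
            base
            - (n_left / n) * entropy(left)
            - ((n - n_left) / n) * entropy(right)
        )
        if gain > best[1]:
            best = (value, gain)
    return best


# Hypothetical (value, class) pairs for one continuous attribute.
cases = [(70, "yes"), (90, "no"), (65, "yes"), (85, "no"), (75, "yes"), (80, "no")]
by_comparison = sorted(cases)                # comparison sort, O(n log n)
by_counting = counting_sort(cases, 65, 90)   # counting sort, O(n + range)
```

Both orderings lead to the same threshold, so the choice between them is purely a matter of cost. One plausible reading is that counting sort pays off when the range of values is small compared with n log n; the actual criterion used by YaDT is not given in the slides.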

• After splitting a node, a (weighted) subset of cases is “pushed down” to each child node. (pushed down = LIFO)

• YaDT builds a weighted array for each node.

• The depth-first strategy is slightly faster, since the following optimization can be implemented.

• The breadth-first strategy has better memory occupation, since it maintains arrays of weights and case indexes with a total of at most 2∙|TS| elements. (→ adopted by YaDT)
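The "push down" of weighted cases can be sketched as below. This follows the usual C4.5 convention of sending a case with an unknown value to every child with a proportionally reduced weight; the function `push_down`, its arguments, and the sample data are hypothetical and do not reproduce YaDT's internal arrays.

```python
def push_down(case_ids, weights, values):
    """Split a node's weighted cases among children keyed by attribute value.

    case_ids, weights: parallel arrays describing the cases at the node.
    values: maps a case id to its attribute value, or None when unknown.
    Returns {attribute value: (child case ids, child weights)}.
    """
    # Total weight of the cases whose value is known, per attribute value.
    known_weight = {}
    for cid, w in zip(case_ids, weights):
        v = values[cid]
        if v is not None:
            known_weight[v] = known_weight.get(v, 0.0) + w
    total_known = sum(known_weight.values())

    children = {v: ([], []) for v in known_weight}
    for cid, w in zip(case_ids, weights):
        v = values[cid]
        if v is not None:
            children[v][0].append(cid)
            children[v][1].append(w)
        else:
            # Unknown value: send a fraction of the case to every child.
            for child_value, child_weight in known_weight.items():
                children[child_value][0].append(cid)
                children[child_value][1].append(w * child_weight / total_known)
    return children


# Hypothetical node with four unit-weight cases; case 3 has an unknown value.
children = push_down(
    case_ids=[0, 1, 2, 3],
    weights=[1.0, 1.0, 1.0, 1.0],
    values={0: "sunny", 1: "sunny", 2: "rain", 3: None},
)
```

Here case 3 reaches both children, with weight 2/3 under "sunny" and 1/3 under "rain", so the total weight of the node is preserved across the split.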

4.2 Some experiments on efficiency

• TS name: the name of the training set

• |TS|: the number of cases

• NC: the number of class values

5. YaDT version 1.2.5

6. Conclusion

• a structured object-oriented programming implementation

• portable code over Windows (Visual Studio) and Linux (gcc)

• 32-bit and 64-bit executables

• a documented C++ library of classes

• compressed binary output/input of trees

• a command line tree builder and a Java GUI.