Discovering Unrevealed Properties of Probability Estimation Trees: on Algorithm Selection and Performance Explanation
Kun Zhang, Wei Fan, Bill Buckles, Xiaojing Yuan, and Zujia Xu
Dec. 21, 2006



What this Paper Offers

Preference of a probability estimation tree (PET): the signal-noise separability of datasets

Many important and previously unrevealed properties of PETs

A practical guide for choosing the most appropriate PET algorithm

Statistical Supervised Learning → Bayesian Decision Rule

Goal: Min_f E_{x,y}[L(f(x,θ), y)]. Since E_{x,y}[L(f(x,θ), y)] = E_x E_{y|x}[L(f(x,θ), y)], the optimal prediction is

y* = argmin_{y'} E_{y|x}[L(y, y')] = argmin_{y'} Σ_{j=1}^{C} P(y = j | x) L(y', y = j), where y' = f(x,θ), j = 1, ..., C

0-1 Loss: y* = argmin_{y'} E_{y|x}[L(y, y')] = argmax_y P(y | x)

Cost-Sensitive Loss: y* = argmin_{y'} Σ_{j=1}^{C} P(y = j | x) L(y', y = j)
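The two decision rules above can be sketched numerically. This is a minimal illustration; the posterior and the cost matrix below are invented toy values, not numbers from the paper.

```python
import numpy as np

# Toy posterior P(y = j | x) for classes j = 0, 1 (invented values).
posterior = np.array([0.7, 0.3])

# 0-1 loss: predict the class with the largest posterior probability.
pred_01 = int(np.argmax(posterior))

# Cost-sensitive loss L[y_pred, j] = cost of predicting y_pred when the
# truth is class j; here missing class 1 is 5x worse than a false alarm.
L = np.array([[0.0, 5.0],
              [1.0, 0.0]])
expected_loss = L @ posterior            # expected loss of each prediction
pred_cost = int(np.argmin(expected_loss))

print(pred_01, pred_cost)                # the two rules disagree here
```

Under 0-1 loss the rule picks class 0 (larger posterior), while the cost-sensitive rule picks class 1 because a miss is so expensive: this is why accurate posterior estimates, not just labels, matter.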

Why Probability Estimation? - Theoretical Necessity

Challenge: P(x,y) is unknown

so is P(y|x)!

Unknown distribution P(X,Y), y = F(x) → training set LS = {(x_i, y_i)}_{i=1}^{N} → learning algorithm, with a loss/cost function L → model f(x,θ): Min_f E_{x,y}[L(f(x,θ), y)]

Why Probability Estimation? - Practical Necessity

Medical Domain

Ozone level Prediction

Direct Marketing

• Non-static, skewed distributions
• Unequal loss (Yang, ICDM'05)
• Direct estimation of probability
• Decision threshold determination

Posterior Probability Estimation

• Parametric methods: assume the true, unknown distribution follows a "particular form"; fit via maximum likelihood estimation. E.g., naive Bayes, logistic regression.
• Non-parametric approaches: compute the posterior directly without making any distributional assumption. E.g., decision trees, nearest neighbors.

Decision trees offer a rather unbiased, flexible and convenient solution to posterior probability estimation.

PETs - Probabilistic View of Decision Trees

• P(y|x,θ) = N_y / N at each leaf, e.g., C4.5, CART
• Estimates serve as confidences in the predicted labels
• Appropriate thresholding for classification w.r.t. different loss functions
• The dependence of P(y|x,θ) on θ is non-trivial

Problems of Traditional PETs

1. Probability estimates obtained through frequency tend to be too close to the extremes of 1 and 0.
2. Additional inaccuracies result from the small number of examples within a leaf.
3. The same probability is assigned to the entire region of space defined by a given leaf.
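Problems 1 and 2 above can be made concrete with a small sketch. The leaf counts below are invented for illustration; the Laplace correction (N_y + 1)/(N + C) is the one C4.4 uses to soften extreme frequency estimates.

```python
# Raw frequency estimate at a leaf: N_y correct-class examples out of N.
def frequency(n_y, n):
    return n_y / n

# Laplace correction with C classes pulls the estimate toward 1/C.
def laplace(n_y, n, c=2):
    return (n_y + 1) / (n + c)

# A tiny leaf with 3 examples, all of class y (invented counts):
print(frequency(3, 3))   # 1.0 -- overconfident extreme
print(laplace(3, 3))     # 0.8 -- softened toward 0.5
```

With only three examples in the leaf, the raw frequency claims certainty (1.0), while the Laplace-corrected estimate (0.8) reflects the small sample size.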

• C4.4 (Provost, 03)
• CFT (Ling, 03)
• BPET (Breiman, 96)
• RDT (Fan, 03)

Popular PET Algorithms

Algorithm | Single/Multiple Models | Feature Selection Criterion | Probability Estimation Method | Pruning Strategy | Diversity Acquisition
C4.5 (Quinlan, 93) | Single | Gain Ratio | Frequency estimation: P(y|x,θ) = N_y / N | Error-based pruning | N/A
C4.4 (Provost, 03) | Single | Gain Ratio | Laplace correction: P(y|x,θ) = (N_y + 1)/(N + C) | No | N/A
CFT (Ling, 03) | Single | Gain Ratio | Aggregation over leaves (FE/LC at leaf node) | No, or error-based pruning | N/A
RDT (Fan, 03) | Multiple | Randomly chosen | Model averaging: P(y|x) = (1/B) Σ_{k=1}^{B} P(y|x,θ_k) | No, or depth constraint | Random manipulation of the feature set
BPET (Breiman, 96) | Multiple | Gain Ratio | Model averaging: P(y|x) = (1/B) Σ_{k=1}^{B} P(y|x,θ_k) | No | Random manipulation of the training set
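The model-averaging rule shared by RDT and BPET, P(y|x) = (1/B) Σ_k P(y|x,θ_k), is a one-liner in practice. The per-tree leaf estimates below are invented for illustration.

```python
import numpy as np

# Per-tree posterior estimates [P(y=0|x), P(y=1|x)] at the leaf that a
# single test point x falls into (invented values).
tree_estimates = np.array([
    [0.9, 0.1],   # tree 1
    [0.6, 0.4],   # tree 2
    [1.0, 0.0],   # tree 3: a raw-frequency leaf stuck at an extreme
])

# Model averaging across the B = 3 trees smooths the extremes.
p_avg = tree_estimates.mean(axis=0)
print(p_avg)   # approx [0.833, 0.167]
```

Note how the averaged estimate is pulled away from tree 3's extreme (1.0, 0.0), which is exactly how ensembles mitigate problems 1 and 2 of traditional PETs.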

Which one to choose? What performance is to be expected? Why should one PET be preferred over another?

Contributions

• A large-scale learning curve study using multiple evaluation metrics
• Preference of a PET: the signal-noise separability of datasets
• Many important and previously unrevealed properties of PETs: in ensembles, RDT is preferable on low-signal separability datasets, while BPET is favorable when the signal separability is high
• A practical guide for choosing the most appropriate PET algorithm

Analytical Tool # 1: AUC - Index of Signal-Noise Separability

• Signal-noise separability: correct identification of the information of interest amid noise factors that may interfere with this identification; a good analogy for the two different populations present in every learning domain with uncertainty.
• A synthetic scenario – tumor diagnosis:
  - Tumor: signal present; no tumor: signal absent
  - Based on a yes/no decision:
    1. P(yes|tumor): hit (TP)
    2. P(yes|no tumor): false alarm (FP)
    3. P(no|tumor): miss (FN)
    4. P(no|no tumor): correct reject (TN)

• An illustration
[Figure: ROC curves (TPR vs. FPR) for a low-separability and a high-separability distribution pair, and overlapping densities f(x|signal) and f(x|noise) with a decision criterion marking the hit, miss, false alarm, and correct reject regions.]

As the decision criterion moves, the relative areas of the four different outcomes vary, but the separation of the two distributions does not!

Analytical Tool # 1: AUC - Index of Signal-Noise Separability

• AUC: an index for the separability of signal from noise
• Domains have a high or low degree of signal separability:
  - High: deterministic / little noise
  - Low: stochastic / noisy

Analytical Tool # 2: Learning Curves

[Figure: example learning curves – MSE for BagPET, RDT, C4.4, C4.5, and CFT vs. the percentage of the 75% training pool.]

• Instead of CV or training-test splitting on a fixed data set size, learning curves show the generalization performance of different models as a function of the size of the training set.
• Correlations between performance metrics and training set sizes can be observed and possibly generalized over different data sets.
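The protocol above can be sketched as a small driver loop: hold out 25% of the data for testing, then train on growing fractions of the remaining 75%. Here `train` and `evaluate` are placeholders for a PET learner and an evaluation metric (AUC, MSE, or error rate); they are illustrative assumptions, not code from the paper.

```python
import numpy as np

def learning_curve(X, y, train, evaluate,
                   fractions=np.arange(0.1, 1.01, 0.1), seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = len(X) // 4                        # fixed 25% test split
    test, pool = idx[:n_test], idx[n_test:]     # pool = the 75% training pool
    scores = []
    for f in fractions:
        sub = pool[:max(1, int(f * len(pool)))]  # growing training subset
        model = train(X[sub], y[sub])
        scores.append(evaluate(model, X[test], y[test]))
    return list(fractions), scores
```

Plotting `scores` against `fractions` for each algorithm reproduces curves of the kind shown in the figures.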

Analytical Tool # 3: Multiple Evaluation Metrics

1. Area Under the ROC Curve (AUC)
   - Summarizes the "ranking capability" of a learning algorithm in ROC space
2. MSE (Brier Score)
   - A proper assessment of the "accuracy" of probability estimation:
     MSE = (1/N) Σ_{i=1}^{N} Σ_y (T(y|x_i) − P(y|x_i,θ))²
   - Calibration-refinement decomposition, with p̂ = P(y|x,θ):
     E_{p̂}[(p̂ − P(y=1|p̂))²] + E_{p̂}[P(y=1|p̂)(1 − P(y=1|p̂))]
     * Calibration measures the absolute precision of the probability estimates
     * Refinement indicates how confident the estimator is in its estimates
     * Visualization tools – reliability plots and sharpness graphs
3. Error Rate
   - An inappropriate criterion for evaluating probability estimates:
     (1/N) (Σ_{i: y_i = 1} I(p(y=1|x_i) < 0.5) + Σ_{i: y_i = 0} I(p(y=0|x_i) < 0.5))
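The Brier score and error rate above are easy to compute side by side. The labels and estimates below are invented toy values; note that the slide's two-class sum Σ_y (T − P)² is exactly twice the one-class binary form used here, since (y − p)² = ((1 − y) − (1 − p))².

```python
import numpy as np

# Toy test set: y holds true labels, p the estimated P(y=1|x).
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4, 0.7])

# MSE / Brier score (one-class binary form).
brier = np.mean((p - y) ** 2)

# Error rate at a 0.5 threshold: it only checks which side of 0.5 each
# estimate falls on, ignoring its precision -- hence a poor PET criterion.
error_rate = np.mean((p >= 0.5) != (y == 1))

print(round(brier, 3), error_rate)
```

Two estimators can have the same error rate here yet very different Brier scores, which is exactly why error rate alone is inappropriate for evaluating probability estimates.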

Experiment Results

Data set | Feature Types | MAX AUC | AUC: Single | AUC: Ensemble | MSE: Single | MSE: Ensemble | Error Rate: Single | Error Rate: Ensemble
Mushroom | Categorical | 1 | C4.4/C4.5 | RDT/BPET | C4.5 | RDT/BPET | C4.5/C4.4 | RDT/BPET
BC_wdbc | Continuous | 0.995 | CFT | RDT/BPET | C4.4 | RDT/BPET | C4.5 | RDT/BPET
Chess | Categorical | 0.99 | C4.4/CFT | BPET | C4.5/C4.4 | BPET | C4.5/C4.4 | BPET
BC_wisc | Continuous | 0.99 | CFT | RDT | C4.5/C4.4 | RDT | C4.5/C4.4 | RDT
HouseVote | Categorical | 0.99 | CFT/C4.4 | RDT/BPET | C4.5 | BPET | C4.5/C4.4 | BPET
Tic | Categorical | 0.99 | C4.4/CFT | BPET | C4.4 | BPET | C4.5/C4.4 | BPET
Hypothyroid | Mixed | 0.989 | CFT/C4.4 | RDT/BPET | C4.5 | BPET | C4.5 | BPET
Spam | Continuous | 0.98 | CFT | RDT | C4.4 | BPET/RDT | C4.5 | RDT
SickEuthyroid | Mixed | 0.98 | C4.4/CFT | BPET | C4.5 | BPET | C4.5 | BPET
Ionosphere | Continuous | 0.966 | C4.4/CFT | RDT/BPET | C4.4 | BPET | C4.5/C4.4 | BPET
Spectf | Continuous | 0.96 | CFT | RDT | C4.4 | RDT/BPET | C4.5/C4.4 | RDT/BPET
Australian | Mixed | 0.934 | CFT | RDT/BPET | C4.5 | BPET | C4.5 | BPET
Adult | Mixed | 0.9 | CFT | RDT/BPET | C4.5 | BPET | C4.5 | BPET
Sonar | Continuous | 0.88 | CFT/C4.4 | RDT | CFT/C4.4 | RDT/BPET | CFT/C4.4 | RDT
Hepatitis | Mixed | 0.87 | C4.4/CFT | RDT/BPET | CFT | RDT/BPET | C4.5/CFT | RDT
Pima | Continuous | 0.825 | CFT | RDT/BPET | CFT/C4.4 | RDT | CFT | RDT
Spect | Categorical | 0.816 | CFT | RDT | CFT | RDT | C4.5/CFT | RDT/BPET
Liver | Continuous | 0.75 | CFT | RDT/BPET | CFT | RDT | C4.5/CFT | RDT/BPET

Conjectures in Summary

1. RDT and CFT are better on AUC.
2. RDT is preferable on low-signal separability datasets, while BPET is favorable on high-signal separability datasets.
3. High-separability categorical datasets with limited feature values hurt RDT.
4. Among single trees, CFT is preferable on low-signal separability datasets.

Behind the Scenes - Why are RDT and CFT better on AUC?

• Superior capability of generating unique probabilities.

Unique Probabilities (Win-Loss-Tie)

AVG    | RDT        | C4.4   | C4.5       | CFT
BagPET | 0-14.9-3.1 | 18-0-0 | 18-0-0     | 11.6-4-2.4
RDT    |            | 18-0-0 | 18-0-0     | 16-0.6-1.4
C4.4   |            |        | 17.9-0-0.1 | 0.1-15.2-2.7
C4.5   |            |        |            | 0-17.3-0.7

STDEV  | RDT       | C4.4  | C4.5      | CFT
BagPET | 0-0.9-0.9 | 0-0-0 | 0-0-0     | 1.9-1.8-0.5
RDT    |           | 0-0-0 | 0-0-0     | 0.8-0.5-0.7
C4.4   |           |       | 0.3-0-0.3 | 0.3-1.6-1.3
C4.5   |           |       |           | 0-1.9-1.9

• AUC calculations:
  - Trapezoidal integration (Fawcett, 03)
  - Rank-based formula (Hand, 01): AUC = (Σ_i r_i − n_0(n_0 + 1)/2) / (n_0 n_1), where the r_i are the ranks of one class's scores among all scores and n_0, n_1 are the class sizes
• For a larger AUC, P(y|x,θ) should vary from one test point to another, so the number of unique probabilities is maximized as a result.
• Unique probabilities: RDT > BPET > CFT > C4.4 > C4.5
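The rank-based AUC formula cited above (Hand, 01) can be sketched directly. This version sums the ranks of the positive class (symmetric to the class-0 form), assumes no tied scores, and uses invented toy scores.

```python
import numpy as np

def rank_auc(scores, labels):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # 1-based ranks (no ties)
    n1 = int((labels == 1).sum())                  # positives
    n0 = len(labels) - n1                          # negatives
    s1 = ranks[labels == 1].sum()                  # rank sum of positives
    return (s1 - n1 * (n1 + 1) / 2) / (n0 * n1)

print(rank_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))   # 0.75
```

An estimator that assigns the same probability to many test points produces massive rank ties, which caps the achievable AUC: this is the mechanism behind the unique-probabilities ordering above.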

Behind the Scenes - Why is RDT (BPET) preferable on low (high) signal separability datasets?

• The reasons:
1. RDT discards any criterion for optimal feature selection.
2. It is more like a structure for data summarization.
3. When the signal separability is low, this property protects RDT from the danger of identifying noise as signal, or of overfitting on noise, which is very likely to be caused by the massive search and optimization adopted by BPET.
4. RDT provides an average probability estimate which approaches the mean of the true probabilistic values as more individual trees are added.

• The evidence (I) – Spect and Sonar, low-signal separability domains
[Figure: MSE and AUC learning curves for BagPET, RDT, C4.4, C4.5, and CFT vs. the percentage of the 75% training pool.]

• The evidence (II) – Pima, a low-signal separability domain
[Figure: reliability plots and sharpness graphs. RDT: Cal = 0.0036; BPET: Cal = 0.018.]

• The evidence (III) – Spam, a high-signal separability domain
[Figure: reliability plots and sharpness graphs. RDT: Cal = 0.013; BPET: Cal = 0.0038.]

Behind the Scenes - Why do high-separability categorical datasets with limited feature values hurt RDT?

• The observations – Tic-tac-toe and Chess
[Figure: MSE and AUC learning curves for BagPET, RDT, C4.4, C4.5, and CFT vs. the percentage of the 75% training pool.]

• The reason: high-separability categorical datasets with limited values tend to restrict the degree of diversity that RDT's random feature selection can explore.
  - Random feature selection mechanism of RDT:
    • Categorical features: chosen at most once along a decision path;
    • Continuous features: chosen multiple times, but with a different splitting value each time.

Behind the Scenes - Why is CFT preferable on low-signal separability datasets?

• The reasons:
1. Low-signal separability domains: good performance benefits from the probability aggregation mechanism, which rectifies errors introduced into the probability estimates by attribute noise.
2. High-signal separability domains: aggregating the estimated probabilities from other, irrelevant leaves adversely affects the final probability estimates.

• The evidence (I) – Spect and Pima, low-signal separability domains
[Figure: MSE and AUC learning curves for BagPET, RDT, C4.4, C4.5, and CFT vs. the percentage of the 75% training pool.]

• The evidence (II) – Liver, a low-signal separability domain
[Figure: reliability plots and sharpness graphs. CFT: Cal = 0.0044; C4.4: Cal = 0.081.]


Choosing the Appropriate PET Algorithm Given a New Problem

Given a dataset, estimate its signal-noise separability through the AUC score of RDT or BPET, then decide between ensembles and single trees:

• AUC < 0.9 (low signal-noise separability):
  - Ensemble (AUC, MSE, error rate): RDT
  - Single trees (AUC, MSE, error rate): CFT
• AUC >= 0.9 (high signal-noise separability):
  - Ensemble – check feature types and value characteristics:
    • Categorical features with limited values: BPET
    • Continuous features (or categorical features with a large number of values): RDT (or BPET)
  - Single tree:
    • AUC: CFT
    • MSE, error rate: C4.5 or C4.4
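The decision guide above can be sketched as a function. The 0.9 AUC cutoff and the branch outcomes come from the flowchart; the function itself and its parameter names are illustrative glue, not part of the paper.

```python
def choose_pet(auc_estimate, ensemble_ok, metric="AUC",
               categorical_limited_values=False):
    high_sep = auc_estimate >= 0.9
    if ensemble_ok:
        if high_sep:
            # Limited-value categorical features restrict RDT's diversity.
            return "BPET" if categorical_limited_values else "RDT (or BPET)"
        return "RDT"
    # Single trees:
    if not high_sep:
        return "CFT"
    return "CFT" if metric == "AUC" else "C4.5 or C4.4"

print(choose_pet(0.75, ensemble_ok=True))                  # RDT
print(choose_pet(0.98, ensemble_ok=False, metric="MSE"))   # C4.5 or C4.4
```

In practice `auc_estimate` would come from a quick RDT or BPET run on the new dataset, as the flowchart prescribes.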

Summary

• AUC: index of signal-noise separability
• The preference of a PET on multiple evaluation metrics correlates with the "signal-noise separability" of the dataset and other observable statistics
• Many important and previously unrevealed properties of PETs are analyzed
• A practical guide for choosing the most appropriate PET algorithm

Thank you!

Questions?