A Novel Structure Learning Algorithm for Optimal Bayesian Network: Best Parents
Andrew Kreimer, Dr. Maya Herman
Dept. of Mathematics and Computer Science, The Open University of Israel, Ra'anana, Israel
Agenda
- Introduction
- Motivation
- Rationale
- Best Parents algorithm
- Experiments
- Conclusions & future research
Introduction: Bayesian Networks
- DAG (Directed Acyclic Graph)
- CPT (Conditional Probability Table) for each feature
- Guiding estimation rule: Bayes' theorem, P(A|B) = P(B|A)P(A)/P(B) (a worked illustration follows below)
- Categorical features
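As a quick numeric illustration of the estimation rule (not from the slides; all probabilities here are hypothetical), Bayes' theorem applied end to end, with P(B) expanded via the law of total probability:

```java
// Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
// Hypothetical numbers, chosen only to show the mechanics.
public class BayesRule {
    public static void main(String[] args) {
        double pA = 0.01;           // prior P(A)
        double pBgivenA = 0.9;      // likelihood P(B|A)
        double pBgivenNotA = 0.05;  // likelihood P(B|~A)
        // Law of total probability: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
        double pB = pBgivenA * pA + pBgivenNotA * (1 - pA);
        double pAgivenB = pBgivenA * pA / pB;  // posterior P(A|B)
        System.out.printf("P(A|B) = %.4f%n", pAgivenB);  // ~0.1538
    }
}
```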
Introduction: Iris example
Source: UCI ML, WEKA
Introduction: Structure Learning
- Search problem (finding the best structure)
- Optimization problem (minimization/maximization of structure metrics)
- Existing algorithms: K2, TAN (Tree Augmented Naive Bayes), Hill Climbing, Simulated Annealing, Tabu, GA (Genetic Algorithms), etc.
Motivation
- Structure learning is a complex optimization problem
- Avoid feature ordering, DAG validity checks and structure metrics
- Deterministic solution
- Optimal Bayesian Network
Best Parents: Rationale
- Rely on the direct quality of features
- Incorporate attribute relation metrics (Conditional Entropy) to find the best rules
- Greedy construction method
- Structure learning in a deterministic, simple way
- Top-down approach
- No feature ordering, DAG validation or structure metrics
Best Parents: Feature Direction
- Find the optimal structure using only attribute relations
- Bayesian Networks provide an immediate visual view of attribute dependencies relative to each other
- Some attributes may influence several other attributes
- Usually a bounded number of relations is applied to avoid infeasible structures
Best Parents: Feature Relations
- A measure of directional impact is applied (Conditional Entropy); a computation sketch follows below
- Zero conditional entropy reflects complete dependence
- Let us define relations:
  - A → B: the best child of A is B
  - A ← B: B is the best parent of A
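A minimal sketch of the relation metric, assuming the standard definition H(B|A) = -Σ p(a,b) log₂ p(b|a) over two categorical columns (the slides name Conditional Entropy but not this exact form):

```java
import java.util.HashMap;
import java.util.Map;

// Conditional entropy H(B|A) of two categorical columns; zero means
// B is fully determined by A, matching the dependence reading above.
public class ConditionalEntropy {
    static double hGivenA(String[] a, String[] b) {
        int n = a.length;
        Map<String, Integer> countA = new HashMap<>();
        Map<String, Integer> countAB = new HashMap<>();
        for (int i = 0; i < n; i++) {  // count instantiations
            countA.merge(a[i], 1, Integer::sum);
            countAB.merge(a[i] + "\u0000" + b[i], 1, Integer::sum);
        }
        double h = 0.0;
        for (Map.Entry<String, Integer> e : countAB.entrySet()) {
            double pab = e.getValue() / (double) n;  // p(a,b)
            double pa = countA.get(e.getKey().split("\u0000")[0]) / (double) n;
            h -= pab * Math.log(pab / pa) / Math.log(2);  // p(b|a) = p(a,b)/p(a)
        }
        return h;
    }

    public static void main(String[] args) {
        String[] a = {"x", "x", "y", "y"};
        String[] b = {"1", "1", "0", "0"};
        System.out.println(hGivenA(a, b)); // 0.0 - B fully determined by A
    }
}
```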
Best Parents: Pseudo Code
BestParents(dataset):
1. Count instantiations
2. Calculate Conditional Entropy
3. Save expansion rules (sorted)
4. Expand structure (greedy algorithm with a black list)
(a runnable sketch of these steps follows below)
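One possible reading of those four steps as runnable Java (a sketch, not the authors' implementation: the rule format, the single-parent invariant, and the black-list semantics are my assumptions):

```java
import java.util.*;

// Greedy expansion over rules (parent, child, conditional entropy),
// assumed pre-sorted ascending by entropy (best rules first). The black
// list is assumed to stop a node from receiving a second parent, and an
// ancestor walk rejects rules that would close a cycle, so the result
// is a valid DAG without a separate validation pass.
public class BestParentsExpand {
    record Rule(String parent, String child, double h) {}

    static Map<String, String> expand(List<Rule> sortedRules) {
        Map<String, String> parentOf = new LinkedHashMap<>(); // child -> parent
        Set<String> blackList = new HashSet<>();
        for (Rule r : sortedRules) {
            if (blackList.contains(r.child())) continue;  // already expanded
            if (createsCycle(parentOf, r.parent(), r.child())) continue;
            parentOf.put(r.child(), r.parent());
            blackList.add(r.child());
        }
        return parentOf;
    }

    // Walk up the parent chain from `parent`; reaching `child` means a cycle.
    static boolean createsCycle(Map<String, String> parentOf, String parent, String child) {
        for (String cur = parent; cur != null; cur = parentOf.get(cur))
            if (cur.equals(child)) return true;
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
            new Rule("f5", "f4", 0.2), new Rule("f1", "f8", 0.3),
            new Rule("f7", "f2", 0.5), new Rule("f2", "f8", 0.8));
        System.out.println(expand(rules)); // {f4=f5, f8=f1, f2=f7}
    }
}
```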
Best Parents: Construction Example
a. Best child rules: the best child of attribute f5 is f4
   Best child rule notation: we have f4 → f6, f1 → f8
b. Best parent rules: the best parent of attribute f2 is f7
   Best parent rule notation: we have f3 ← f1, f9 ← f8
Source: WEKA BN explorer
Best Parents: Sorted Rules
Best Children:
- f4 → f6, 0.2
- f1 → f8, 0.3
Best Parents:
- f3 ← f1, 0.5
- f9 ← f8, 0.8
How to expand:
- Best Parents (the best)
- Best Children
- Combine them all (full list)
Best Parents: Complexity
- Iterate over all features m and samples n
- Counting joint instantiations for every feature pair is O(n·m²)
- Sorting and scanning the m² direction rules is O(m²·log m)
- Combining the two running times we get O(m²·(n + log m))
- No iterative optimization or closed-space search
Experiments
- Comparison to Random Forest, K2, TAN, Hill Climber, Tabu search and Naïve Bayes
- Implemented in Java using the WEKA environment (a sample evaluation harness follows below)
- Public datasets
- Minor feature selection (by cardinality) and data preprocessing
- No feature engineering
- Two key factors for performance assessment: normalized AUC and running time
- Random 0.7/0.3 train-test split
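A sketch of what one baseline run in such a harness could look like in WEKA (the 0.7/0.3 random split and the Java/WEKA stack are from the slides; the file name, seed, and choice of K2 here are illustrative assumptions):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.search.local.K2;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// One baseline measurement: AUC plus training time for a BN learned
// with a local search algorithm (K2 shown; others are swapped in alike).
public class BaselineAuc {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("dataset.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(42));                   // random split
        int cut = (int) Math.round(data.numInstances() * 0.7);
        Instances train = new Instances(data, 0, cut);
        Instances test  = new Instances(data, cut, data.numInstances() - cut);

        BayesNet bn = new BayesNet();
        bn.setSearchAlgorithm(new K2());                  // one of the baselines
        long t0 = System.currentTimeMillis();
        bn.buildClassifier(train);
        long elapsed = System.currentTimeMillis() - t0;   // running-time factor

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(bn, test);
        System.out.printf("AUC=%.4f time=%dms%n", eval.areaUnderROC(0), elapsed);
    }
}
```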
Experiment: Kaggle BNP Paribas
Goal:
- Reveal the optimal BN
- Accelerate the claims management process
Data: contains several nominal features with high cardinality (excluded)
Results: Best Parents converged faster, TAN has a higher AUC
Conclusion: Best Parents is better when AUC and running time are combined
[Bar chart: AUC performance of RandomForest, BN: Local TAN, New BN: Best Parents, New BN: Full List, BN: Local K2, BN: Local Hill Climber, BN: Local Tabu and Naive Bayes; AUC axis 0.65-1]
Source: Kaggle BNP Paribas
Experiment: Criteo
Goal:
- Reveal the optimal BN
- Predict clicks (estimate CTR)
Data:
- Widely used as a benchmark for large-scale training
- High cardinality of nominal features
- Originally contains 40 features, 14 of them numerical
- Tested on a small sample dataset with only the 20 features of lowest cardinality
Results: Best Parents converged faster, TAN has a higher AUC
Conclusion: Best Parents is better when AUC and running time are combined
[Bar chart: AUC performance of RandomForest, BN: Local TAN, New BN: Best Parents, New BN: Full List, BN: Local K2, BN: Local Hill Climber, BN: Local Tabu and Naive Bayes; AUC axis 0.68-0.77]
Source: Kaggle Criteo, Criteo
Experiment: Kaggle Homesite
Goal:
- Reveal the optimal BN
- Target potential customers of insurance plans
Data: mostly numeric features
Results: Best Parents converged faster, TAN has a higher AUC
Conclusion: Best Parents is better when AUC and running time are combined
[Bar chart: AUC performance of RandomForest, BN: Local TAN, New BN: Best Parents, New BN: Full List, BN: Local Tabu, BN: Local K2, BN: Local Hill Climber and Naive Bayes; AUC axis 0.8-1]
Source: Kaggle Homesite
Experiment: Poker Hand
Goal:
- Reveal the optimal BN
- Classify poker hands
Data:
- 5 cards and the hand class
- ML is not efficient at solving this task
- A simple algorithm can identify the strength of a given hand deterministically (sketched below)
Results: Best Parents converged faster, TAN has a higher AUC
Conclusion: Best Parents is better when AUC and running time are combined
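To make the deterministic-grading claim concrete (not from the slides), a partial hand grader built from counting alone; straights and straight flushes are omitted, and ranks/suits follow the UCI encoding (ranks 1-13, suits 1-4):

```java
import java.util.HashMap;
import java.util.Map;

// Grades a 5-card hand with a few counting rules - no learning needed.
public class HandStrength {
    static String grade(int[] ranks, int[] suits) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int r : ranks) counts.merge(r, 1, Integer::sum);
        boolean flush = true;
        for (int s : suits) flush &= (s == suits[0]);
        int pairs = 0, maxOfAKind = 1;
        for (int c : counts.values()) {
            if (c == 2) pairs++;
            maxOfAKind = Math.max(maxOfAKind, c);
        }
        if (maxOfAKind == 4) return "four of a kind";
        if (maxOfAKind == 3 && pairs == 1) return "full house";
        if (flush) return "flush";
        if (maxOfAKind == 3) return "three of a kind";
        if (pairs == 2) return "two pair";
        if (pairs == 1) return "one pair";
        return "high card";
    }

    public static void main(String[] args) {
        System.out.println(grade(new int[]{7, 7, 2, 9, 9},
                                 new int[]{1, 2, 3, 4, 1})); // two pair
    }
}
```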
[Bar chart: AUC performance of RandomForest, BN: Local TAN, New BN: Best Parents, New BN: Full List, BN: Local K2, Naive Bayes, BN: Local Tabu and BN: Local Hill Climber; AUC axis 0.48-0.98]
Source: UCI ML
Experiment: NYSE Stocks
Goal:
- Reveal the optimal BN
- Classify trends and reveal trading signals
Data:
- 1.59 million samples and 95 features
- Technical indicators
- Mostly numeric features
- Binary class: trend up or down
Steps (sketched below):
- Repeated random sampling (30 times), 5% sample size, 70%-30% train-test split
- ANOVA between all of the classifier performances
- Two t-tests between Best Parents and the two top classifiers (highest AUC)
Results:
- Significant difference of variances
- Significant difference of means
- Best Parents has the higher mean (combined AUC and running time)
Conclusion: Best Parents is optimal in terms of performance comprising AUC and running time
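A sketch of that significance protocol using Apache Commons Math, which the slides do not name (library choice and all numbers below are assumptions; each array would hold one classifier's 30 combined AUC/running-time scores):

```java
import java.util.Arrays;
import org.apache.commons.math3.stat.inference.OneWayAnova;
import org.apache.commons.math3.stat.inference.TTest;

// ANOVA across all classifiers, then pairwise t-tests of Best Parents
// against the two top classifiers, as the steps above describe.
public class SignificanceCheck {
    public static void main(String[] args) {
        double[] bestParents = {0.91, 0.90, 0.92};  // placeholder samples
        double[] tan         = {0.88, 0.89, 0.87};
        double[] k2          = {0.85, 0.86, 0.84};

        double anovaP = new OneWayAnova()
            .anovaPValue(Arrays.asList(bestParents, tan, k2));
        TTest t = new TTest();
        System.out.printf("ANOVA p=%.4f, t-test vs TAN p=%.4f, vs K2 p=%.4f%n",
            anovaP, t.tTest(bestParents, tan), t.tTest(bestParents, k2));
    }
}
```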
Experiment: Features analysis
[Chart: AUC of RF, BN: Local TAN, New BN: Best Parents, New BN: Full List, BN: Local K2, BN: Local Hill Climber, BN: Local Tabu and Naive Bayes by number of attributes (131, 44, 20, 11); AUC axis 0.65-0.95]
Source: Kaggle, UCI ML
Takeaway: a higher number of features does not improve mining
Experiment: Dataset analysis
[Chart: AUC of RandomForest, BN: Local TAN, New BN: Best Parents, New BN: Full List, BN: Local K2, BN: Local Hill Climber, BN: Local Tabu and Naive Bayes by number of samples (100,000; 114,321; 260,753; 1,000,000); AUC axis 0.65-1]
Source: Kaggle, UCI ML
Takeaway: a higher number of samples improves mining
Conclusion
- Avoided preprocessed metadata:
  - Feature ordering
  - DAG validity checks
  - Structure metrics
- Deterministic solution
- Substantial optimality in the combination of running time and AUC
Future Research
- Parallelized implementation
- Applications at large scale
- Improved attribute relation selection
- Expansion paths
Thank You!
Andrew Kreimer, Dr. Maya Herman
Dept. of Mathematics and Computer Science, The Open University of Israel, Ra'anana, Israel