1
Optimal Bayesian Networks
Advanced Topics in Computer Science Seminar Supervisor: Dr. Herman Maya
Author: Kreimer Andrew
2
Data Mining
› Massive amounts of data: Petabyte, Terabyte
› Data evolution
› Multidisciplinary field
› Data Warehouse
› OLAP & OLTP
› Preprocessing
› KDD – Knowledge Discovery in Databases
› One truth
3
Data Mining Methods
› Clustering – bank clients: private or business
› Association Rules – YouTube suggestions, Amazon checkout suggestions
› Classification and Prediction – SPAM mail classification, FX trend prediction, payment power prediction
› Integration – a clustered client gets a specific classification model
4
The Bayesian Approach
› Probability & Statistics
– Instances – the classical (frequentist) approach
– A priori/a posteriori knowledge – the Bayesian approach
› Bayes' Theorem
– P(A|B) = P(B|A)P(A)/P(B)
› MAP – Maximum A Posteriori
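The theorem and the MAP rule amount to a few lines of arithmetic. A minimal sketch for a binary hypothesis, with made-up probabilities:

```python
# Hypothetical prior and likelihoods, for illustration only.
p_h = {"A": 0.3, "not A": 0.7}            # priors P(H)
p_b_given_h = {"A": 0.8, "not A": 0.2}    # likelihoods P(B|H)

# Evidence by total probability: P(B) = sum over h of P(B|h) * P(h)
p_b = sum(p_b_given_h[h] * p_h[h] for h in p_h)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
posterior_a = p_b_given_h["A"] * p_h["A"] / p_b

# MAP: pick the hypothesis with the highest posterior. P(B) is the
# same for every hypothesis, so comparing P(B|h) * P(h) is enough.
map_h = max(p_h, key=lambda h: p_b_given_h[h] * p_h[h])
```

Dropping the shared denominator is exactly why MAP classification never needs P(B) explicitly.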
5
Bayesian Classifier
› Describe a client by age and income
› P(X) – probability that a client aged 25 with an income of 5000 exists
› P(H) – probability that a client buys a guitar
› P(X|H) – probability of observing client X given that the client bought a guitar
› P(H|X) – probability that client X buys a guitar
› P(H|X) = P(X|H)P(H)/P(X)
› The Naïve approach – independent variables
6
Naïve Bayes Classifier
› The optimal (full joint) classifier is not practical
› Assumes the variables are independent given the class
› Zero probabilities break the product:
– Laplace – add one dummy record per value
– m-estimate – assume m virtual records
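The zero-probability fixes are easy to see with counts. A small sketch with hypothetical counts where one value was never observed:

```python
from collections import Counter

# Hypothetical counts where one value ("down") was never observed.
counts = Counter({"up": 7, "down": 0, "flat": 3})
values = list(counts)
n = sum(counts.values())  # 10 records

# Raw estimate: P(down) = 0/10 = 0, which zeroes out the whole
# Naive Bayes product no matter what the other attributes say.
raw = {v: counts[v] / n for v in values}

# Laplace correction: pretend every value was seen once more.
laplace = {v: (counts[v] + 1) / (n + len(values)) for v in values}

# m-estimate: assume m virtual records distributed by a prior p
# (uniform here); a generalization of the Laplace correction.
m, p = 3, 1 / len(values)
m_est = {v: (counts[v] + m * p) / (n + m) for v in values}
```

Both corrections keep the estimates summing to 1 while guaranteeing no value gets probability zero.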
7
Classification example using Naïve Bayes Classifier

News in EU  News in US  EU GDP  US GDP  EURUSD
bad         bad         Up      Down    Up
bad         good        Down    Down    Up
good        bad         Up      Up      Down
good        good        Up      Up      Up
bad         bad         Down    Up      Down
good        bad         Down    Up      Down
bad         good        Up      Down    Up
bad         bad         Up      Down    Down
good        good        Up      Up      Up
bad         good        Down    Down    Up

NewsEu, NewsUs ∈ {bad, good}
EuGDP, UsGDP, EURUSD (class) ∈ {Up, Down}

Let's try to classify trends in the FX market using four attributes: news in Europe, news in the US, GDP in Europe and GDP in the US. Each instance is a monthly measurement. The news attributes describe the general market temperament; the GDP attributes describe the change relative to the previous period.
8
Classification example using Naïve Bayes Classifier
› Let's classify a new instance:
– X = (NewsEu = good, NewsUS = bad, EuGDP = up, UsGDP = up)
› We start with the class priors:
– P(EURUSD = Up) = 6/10 = 0.6
– P(EURUSD = Down) = 4/10 = 0.4
› Then we calculate the conditional probabilities:
– P(NewsEu = good | EURUSD = Up) = 2/6 = 0.33
– P(NewsUs = bad | EURUSD = Up) = 1/6 = 0.16
– etc.
9
Classification example using Naïve Bayes Classifier
› The classification (likelihoods):
› P(X|EURUSD = up) = P(NewsEu = good | EURUSD = up) * P(NewsUs = bad | EURUSD = up) * P(EuGDP = up | EURUSD = up) * P(UsGDP = up | EURUSD = up) = 0.33 * 0.16 * 0.66 * 0.33 ≈ 0.0115
› P(X|EURUSD = down) = P(NewsEu = good | EURUSD = down) * P(NewsUs = bad | EURUSD = down) * P(EuGDP = up | EURUSD = down) * P(UsGDP = up | EURUSD = down) = 0.5 * 1 * 0.5 * 0.75 = 0.1875
› Using MAP (each likelihood is multiplied by its class prior; the evidence P(X) is common to both classes and can be dropped):
› max{P(X|EURUSD=up)P(EURUSD=up), P(X|EURUSD=down)P(EURUSD=down)} = max{0.0115 * 0.6, 0.1875 * 0.4} = max{0.0069, 0.075} = 0.075, attained by Down
› Conclusion: the trend is down, so we should sell EURUSD.
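The whole worked example can be reproduced in a few lines. A sketch where the `data` list encodes the ten monthly instances from the table, using exact fractions instead of the rounded slide values:

```python
from collections import Counter

# (NewsEu, NewsUs, EuGDP, UsGDP) -> EURUSD class, from the slide's table.
data = [
    (("bad", "bad", "up", "down"), "up"),
    (("bad", "good", "down", "down"), "up"),
    (("good", "bad", "up", "up"), "down"),
    (("good", "good", "up", "up"), "up"),
    (("bad", "bad", "down", "up"), "down"),
    (("good", "bad", "down", "up"), "down"),
    (("bad", "good", "up", "down"), "up"),
    (("bad", "bad", "up", "down"), "down"),
    (("good", "good", "up", "up"), "up"),
    (("bad", "good", "down", "down"), "up"),
]

def classify(x):
    classes = Counter(c for _, c in data)
    n = sum(classes.values())
    scores = {}
    for c, cnt in classes.items():
        score = cnt / n                            # prior P(c)
        rows = [f for f, cc in data if cc == c]
        for i, v in enumerate(x):                  # naive product of P(x_i | c)
            score *= sum(1 for f in rows if f[i] == v) / cnt
        scores[c] = score
    return max(scores, key=scores.get), scores

label, scores = classify(("good", "bad", "up", "up"))
```

With exact fractions the MAP scores are 0.075 for Down versus about 0.0074 for Up, so the classifier sells EURUSD.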
10
Bayesian Network
› Graphical probabilistic model
› DAG
› CPT for each attribute
› d-separated, d-connected
› A -> D, D -> A
› P(C|A,B,D,E) = P(C|A,B,D)
– E and C are d-separated
11
Probability Inference
› Probability calculation:
› Given A, B, C, D & E, calculate P(A, B, C, D, E):
› Given A, B, D & E, calculate C by using MAP:
› Given A, C, D & E, calculate B by using Bayes' Theorem:
– P(A|B) = P(B|A) * P(A)/P(B) //Bayes' Theorem
– P(B) = P(B,A) + P(B,¬A) = P(B|A) * P(A) + P(B|¬A) * P(¬A)
…
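The joint probability of a full assignment factorizes over the DAG: each node is multiplied in given its parents. A sketch with hypothetical CPTs for a network with edges A→C, B→C, D→C and D→E (the numbers are made up); with this structure E and C are d-separated given D, as on the previous slide:

```python
import itertools

# Hypothetical CPTs (illustrative numbers only).
p_a = {True: 0.4, False: 0.6}
p_b = {True: 0.7, False: 0.3}
p_d = {True: 0.5, False: 0.5}

def p_c(a, b, d):  # P(C = True | A, B, D)
    return 0.9 if (a and b) else (0.6 if d else 0.2)

def p_e(d):        # P(E = True | D)
    return 0.8 if d else 0.1

def joint(a, b, c, d, e):
    """Chain rule over the DAG: multiply each node given its parents."""
    pc, pe = p_c(a, b, d), p_e(d)
    return (p_a[a] * p_b[b] * p_d[d]
            * (pc if c else 1 - pc)
            * (pe if e else 1 - pe))

# The joint must sum to 1 over all 2^5 assignments.
total = sum(joint(*v) for v in itertools.product([True, False], repeat=5))

def p_c_given(a, b, d, e):
    """P(C = True | A, B, D, E) computed from the joint."""
    num = joint(a, b, True, d, e)
    return num / (num + joint(a, b, False, d, e))

# Observing E does not change the belief in C once D is known:
p1 = p_c_given(True, False, True, True)
p2 = p_c_given(True, False, True, False)
```

The equality of `p1` and `p2` is exactly the slide's P(C|A,B,D,E) = P(C|A,B,D): the P(E|D) factor cancels out of the conditional.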
12
Dynamic Bayesian Network
› Bayesian Network extension
› Time slice attributes relations
› Matrix of attributes and time slices
› Time series
› Cycles are allowed (across time slices)

(Figure: nodes X1…X4 unrolled over time, and a matrix with one row per time slice, Time 1 … Time n, and one column per attribute)
13
Bayesian Network Example
› Let’s try to predict trends in EURUSD
› Binary class variable: Up or Down
› Attributes: Open, High, Low, Close, MA100, MA200 (100- and 200-period moving averages)
› Class: ClassTrend
14
Bayesian Network Example
(Figure: the learned network structure, BN, and its conditional probability tables, CPT)
15
Bayesian Network Learning
› Structure is given by a field expert (Wish You Were Here)
› Structure learning – a computational barrier
– The number of possible structures grows super-exponentially with the number of nodes
– Heuristics
– Metrics for evaluating structures: local, global, d-separation
› Conditional Probability Tables calculation
16
Bayesian Network Learning
› Attributes ordering:
– Given an ordering, Xi is a candidate parent of Xj iff Xi comes before Xj
– Possible parents come before the node in the order
› Structure
– DAG

(Figure: the ordering X1, X3, X2 (left) and a compatible structure (right))
17
Network Scoring
› Structures are evaluated by scoring (global/local)
› Bayesian Dirichlet – BD
› BDeu – Bayesian Dirichlet equivalent uniform
› MDL given model M and dataset D:
– Description cost: DL(M, D) = DL(M) + DL(D | M), the cost of describing the model plus the data given the model
– Looking for the minimum (or the maximum of the negated score)
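The trade-off behind MDL can be sketched numerically. In the common formulation, the model cost is (log2 N)/2 bits per free parameter and the data cost is the negated log-likelihood in bits; the log-likelihoods and parameter counts below are illustrative assumptions, not measured values:

```python
import math

def mdl_score(loglik, num_params, num_records):
    """MDL score of a candidate structure: lower is better."""
    model_cost = (math.log2(num_records) / 2) * num_params  # bits to describe M
    data_cost = -loglik / math.log(2)                       # nats -> bits
    return model_cost + data_cost

# A denser structure fits the data better (higher log-likelihood) but
# costs more bits to describe, so MDL can still prefer the sparse one:
sparse = mdl_score(loglik=-620.0, num_params=9, num_records=435)
dense = mdl_score(loglik=-600.0, num_params=60, num_records=435)
```

Here the 20-nat likelihood gain of the dense structure does not pay for its 51 extra parameters, so the sparse structure wins.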
18
Bayesian Network Learning Algorithms
› Gradient Descent
– Structure is given, CPTs are to be calculated
– Some of the a priori probabilities are missing
– Infinitesimal (gradient) approximation
› K2
– Well known
– Greedy algorithm
– Each node has a maximum number of parents
– Parents are added gradually (starting from none)
– Attributes ordering is given
– Look for the structure with the highest score
– Stop when no better structure is found
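The K2 loop above can be sketched in a few lines. The real K2 uses the Bayesian (BD) score; in this sketch a log-likelihood score with an MDL penalty stands in for it, and the data, ordering and parent limit are toy assumptions:

```python
import math
from collections import Counter

def local_score(data, child, parents):
    """Per-node log-likelihood minus an MDL penalty (a stand-in for
    the Bayesian score the real K2 uses)."""
    n = len(data)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    marg = Counter(tuple(r[p] for p in parents) for r in data)
    ll = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    return ll - (math.log2(n) / 2) * len(marg)

def k2(data, order, max_parents=1):
    structure = {}
    for i, child in enumerate(order):
        parents = []
        best = local_score(data, child, parents)
        candidates = list(order[:i])  # only predecessors may be parents
        while len(parents) < max_parents:
            scored = [(local_score(data, child, parents + [c]), c)
                      for c in candidates]
            if not scored or max(scored)[0] <= best:
                break  # no candidate improves the score: stop greedily
            best, chosen = max(scored)
            parents.append(chosen)
            candidates.remove(chosen)
        structure[child] = parents
    return structure

# Toy data: B is a copy of A, C is unrelated to both.
data = [{"A": i % 2, "B": i % 2, "C": 1 if i % 3 == 0 else 0}
        for i in range(40)]
net = k2(data, order=["A", "B", "C"])
```

On this toy data the search recovers the edge A→B and correctly leaves C parentless, because the likelihood gain from a spurious parent never beats the penalty.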
19
Bayesian Network Learning Algorithms
› Hill-Climbing Search
– Local search, global search
– Global: incremental solution construction
– Local: start with a random solution and optimize towards the optimum

(Figure: Global (right) vs. Local (left) search)
20
Bayesian Network Learning Algorithms
› Taboo Search
– A list of forbidden (recently visited) solutions
– Allow bad moves in order to reveal good solutions
– Avoids local max/min
– Efficient data structures
– Decisions are made along four dimensions:
› Past occurrences
› Frequencies
› Quality
› Impact

(Figure: Taboo Search scheme – initial solution → evaluate possible solutions → update taboo list → stop? → optimal solution)
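The scheme can be sketched on a toy problem: maximize a function over bit-vectors where the all-zeros vector is a local optimum that plain hill climbing cannot leave. The objective and the parameters are made up for illustration:

```python
def f(x):
    # Deceptive objective: all-ones is the global optimum, but every
    # step towards it from all-zeros looks worse.
    return len(x) + 2 if all(x) else x.count(0)

def tabu_search(n=4, iters=50, tabu_size=10):
    current = [0] * n                      # start at the local optimum
    best, best_val = current[:], f(current)
    tabu = []                              # recently visited solutions
    for _ in range(iters):
        neighbors = [current[:k] + [current[k] ^ 1] + current[k + 1:]
                     for k in range(n)]    # flip one bit at a time
        neighbors = [s for s in neighbors if s not in tabu]
        if not neighbors:
            break
        current = max(neighbors, key=f)    # best non-taboo move, even if worse
        tabu.append(current)
        if len(tabu) > tabu_size:
            tabu.pop(0)                    # old entries become allowed again
        if f(current) > best_val:
            best, best_val = current[:], f(current)
    return best, best_val

best, best_val = tabu_search()
```

Because revisiting recent solutions is forbidden, the search is forced downhill away from the all-zeros trap and eventually reaches the global optimum.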
21
Bayesian Network Learning Algorithms
› TAN – Tree Augmented Naïve Bayes
– Tree based
– Conditional Mutual Information
– Edges from the class to the attributes
– Chow-Liu (1968)
› Genetic Algorithm (GA)
– Evolution
– Mutation
– Selection over several generations

(Figure: Genetic Algorithm scheme – initialization → solution generation → change/new solutions → selection → stop? → optimal solution. Source: P. Larranaga et al.)
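The Chow-Liu step behind TAN can be sketched as a maximum spanning tree over pairwise mutual information. For brevity this uses plain (unconditional) MI on toy data; TAN proper uses mutual information conditioned on the class:

```python
import math
from collections import Counter
from itertools import combinations

def mutual_info(rows, i, j):
    """Empirical mutual information between columns i and j."""
    n = len(rows)
    pij = Counter((r[i], r[j]) for r in rows)
    pi = Counter(r[i] for r in rows)
    pj = Counter(r[j] for r in rows)
    return sum(c / n * math.log(c * n / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

def chow_liu_tree(rows, num_cols):
    """Maximum spanning tree over pairwise MI (Kruskal + union-find)."""
    edges = sorted(((mutual_info(rows, i, j), i, j)
                    for i, j in combinations(range(num_cols), 2)),
                   reverse=True)
    parent = list(range(num_cols))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # adding the edge keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy data: column 1 copies column 0, column 2 is independent of both.
rows = [(i % 2, i % 2, i % 3) for i in range(30)]
tree = chow_liu_tree(rows, 3)
```

The strongly dependent pair (columns 0 and 1) is always connected first, which is the tree's whole point: keep only the strongest pairwise dependencies.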
22
Bayesian Network Learning Algorithms
› Simulated Annealing
– Thermodynamics principle
– May end in a local minimum/maximum
› Ordering-Based Search
– Attributes ordering is given
– Each node has a maximum number of parents
– The cardinality of orderings is lower than the cardinality of structures
– There is a map from orderings to structures
23
Classifiers Comparison
› WEKA 3.6, votes.arff, 435 records, 17 attributes, 10 folds

Classifier                       Time       Accurate  Inaccurate  TP   FP  TN   FN
Naïve Bayes                      0.01 sec   90.11%    9.89%       238  29  154  14
J48                              0 sec      96.32%    3.68%       259  8   160  8
IB1                              0 sec      92.41%    7.59%       244  23  158  10
MLP                              1.75 sec   94.71%    5.29%       254  13  158  10
BN, K2, Local                    0.04 sec   90.11%    9.89%       238  29  154  14
BN, K2, Global                   0.01 sec   90.11%    9.89%       238  29  154  14
BN, Hill Climber, Local          0.02 sec   90.34%    9.66%       239  28  154  14
BN, Hill Climber, Global         2.87 sec   94.48%    5.52%       255  12  156  12
BN, Simulated Annealing, Local   1.34 sec   94.94%    5.06%       255  12  158  10
BN, Simulated Annealing, Global  52.04 sec  94.02%    5.98%       254  13  155  13
BN, Taboo Search, Local          0.02 sec   90.34%    9.66%       239  28  154  14
BN, Taboo Search, Global         1.92 sec   93.79%    6.21%       255  12  153  15
BN, TAN, Local                   0.04 sec   94.94%    5.06%       254  13  159  9
BN, TAN, Global                  3.24 sec   95.17%    4.83%       252  15  162  6
24
Classifiers Comparison
› WEKA 3.7, GBPAUD, 37 attributes, 10k records, 33%-66% split

Classifier                      Time        Correctly Classified  Incorrectly Classified
Naïve Bayes                     0.03 sec    63.38%                36.62%
J48                             0.48 sec    98.77%                1.23%
IB1                             0.01 sec    68.79%                31.21%
MLP                             >5 min      ?                     ?
BN, K2, Local                   0.11 sec    64.27%                35.73%
BN, K2, Global                  3.62 sec    64.27%                35.73%
BN, Hill Climber, Local         143.19 sec  62.7353%              37.2647%
BN, Simulated Annealing, Local  >5 min      ?                     ?
BN, Taboo Search, Local         144.19 min  64.4706%              35.5294%
BN, TAN, Local                  >5 min      ?                     ?
25
Optimal Bayesian Network
› Combinatorial optimization
› Inference is difficult if we must visit the whole structure
› Curse of dimensionality
› Feature selection – critical phase
› Attributes ordering – usually must be calculated
› Search space pruning by heuristics
› A priori knowledge, field experts (Wish You Were Here)
26
Summary
› Graphical classification model– Judea Pearl (1988)– Chow-Liu (1968)
› Easily fitted
› Easily interpreted
› Computational limit (as always!)
› Polynomial algorithms?– Time– Memory