1
Optimal Bayesian Networks
Advanced Topics in Computer Science Seminar Supervisor: Dr. Herman Maya
Author: Kreimer Andrew
2
Data Mining
› Massive amounts of data: Petabyte, Terabyte
› Data evolution
› Multidisciplinary field
› Data Warehouse
› OLAP & OLTP
› Preprocessing
› KDD – Knowledge Discovery in Databases
› One truth
3
Data Mining Methods
› Clustering – bank clients: private or business
› Association Rules – YouTube suggestions, Amazon checkout suggestions
› Classification and Prediction – SPAM mail classification, FX trend prediction, payment power prediction
› Integration – a clustered client gets a specific classification model
4
The Bayesian Approach
› Probability & Statistics
– Instances – the classical (frequentist) approach
– A priori/a posteriori knowledge – the Bayesian approach
› Bayes' Theorem
– P(A|B) = P(B|A)P(A)/P(B)
› MAP – Maximum A Posteriori
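The theorem and the MAP rule amount to a few lines of arithmetic. A minimal sketch for a binary hypothesis, with made-up probabilities:

```python
# Hypothetical prior and likelihoods, for illustration only.
p_h = {"A": 0.3, "not A": 0.7}            # priors P(H)
p_b_given_h = {"A": 0.8, "not A": 0.2}    # likelihoods P(B|H)

# Evidence by total probability: P(B) = sum over h of P(B|h) * P(h)
p_b = sum(p_b_given_h[h] * p_h[h] for h in p_h)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
posterior_a = p_b_given_h["A"] * p_h["A"] / p_b

# MAP: pick the hypothesis with the highest posterior. P(B) is the
# same for every hypothesis, so comparing P(B|h) * P(h) is enough.
map_h = max(p_h, key=lambda h: p_b_given_h[h] * p_h[h])
```

Dropping the shared denominator is exactly why MAP classification never needs P(B) explicitly.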
5
Bayesian Classifier
› Describe a client by age and income
› P(X) – probability that a client aged 25 with an income of 5000 exists
› P(H) – probability that a client buys a guitar
› P(X|H) – probability of observing client X given that the client bought a guitar
› P(H|X) – probability that client X buys a guitar
› P(H|X) = P(X|H)P(H)/P(X)
› The Naïve approach – independent variables
6
Naïve Bayes Classifier
› The optimal (full joint) classifier is not practical
› Assumes the variables are independent given the class
› Zero probabilities break the product:
– Laplace – add one dummy record per value
– m-estimate – assume m virtual records
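The zero-probability fixes are easy to see with counts. A small sketch with hypothetical counts where one value was never observed:

```python
from collections import Counter

# Hypothetical counts where one value ("down") was never observed.
counts = Counter({"up": 7, "down": 0, "flat": 3})
values = list(counts)
n = sum(counts.values())  # 10 records

# Raw estimate: P(down) = 0/10 = 0, which zeroes out the whole
# Naive Bayes product no matter what the other attributes say.
raw = {v: counts[v] / n for v in values}

# Laplace correction: pretend every value was seen once more.
laplace = {v: (counts[v] + 1) / (n + len(values)) for v in values}

# m-estimate: assume m virtual records distributed by a prior p
# (uniform here); a generalization of the Laplace correction.
m, p = 3, 1 / len(values)
m_est = {v: (counts[v] + m * p) / (n + m) for v in values}
```

Both corrections keep the estimates summing to 1 while guaranteeing no value gets probability zero.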
7
Classification example using Naïve Bayes Classifier

News in EU  News in US  EU GDP  US GDP  EURUSD
bad         bad         Up      Down    Up
bad         good        Down    Down    Up
good        bad         Up      Up      Down
good        good        Up      Up      Up
bad         bad         Down    Up      Down
good        bad         Down    Up      Down
bad         good        Up      Down    Up
bad         bad         Up      Down    Down
good        good        Up      Up      Up
bad         good        Down    Down    Up

NewsEu, NewsUs ∈ {bad, good}
EuGDP, UsGDP, EURUSD (class) ∈ {Up, Down}

Let's try to classify trends in the FX market using four attributes: news in Europe, news in the US, GDP in Europe and GDP in the US. Each instance is a monthly measurement. The news attributes describe the general market temperament; the GDP attributes describe the change relative to the previous period.
8
Classification example using Naïve Bayes Classifier
› Let's classify a new instance:
– X = (NewsEu = good, NewsUS = bad, EuGDP = up, UsGDP = up)
› We start with the class priors:
– P(EURUSD = Up) = 6/10 = 0.6
– P(EURUSD = Down) = 4/10 = 0.4
› Then we calculate the conditional probabilities:
– P(NewsEu = good | EURUSD = Up) = 2/6 = 0.33
– P(NewsUs = bad | EURUSD = Up) = 1/6 = 0.16
– etc.
9
Classification example using Naïve Bayes Classifier
› The classification (likelihoods):
› P(X|EURUSD = up) = P(NewsEu = good | EURUSD = up) * P(NewsUs = bad | EURUSD = up) * P(EuGDP = up | EURUSD = up) * P(UsGDP = up | EURUSD = up) = 0.33 * 0.16 * 0.66 * 0.33 ≈ 0.0115
› P(X|EURUSD = down) = P(NewsEu = good | EURUSD = down) * P(NewsUs = bad | EURUSD = down) * P(EuGDP = up | EURUSD = down) * P(UsGDP = up | EURUSD = down) = 0.5 * 1 * 0.5 * 0.75 = 0.1875
› Using MAP (each likelihood is multiplied by its class prior; the evidence P(X) is common to both classes and can be dropped):
› max{P(X|EURUSD=up)P(EURUSD=up), P(X|EURUSD=down)P(EURUSD=down)} = max{0.0115 * 0.6, 0.1875 * 0.4} = max{0.0069, 0.075} = 0.075, attained by Down
› Conclusion: the trend is down, so we should sell EURUSD.
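The whole worked example can be reproduced in a few lines. A sketch where the `data` list encodes the ten monthly instances from the table, using exact fractions instead of the rounded slide values:

```python
from collections import Counter

# (NewsEu, NewsUs, EuGDP, UsGDP) -> EURUSD class, from the slide's table.
data = [
    (("bad", "bad", "up", "down"), "up"),
    (("bad", "good", "down", "down"), "up"),
    (("good", "bad", "up", "up"), "down"),
    (("good", "good", "up", "up"), "up"),
    (("bad", "bad", "down", "up"), "down"),
    (("good", "bad", "down", "up"), "down"),
    (("bad", "good", "up", "down"), "up"),
    (("bad", "bad", "up", "down"), "down"),
    (("good", "good", "up", "up"), "up"),
    (("bad", "good", "down", "down"), "up"),
]

def classify(x):
    classes = Counter(c for _, c in data)
    n = sum(classes.values())
    scores = {}
    for c, cnt in classes.items():
        score = cnt / n                            # prior P(c)
        rows = [f for f, cc in data if cc == c]
        for i, v in enumerate(x):                  # naive product of P(x_i | c)
            score *= sum(1 for f in rows if f[i] == v) / cnt
        scores[c] = score
    return max(scores, key=scores.get), scores

label, scores = classify(("good", "bad", "up", "up"))
```

With exact fractions the MAP scores are 0.075 for Down versus about 0.0074 for Up, so the classifier sells EURUSD.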
10
Bayesian Network
› Graphical probabilistic model
› DAG
› CPT for each attribute
› d-separated, d-connected
› A -> D, D -> A
› P(C|A,B,D,E) = P(C|A,B,D)
– E and C are d-separated
11
Probability Inference
› Probability calculation:
› Given A, B, C, D & E, calculate P(A, B, C, D, E):
› Given A, B, D & E, calculate C by using MAP:
› Given A, C, D & E, calculate B by using Bayes' Theorem:
– P(A|B) = P(B|A) * P(A)/P(B) //Bayes' Theorem
– P(B) = P(B,A) + P(B,¬A) = P(B|A) * P(A) + P(B|¬A) * P(¬A)
…
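The joint probability of a full assignment factorizes over the DAG: each node is multiplied in given its parents. A sketch with hypothetical CPTs for a network with edges A→C, B→C, D→C and D→E (the numbers are made up); with this structure E and C are d-separated given D, as on the previous slide:

```python
import itertools

# Hypothetical CPTs (illustrative numbers only).
p_a = {True: 0.4, False: 0.6}
p_b = {True: 0.7, False: 0.3}
p_d = {True: 0.5, False: 0.5}

def p_c(a, b, d):  # P(C = True | A, B, D)
    return 0.9 if (a and b) else (0.6 if d else 0.2)

def p_e(d):        # P(E = True | D)
    return 0.8 if d else 0.1

def joint(a, b, c, d, e):
    """Chain rule over the DAG: multiply each node given its parents."""
    pc, pe = p_c(a, b, d), p_e(d)
    return (p_a[a] * p_b[b] * p_d[d]
            * (pc if c else 1 - pc)
            * (pe if e else 1 - pe))

# The joint must sum to 1 over all 2^5 assignments.
total = sum(joint(*v) for v in itertools.product([True, False], repeat=5))

def p_c_given(a, b, d, e):
    """P(C = True | A, B, D, E) computed from the joint."""
    num = joint(a, b, True, d, e)
    return num / (num + joint(a, b, False, d, e))

# Observing E does not change the belief in C once D is known:
p1 = p_c_given(True, False, True, True)
p2 = p_c_given(True, False, True, False)
```

The equality of `p1` and `p2` is exactly the slide's P(C|A,B,D,E) = P(C|A,B,D): the P(E|D) factor cancels out of the conditional.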
12
Dynamic Bayesian Network
› Bayesian Network extension
› Time slice attributes relations
› Matrix of attributes and time slices
› Time series
› Cycles are allowed (across time slices)

(Figure: nodes X1…X4 unrolled over time, and a matrix with one row per time slice, Time 1 … Time n, and one column per attribute)
13
Bayesian Network Example
› Let’s try to predict trends in EURUSD
› Binary class variable: Up or Down
› Attributes: Open, High, Low, Close, MA100, MA200 (100- and 200-period moving averages)
› Class: ClassTrend
14
Bayesian Network Example
(Figure: the learned network structure, BN, and its conditional probability tables, CPT)
15
Bayesian Network Learning
› Structure is given by a field expert (Wish You Were Here)
› Structure learning – a computational barrier
– The number of possible structures grows super-exponentially with the number of nodes
– Heuristics
– Metrics for evaluating structures: local, global, d-separation
› Conditional Probability Tables calculation
16
Bayesian Network Learning
› Attributes ordering:
– Given an ordering, Xi is a candidate parent of Xj iff Xi comes before Xj
– Possible parents come before the node in the order
› Structure
– DAG

(Figure: the ordering X1, X3, X2 (left) and a compatible structure (right))
17
Network Scoring
› Structures are evaluated by scoring (global/local)
› Bayesian Dirichlet – BD
› BDeu – Bayesian Dirichlet equivalent uniform
› MDL given model M and dataset D:
– Description cost: DL(M, D) = DL(M) + DL(D | M), the cost of describing the model plus the data given the model
– Looking for the minimum (or the maximum of the negated score)
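The trade-off behind MDL can be sketched numerically. In the common formulation, the model cost is (log2 N)/2 bits per free parameter and the data cost is the negated log-likelihood in bits; the log-likelihoods and parameter counts below are illustrative assumptions, not measured values:

```python
import math

def mdl_score(loglik, num_params, num_records):
    """MDL score of a candidate structure: lower is better."""
    model_cost = (math.log2(num_records) / 2) * num_params  # bits to describe M
    data_cost = -loglik / math.log(2)                       # nats -> bits
    return model_cost + data_cost

# A denser structure fits the data better (higher log-likelihood) but
# costs more bits to describe, so MDL can still prefer the sparse one:
sparse = mdl_score(loglik=-620.0, num_params=9, num_records=435)
dense = mdl_score(loglik=-600.0, num_params=60, num_records=435)
```

Here the 20-nat likelihood gain of the dense structure does not pay for its 51 extra parameters, so the sparse structure wins.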
18
Bayesian Network Learning Algorithms
› Gradient Descent
– Structure is given, CPTs are to be calculated
– Some of the a priori probabilities are missing
– Infinitesimal (gradient) approximation
› K2
– Well known
– Greedy algorithm
– Each node has a maximum number of parents
– Parents are added gradually (starting from none)
– Attributes ordering is given
– Look for the structure with the highest score
– Stop when no better structure is found
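The K2 loop above can be sketched in a few lines. The real K2 uses the Bayesian (BD) score; in this sketch a log-likelihood score with an MDL penalty stands in for it, and the data, ordering and parent limit are toy assumptions:

```python
import math
from collections import Counter

def local_score(data, child, parents):
    """Per-node log-likelihood minus an MDL penalty (a stand-in for
    the Bayesian score the real K2 uses)."""
    n = len(data)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    marg = Counter(tuple(r[p] for p in parents) for r in data)
    ll = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    return ll - (math.log2(n) / 2) * len(marg)

def k2(data, order, max_parents=1):
    structure = {}
    for i, child in enumerate(order):
        parents = []
        best = local_score(data, child, parents)
        candidates = list(order[:i])  # only predecessors may be parents
        while len(parents) < max_parents:
            scored = [(local_score(data, child, parents + [c]), c)
                      for c in candidates]
            if not scored or max(scored)[0] <= best:
                break  # no candidate improves the score: stop greedily
            best, chosen = max(scored)
            parents.append(chosen)
            candidates.remove(chosen)
        structure[child] = parents
    return structure

# Toy data: B is a copy of A, C is unrelated to both.
data = [{"A": i % 2, "B": i % 2, "C": 1 if i % 3 == 0 else 0}
        for i in range(40)]
net = k2(data, order=["A", "B", "C"])
```

On this toy data the search recovers the edge A→B and correctly leaves C parentless, because the likelihood gain from a spurious parent never beats the penalty.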
19
Bayesian Network Learning Algorithms
› Hill-Climbing Search
– Local search, global search
– Global: incremental solution construction
– Local: start with a random solution and optimize towards the optimum

(Figure: Global (right) vs. Local (left) search)
20
Bayesian Network Learning Algorithms
› Taboo Search
– A list of forbidden (recently visited) solutions
– Allow bad moves in order to reveal good solutions
– Avoids local max/min
– Efficient data structures
– Decisions are made along four dimensions:
› Past occurrences
› Frequencies
› Quality
› Impact

(Figure: Taboo Search scheme – initial solution → evaluate possible solutions → update taboo list → stop? → optimal solution)
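The scheme can be sketched on a toy problem: maximize a function over bit-vectors where the all-zeros vector is a local optimum that plain hill climbing cannot leave. The objective and the parameters are made up for illustration:

```python
def f(x):
    # Deceptive objective: all-ones is the global optimum, but every
    # step towards it from all-zeros looks worse.
    return len(x) + 2 if all(x) else x.count(0)

def tabu_search(n=4, iters=50, tabu_size=10):
    current = [0] * n                      # start at the local optimum
    best, best_val = current[:], f(current)
    tabu = []                              # recently visited solutions
    for _ in range(iters):
        neighbors = [current[:k] + [current[k] ^ 1] + current[k + 1:]
                     for k in range(n)]    # flip one bit at a time
        neighbors = [s for s in neighbors if s not in tabu]
        if not neighbors:
            break
        current = max(neighbors, key=f)    # best non-taboo move, even if worse
        tabu.append(current)
        if len(tabu) > tabu_size:
            tabu.pop(0)                    # old entries become allowed again
        if f(current) > best_val:
            best, best_val = current[:], f(current)
    return best, best_val

best, best_val = tabu_search()
```

Because revisiting recent solutions is forbidden, the search is forced downhill away from the all-zeros trap and eventually reaches the global optimum.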
21
Bayesian Network Learning Algorithms
› TAN – Tree Augmented Naïve Bayes
– Tree based
– Conditional Mutual Information
– Edges from the class to the attributes
– Chow-Liu (1968)
› Genetic Algorithm (GA)
– Evolution
– Mutation
– Selection over several generations

(Figure: Genetic Algorithm scheme – initialization → solution generation → change/new solutions → selection → stop? → optimal solution. Source: P. Larranaga et al.)
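The Chow-Liu step behind TAN can be sketched as a maximum spanning tree over pairwise mutual information. For brevity this uses plain (unconditional) MI on toy data; TAN proper uses mutual information conditioned on the class:

```python
import math
from collections import Counter
from itertools import combinations

def mutual_info(rows, i, j):
    """Empirical mutual information between columns i and j."""
    n = len(rows)
    pij = Counter((r[i], r[j]) for r in rows)
    pi = Counter(r[i] for r in rows)
    pj = Counter(r[j] for r in rows)
    return sum(c / n * math.log(c * n / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

def chow_liu_tree(rows, num_cols):
    """Maximum spanning tree over pairwise MI (Kruskal + union-find)."""
    edges = sorted(((mutual_info(rows, i, j), i, j)
                    for i, j in combinations(range(num_cols), 2)),
                   reverse=True)
    parent = list(range(num_cols))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # adding the edge keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy data: column 1 copies column 0, column 2 is independent of both.
rows = [(i % 2, i % 2, i % 3) for i in range(30)]
tree = chow_liu_tree(rows, 3)
```

The strongly dependent pair (columns 0 and 1) is always connected first, which is the tree's whole point: keep only the strongest pairwise dependencies.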
22
Bayesian Network Learning Algorithms
› Simulated Annealing
– Thermodynamics principle
– May end in a local minimum/maximum
› Ordering-Based Search
– Attributes ordering is given
– Each node has a maximum number of parents
– The cardinality of orderings is lower than the cardinality of structures
– There is a map from orderings to structures
23
Classifiers Comparison
› WEKA 3.6, votes.arff, 435 records, 17 attributes, 10 folds

Classifier                       Time       Accurate  Inaccurate  TP   FP  TN   FN
Naïve Bayes                      0.01 sec   90.11%    9.89%       238  29  154  14
J48                              0 sec      96.32%    3.68%       259  8   160  8
IB1                              0 sec      92.41%    7.59%       244  23  158  10
MLP                              1.75 sec   94.71%    5.29%       254  13  158  10
BN, K2, Local                    0.04 sec   90.11%    9.89%       238  29  154  14
BN, K2, Global                   0.01 sec   90.11%    9.89%       238  29  154  14
BN, Hill Climber, Local          0.02 sec   90.34%    9.66%       239  28  154  14
BN, Hill Climber, Global         2.87 sec   94.48%    5.52%       255  12  156  12
BN, Simulated Annealing, Local   1.34 sec   94.94%    5.06%       255  12  158  10
BN, Simulated Annealing, Global  52.04 sec  94.02%    5.98%       254  13  155  13
BN, Taboo Search, Local          0.02 sec   90.34%    9.66%       239  28  154  14
BN, Taboo Search, Global         1.92 sec   93.79%    6.21%       255  12  153  15
BN, TAN, Local                   0.04 sec   94.94%    5.06%       254  13  159  9
BN, TAN, Global                  3.24 sec   95.17%    4.83%       252  15  162  6
24
Classifiers Comparison
› WEKA 3.7, GBPAUD, 37 attributes, 10k records, 33%-66% split

Classifier                      Time        Correctly Classified  Incorrectly Classified
Naïve Bayes                     0.03 sec    63.38%                36.62%
J48                             0.48 sec    98.77%                1.23%
IB1                             0.01 sec    68.79%                31.21%
MLP                             >5 min      ?                     ?
BN, K2, Local                   0.11 sec    64.27%                35.73%
BN, K2, Global                  3.62 sec    64.27%                35.73%
BN, Hill Climber, Local         143.19 sec  62.7353%              37.2647%
BN, Simulated Annealing, Local  >5 min      ?                     ?
BN, Taboo Search, Local         144.19 min  64.4706%              35.5294%
BN, TAN, Local                  >5 min      ?                     ?
25
Optimal Bayesian Network
› Combinatorial optimization
› Inference is difficult if we must visit the whole structure
› Curse of dimensionality
› Feature selection – critical phase
› Attributes ordering – usually must be calculated
› Search space pruning by heuristics
› A priori knowledge, field experts (Wish You Were Here)
26
Summary
› Graphical classification model– Judea Pearl (1988)– Chow-Liu (1968)
› Easily fitted
› Easily interpreted
› Computational limit (as always!)
› Polynomial algorithms?– Time– Memory