Bayesian Networks
Martin Bachler (martin.bachler@igi.tugraz.at)
MLA - VO, 06.12.2005
Overview
• "Microsoft's competitive advantage lies in its expertise in Bayesian networks" (Bill Gates, quoted in LA Times, 1996)
Overview
• (Recap of) Definitions
• Naive Bayes
  – Performance/optimality?
  – How important is independence?
  – Linearity?
• Bayesian networks
Definitions
• Conditional probability:

  P(A|B) = P(A,B) / P(B)
  P(B|A) = P(A,B) / P(A)

• Bayes' theorem:

  P(A|B)·P(B) = P(A,B) = P(B|A)·P(A)

  ⇒ P(A|B) = P(B|A)·P(A) / P(B)
Definitions

• Bayes' theorem:

  P(A|B) = P(B|A)·P(A) / P(B)

  – P(B|A) … likelihood
  – P(A) … prior probability
  – P(B) … normalization term
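A quick numeric illustration (the numbers are made up for this recap, not from the slides): with prior $P(A)=0.01$, likelihood $P(B|A)=0.9$ and $P(B|\neg A)=0.1$,

\[
P(B) = P(B|A)P(A) + P(B|\neg A)P(\neg A) = 0.9\cdot 0.01 + 0.1\cdot 0.99 = 0.108,
\]
\[
P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} = \frac{0.009}{0.108} \approx 0.083 .
\]

Even a strong likelihood ratio is pulled down by a small prior; this is exactly what the normalization term accounts for.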
Definitions
• Classification problem
  – Input space X = X1 × X2 × … × Xn
  – Output space Y = {0,1}
  – Target concept C: X → Y
  – Hypothesis space H
• Bayesian way of classifying an instance x = (x1,…,xn):

  h(x1,…,xn) = argmax_{c∈Y} P(c | x1,…,xn)
             = argmax_{c∈Y} P(x1,…,xn | c)·P(c) / P(x1,…,xn)
             = argmax_{c∈Y} P(x1,…,xn | c)·P(c)
Definitions
• Theoretically OPTIMAL!

  h(x1,…,xn) = argmax_{c∈Y} P(x1,…,xn | c)·P(c)

• For large n, the estimation of P(x1,…,xn | c) is very hard!
• ⇒ Assumption: pairwise conditional independence between the input variables given C:

  P(xi, xj | C) = P(xi | C)·P(xj | C)   for i, j = 1,…,n; i ≠ j
Overview
• (Recap of) Definitions
• Naive Bayes
  – Performance/optimality?
  – How important is independence?
  – Linearity?
• Bayesian networks
Naive Bayes
Under this assumption,

  P(xi, xj | C) = P(xi | C)·P(xj | C),   i, j = 1,…,n; i ≠ j,

the class-conditional probability factorizes,

  P(x1, x2, …, xn | C) = P(x1 | C)·P(x2 | C)·…·P(xn | C) = ∏_{i=1}^{n} P(xi | C),

and the classifier becomes

  h(x1,…,xn) = argmax_{c∈C} ∏_{i=1}^{n} P(xi | c)·P(c)
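A minimal Python sketch of this decision rule (my own illustration, not part of the original slides; binary features, probabilities estimated by simple counting):

import numpy as np

def train_nb(X, y):
    """Estimate P(c) and P(xi=1|c) by frequency counts (binary features)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    cond = {c: X[y == c].mean(axis=0) for c in classes}  # cond[c][i] = P(xi=1|c)
    return classes, priors, cond

def predict_nb(x, classes, priors, cond):
    """h(x) = argmax_c P(c) * prod_i P(xi|c)."""
    best, best_score = None, -1.0
    for c in classes:
        p1 = cond[c]
        lik = np.prod(np.where(x == 1, p1, 1.0 - p1))
        if priors[c] * lik > best_score:
            best, best_score = c, priors[c] * lik
    return best

# A dataset consistent with the worked example on the next slide:
X = np.array([[1, 1], [0, 1], [1, 0], [0, 0]])
y = np.array([1, 1, 1, 0])
classes, priors, cond = train_nb(X, y)
print([predict_nb(np.array(q), classes, priors, cond)
       for q in [(1, 1), (1, 0), (0, 1), (0, 0)]])   # -> [1, 1, 1, 0]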
Example
Training data (target concept C = x1 ∨ x2):

  x1  x2  C
  1   1   1
  0   1   1
  1   0   1
  …   …   …

Estimated probabilities:

  P(C=1) = 3/4,  P(C=0) = 1/4
  P(x1=1|C=1) = 2/3,  P(x1=1|C=0) = 0
  P(x2=1|C=1) = 2/3,  P(x2=1|C=0) = 0

Classification with h(x1,…,xn) = argmax_{c∈C} ∏_{i=1}^{n} P(xi|c)·P(c):

  h(1,1) = argmax[ P(x1=1|C=1)·P(x2=1|C=1)·P(C=1),
                   P(x1=1|C=0)·P(x2=1|C=0)·P(C=0) ] = 1
  h(1,0) = argmax[…, …] = 1
  h(0,1) = argmax[…, …] = 1
  h(0,0) = argmax[…, …] = 0
Naive Bayes - Independence
• The independence assumption is very strict!
• For most practical problems it is blatantly wrong! (It does not even hold in the previous example! …see later)

⇒ Is naive Bayes a rather "academic" algorithm?
Naive Bayes - Independence
• For which problems is naive Bayes optimal? (Let's assume for the moment that we can perfectly estimate all necessary probabilities.)
• Guess: for problems for which the independence assumption holds.
• Let's check… (empirically and theoretically)
Independence - Example
The concept C = x1 ∨ x2:

  x1  x2  C   P(x1,x2|C)   P(x1|C)·P(x2|C)   P(x1|C)   P(x2|C)
  0   0   0   1            1                 1         1
  0   0   1   0            1/9               1/3       1/3
  0   1   0   0            0                 1         0
  0   1   1   1/3          2/9               1/3       2/3
  1   0   0   0            0                 0         1
  1   0   1   1/3          2/9               2/3       1/3
  1   1   0   0            0                 0         0
  1   1   1   1/3          4/9               2/3       2/3

(P(xi|C) is evaluated at the row's values; e.g. in row 4, P(x1|C) = P(x1=0|C=1) = 1/3.)
The independence assumption is clearly violated: P(x1,x2|C) ≠ P(x1|C)·P(x2|C).

Independence - Example: C = x1 ∨ x2

[Figure: decision regions of naive Bayes for C = x1 ∨ x2]
Independence - Example
The concept C = x1 ⊕ x2:

  x1  x2  C   P(x1,x2|C)   P(x1|C)·P(x2|C)   P(x1|C)   P(x2|C)
  0   0   0   1/2          1/4               1/2       1/2
  0   0   1   0            1/4               1/2       1/2
  0   1   0   0            1/4               1/2       1/2
  0   1   1   1/2          1/4               1/2       1/2
  1   0   0   0            1/4               1/2       1/2
  1   0   1   1/2          1/4               1/2       1/2
  1   1   0   1/2          1/4               1/2       1/2
  1   1   1   0            1/4               1/2       1/2

Here the factorized estimate P(x1|C)·P(x2|C) = 1/4 is identical for every instance and class, so naive Bayes cannot separate the two classes at all.

Independence - Example: C = x1 ⊕ x2

[Figure: decision regions of naive Bayes for C = x1 ⊕ x2]
Naive Bayes - Independence

[1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996
Naive Bayes - Independence
• The degree of dependence between two attributes given the class can be measured by the conditional mutual information:

  D(xi, xj | C) = H(xi | C) + H(xj | C) − H(xi, xj | C)
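A small Python sketch computing this quantity (my own illustration; as a check, the OR example above with P(C=1) = 3/4 gives a clearly positive value):

import numpy as np

def cond_mutual_info(joint):
    """D(X,Y|C) = H(X|C) + H(Y|C) - H(X,Y|C) in bits.
    joint[x, y, c] is the full joint distribution P(x, y, c)."""
    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    total = 0.0
    for c in range(joint.shape[2]):
        pc = joint[:, :, c].sum()
        if pc == 0:
            continue
        pxy = joint[:, :, c] / pc   # P(x, y | c)
        total += pc * (H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy.ravel()))
    return total

# OR concept, P(C=1) = 3/4 as in the earlier worked example:
joint = np.zeros((2, 2, 2))
joint[0, 0, 0] = 1 / 4
for a, b in [(0, 1), (1, 0), (1, 1)]:
    joint[a, b, 1] = (1 / 3) * (3 / 4)
print(cond_mutual_info(joint))   # ~0.19 bits > 0: x1, x2 dependent given C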
Naive Bayes - Independence
• For which problems is naive Bayes optimal?
• Guess: for problems for which the independence assumption holds.
• Empirical answer: not really…
• Theoretical answer?
Naive Bayes - optimality
• Example: 3 features x1, x2, x3
• P(c=0) = P(c=1)
• x1, x3 independent; x2 = x1 (totally dependent)

⇒ optimal classification:

  h_opt = sgn( P(x1|1)·P(x3|1) − P(x1|0)·P(x3|0) )
        = sgn( P(1|x1)·P(1|x3) − P(0|x1)·P(0|x3) )

naive Bayes (the duplicated feature x2 = x1 is counted twice):

  h_nb = sgn( P(x1|1)²·P(x3|1) − P(x1|0)²·P(x3|0) )
       = sgn( P(1|x1)²·P(1|x3) − P(0|x1)²·P(0|x3) )

[1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, 1996
Naive Bayes - optimality
• Let p = P(1|x1), q = P(1|x3)
• optimal:      h_opt = sgn( p·q − (1−p)·(1−q) )
• naive Bayes:  h_nb = sgn( p²·q − (1−p)²·(1−q) )

[Figure in the (p, q) plane; labels: "independence assumption holds", "optimal and naive classifier disagree only here"]
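To get a feel for how rarely the two rules disagree, one can sample the (p, q) unit square (my own quick check, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
p, q = rng.uniform(size=(2, 1_000_000))

h_opt = np.sign(p * q - (1 - p) * (1 - q))
h_nb = np.sign(p ** 2 * q - (1 - p) ** 2 * (1 - q))

# Fraction of the unit square on which naive Bayes deviates
# from the optimal rule despite the duplicated feature:
print(np.mean(h_opt != h_nb))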
Naive Bayes - optimality
• In general: instance x = (x1,…,xn)
• Let

  p = P(1 | x)
  r = P(1) / P(x) · ∏_{i=1}^{n} P(xi | 1)
  s = P(0) / P(x) · ∏_{i=1}^{n} P(xi | 0)

  (p is the true posterior of class 1; r and s are the naive Bayes estimates of the posteriors of classes 1 and 0.)

Theorem 1: The naive Bayesian classifier is optimal for x iff

  ( p ≥ 1/2 ∧ r ≥ s )  ∨  ( p ≤ 1/2 ∧ r ≤ s )
Naive Bayes - optimality
[Figure: the region of optimality of naive Bayes; the independence assumption holds only in a small part of it]
Naive Bayes - optimality
• This is a criterion for local optimality (for a single instance).
• What about global optimality?

Theorem 2: The naive Bayesian classifier is globally optimal for a dataset S iff for every x ∈ S:

  ( p_x ≥ 1/2 ∧ r_x ≥ s_x )  ∨  ( p_x ≤ 1/2 ∧ r_x ≤ s_x )
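Both conditions translate directly into code (a plain transcription, with (p, r, s) as defined above):

def nb_locally_optimal(p, r, s):
    """Theorem 1: naive Bayes is optimal for an instance x iff
    (p >= 1/2 and r >= s) or (p <= 1/2 and r <= s)."""
    return (p >= 0.5 and r >= s) or (p <= 0.5 and r <= s)

def nb_globally_optimal(instances):
    """Theorem 2: globally optimal for S iff locally optimal for every x in S.
    instances: iterable of (p_x, r_x, s_x) triples."""
    return all(nb_locally_optimal(p, r, s) for p, r, s in instances)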
Naive Bayes - optimality
• What is the reason for this?
  – The difference between classification and probability (distribution) estimation:
    for classification, a perfect estimation of the probabilities is not important, as long as for each instance the maximum estimate corresponds to the maximum true probability.
• Problem with this result: how can global optimality (i.e. optimality for all instances) be verified?
Naive Bayes - optimality
• For which problems is naive Bayes optimal?
• Guess: for problems for which the independence assumption holds.
• Empirical answer: not really…
• Theoretical answer no. 1: for all problems for which Theorem 2 holds.
Naive Bayes - linearity
• Another question: how does naive Bayes' hypothesis depend on the input variables?
• Consider the simple case of binary variables only…
• It can be shown (e.g. [2]) that in binary domains naive Bayes is LINEAR in the input variables!

[2] Duda, Hart: Pattern Classification and Scene Analysis, Wiley, 1973
Naive Bayes - linearity
• Proof…
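The slide leaves the proof open; a sketch of the standard log-odds argument (notation mine): for $x_i \in \{0,1\}$ one can write $\log P(x_i\,|\,c) = x_i \log P(x_i{=}1\,|\,c) + (1-x_i)\log P(x_i{=}0\,|\,c)$, so

\[
\log\frac{P(1)\prod_i P(x_i\,|\,1)}{P(0)\prod_i P(x_i\,|\,0)}
= b + \sum_{i=1}^{n} w_i\, x_i ,
\]
with
\[
w_i = \log\frac{P(x_i{=}1\,|\,1)\,P(x_i{=}0\,|\,0)}{P(x_i{=}0\,|\,1)\,P(x_i{=}1\,|\,0)},
\qquad
b = \log\frac{P(1)}{P(0)} + \sum_{i=1}^{n}\log\frac{P(x_i{=}0\,|\,1)}{P(x_i{=}0\,|\,0)} .
\]

Naive Bayes predicts class 1 exactly when this affine function of x is positive, i.e. its decision boundary is a hyperplane.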
Naive Bayes – linearity - examples
[Figure: decision boundaries of naive Bayes and of a perceptron on example data]
Naive Bayes – linearity - examples

[Figure: further examples]
Naive Bayes - linearity
• For boolean domains, naive Bayes' hypothesis is a linear hyperplane!
⇒ It can only be globally optimal for linearly separable problems!
BUT: it is not optimal for all linearly separable problems! (e.g. not for certain m-out-of-n concepts)
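A small experiment (my own sketch) that illustrates the last point: compute naive Bayes exactly under the uniform distribution on {0,1}^n for the (linearly separable) m-out-of-n concepts and report where it fails to reproduce them. (Ties in the argmax are resolved towards class 0 here.)

from itertools import product

def nb_matches_m_of_n(m, n):
    """Exact naive Bayes for C(x) = [sum(x) >= m] under the uniform
    distribution on {0,1}^n; True iff it reproduces the concept."""
    points = list(product([0, 1], repeat=n))
    pos = [x for x in points if sum(x) >= m]
    neg = [x for x in points if sum(x) < m]
    p_c1, p_c0 = len(pos) / len(points), len(neg) / len(points)
    p1 = [sum(x[i] for x in pos) / len(pos) for i in range(n)]  # P(xi=1|1)
    p0 = [sum(x[i] for x in neg) / len(neg) for i in range(n)]  # P(xi=1|0)
    for x in points:
        s1, s0 = p_c1, p_c0
        for i in range(n):
            s1 *= p1[i] if x[i] else 1 - p1[i]
            s0 *= p0[i] if x[i] else 1 - p0[i]
        if (s1 > s0) != (sum(x) >= m):
            return False
    return True

for n in range(2, 9):
    for m in range(1, n + 1):
        if not nb_matches_m_of_n(m, n):
            print(f"naive Bayes fails on {m}-out-of-{n}")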
Naive Bayes - optimality
• For which problems is naive Bayes optimal?
• Guess: for problems for which the independence assumption holds.
• Empirical answer: not really…
• Theoretical answer no. 1: for all problems for which Theorem 2 holds.
• Theoretical answer no. 2: for a (large) subset of the set of linearly separable problems.
Naive Bayes - optimality
[Figure: the class of concepts for which naive Bayes is optimal, inside the class of concepts for which the perceptron is optimal]
Overview
• (Recap of) Definitions
• Naive Bayes
  – Performance/optimality?
  – How important is independence?
  – Linearity?
• Bayesian networks
Bayesian networks
• The class of problems for which naive Bayes is optimal is quite small…
• Idea: relax the independence assumption to obtain a more general classifier, i.e. model conditional dependencies between the variables.
• There are different techniques for this (e.g. hidden variables, …).
• The most established one: Bayesian networks.
Bayesian networks
• Bayesian network:
  – a tool for representing statistical dependencies between a set of random variables
  – an acyclic directed graph
  – one vertex for each variable
  – for each pair of statistically dependent variables there is an edge in the graph between the corresponding vertices
  – variables (vertices) that are not connected are independent!
  – each vertex has a table of local probability distributions
Bayesian networks
• Each variable depends only on its parents in the network!

[Figure: example network with class y and variables x1,…,x5; y points to every xi, and additionally x2 → x1, x3 → x4 and x4 → x5; the "parents" of x4 are Pa4 = {y, x3}]

  P(xi | xl, Pai) = P(xi | Pai)   for all xl ∈ {x1,…,xn} \ ({xi} ∪ Pai)
Bayesian networks
Bayesian network based classifier:

  h(x1,…,xn) = argmax_{c∈C} ∏_{i=1}^{n} P(xi | c, Pai) · P(c)

For the example network above:

  h(x1,…,xn) = argmax_y [ P(x1|x2,y) · P(x2|y) · P(x3|y) · P(x4|x3,y) · P(x5|x4,y) · P(y) ]
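A minimal sketch of this scoring rule in Python (my own illustration; the CPT layout is an assumption, not from the slides):

def bn_classify(x, classes, prior, parents, cpt):
    """x: dict var -> value; parents: dict var -> tuple of parent vars
    (class excluded); cpt[var][(c, parent_values, own_value)] gives
    P(own_value | c, parent_values)."""
    best, best_score = None, -1.0
    for c in classes:
        score = prior[c]
        for var, pa in parents.items():
            pa_vals = tuple(x[p] for p in pa)
            score *= cpt[var][(c, pa_vals, x[var])]
        if score > best_score:
            best, best_score = c, score
    return best

# Structure of the example network above:
parents = {"x1": ("x2",), "x2": (), "x3": (), "x4": ("x3",), "x5": ("x4",)}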
Bayesian networks
• In the case of boolean attributes this is again linear, but not in the input variables:

  h(x) = argmax_{c∈Y} ∏_{i=1}^{n} P(xi | c, Pai) · P(c)

• It is linear in product features:

  h(x) = sgn( ∑_{i=1}^{n} wi · [ xi · Pai¹ · … · Pai^mi ] + b ),

  where Pai¹,…,Pai^mi denote the (boolean) parents of xi.
Bayesian networks
• The difficulty here is to estimate the correct network structure (and the probability parameters) from the training data!
• For general Bayesian networks this problem is NP-hard!
• There exist numerous heuristics for learning Bayesian networks from data!
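One classic such heuristic, as an illustration (my own sketch; the slides do not name a specific method): the Chow-Liu algorithm restricts the structure to a tree and connects the variable pairs with the highest mutual information by a maximum-weight spanning tree.

import numpy as np
from itertools import combinations

def mutual_info(a, b):
    """Empirical mutual information (nats) between two binary columns."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            pab = np.mean((a == va) & (b == vb))
            if pab > 0:
                mi += pab * np.log(pab / (np.mean(a == va) * np.mean(b == vb)))
    return mi

def chow_liu_edges(X):
    """Maximum-weight spanning tree over pairwise MI (Kruskal + union-find)."""
    n = X.shape[1]
    weights = sorted(((mutual_info(X[:, i], X[:, j]), i, j)
                      for i, j in combinations(range(n), 2)), reverse=True)
    parent = list(range(n))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = []
    for _, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
    return edges  # undirected tree edges; orient away from a chosen root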
References

[1] Domingos, Pazzani: Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. Proc. 13th International Conference on Machine Learning (ICML), 1996.
[2] Duda, Hart: Pattern Classification and Scene Analysis. Wiley, 1973.