Conditional Independence Structures and Polyhedra

Milan Studeny

Institute of Information Theory and Automation of the ASCR

Prague, Czech Republic

Algebraic Statistics, MSRI, UC Berkeley

December 15, 2008, 9:30

M. Studeny (Prague) CI structures and polyhedra December 15, 2008 1 / 38

Summary of the talk

1 Motivation

2 Conditional independence inference

3 Simple algebraic method for CI description

4 Independence implication

5 Geometric point of view

6 Algorithms for testing independence implication

7 Learning graphical CI structures

8 Conclusions


Motivation: conditional independence structures

Conditional independence (CI) is a crucial notion in several fields:

(Markov) random processes,

statistics (contingency tables, multivariate statistical analysis),

probabilistic reasoning,

calculi for dealing with knowledge and uncertainty in AI.

The motivation for my (long-term) research effort has been the (mathematical) description of probabilistic CI structures.

The traditional methods for the description of CI structures use graphs.

The disadvantage of the graphical approaches is that they are able to describe only a very small portion of probabilistic CI structures.

This led me to propose a simple algebraic method, which is able to describe all (discrete, regular Gaussian) probabilistic CI structures.


Method of structural imsets

The original motivation source for this approach was given by some tools of information theory (entropy-based measures of stochastic dependence).

More specifically, one can characterize every CI statement by a linear algebraic equality for the values of the multiinformation function.

The abstraction of the information-theoretical approach led to the idea to describe probabilistic CI structures by certain special integral vectors, called structural imsets (Studeny 2005).

The advantages of this method are as follows:

it allows one to describe every probabilistic CI structure,

it offers an elementary algebraic tool for the verification of implications between CI statements,

it has some potential to be applied to statistical learning of CI structures, specifically some graphical models.


Probabilistic conditional independence

N . . . finite non-empty set of variables

XY . . . union of disjoint subsets X, Y ⊆ N

Definition (conditional independence)

Let 〈X,Y|Z〉 be a triplet of pairwise disjoint subsets of N. Let P be a discrete probability distribution over N. We say that X is conditionally independent of Y given Z with respect to P and write X ⊥⊥ Y | Z [P] if

P(x|yz) = P(x|z) for configurations x, y, z for X, Y, Z with P(yz) > 0.

A statement of this form will be called a CI statement.

Definition (conditional independence structure)

By the CI structure induced by a (discrete) probability distribution P over N is meant the collection of valid CI statements with respect to P.
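A minimal sketch of how the definition can be checked numerically, assuming the distribution is stored as a table from configurations to probabilities. The names `marginal` and `is_ci` and the toy Markov chain a → b → c are illustrative assumptions, not part of the talk. The test uses the equivalent product form P(xyz) · P(z) = P(xz) · P(yz), which avoids dividing by zero.

```python
from collections import defaultdict
from itertools import product

def marginal(joint, variables, subset):
    """Marginalize a table {configuration tuple: probability} onto `subset`."""
    idx = [variables.index(v) for v in subset]
    out = defaultdict(float)
    for config, p in joint.items():
        out[tuple(config[i] for i in idx)] += p
    return out

def is_ci(joint, variables, X, Y, Z, tol=1e-12):
    """Test X _||_ Y | Z [P] via P(xyz) * P(z) == P(xz) * P(yz) for all configurations."""
    xyz = X + Y + Z
    p_xyz = marginal(joint, variables, xyz)
    p_xz = marginal(joint, variables, X + Z)
    p_yz = marginal(joint, variables, Y + Z)
    p_z = marginal(joint, variables, Z)
    values = [sorted({c[variables.index(v)] for c in joint}) for v in xyz]
    for config in product(*values):
        x = config[:len(X)]
        y = config[len(X):len(X) + len(Y)]
        z = config[len(X) + len(Y):]
        lhs = p_xyz.get(config, 0.0) * p_z.get(z, 0.0)
        rhs = p_xz.get(x + z, 0.0) * p_yz.get(y + z, 0.0)
        if abs(lhs - rhs) > tol:
            return False
    return True

# Toy distribution (an assumption for illustration): binary Markov chain a -> b -> c.
variables = ["a", "b", "c"]
joint = {
    (a, b, c): 0.5 * (0.9 if b == a else 0.1) * (0.8 if c == b else 0.2)
    for a, b, c in product([0, 1], repeat=3)
}

chain_holds = is_ci(joint, variables, ["a"], ["c"], ["b"])   # a _||_ c | b
marginal_holds = is_ci(joint, variables, ["a"], ["c"], [])   # a _||_ c | (empty set)
```

For this chain the first statement holds and the second does not, so the induced CI structure contains a ⊥⊥ c | b but not a ⊥⊥ c | ∅.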


Conditional independence inference

We address the following inference problem.

Let L be a set of CI statements over N, called an input list, and t another CI statement.

Does L imply t?

More formally, is it true that, for any discrete distribution P for which all statements in L hold, necessarily t holds as well?

This is the probabilistic implication of those CI statements.


Some history: semi-graphoids

The above inference problem was implicitly raised by Pearl (1988), who came up with a conjecture that (discrete) CI structures can be characterized by the well-known semi-graphoid properties or axioms:

Symmetry X ⊥⊥ Y | Z ⇒ Y ⊥⊥ X | Z,

Decomposition X ⊥⊥ YW | Z ⇒ X ⊥⊥ W | Z,

Weak union X ⊥⊥ YW | Z ⇒ X ⊥⊥ Y | WZ,

Contraction X ⊥⊥ Y | WZ & X ⊥⊥ W | Z ⇒ X ⊥⊥ YW | Z.

(X, Y, Z, W ⊆ N are pairwise disjoint)

These basic formal properties of CI structures have been mentioned earlier (Dawid 1979), (Spohn 1980), (Mouchart, Rolin 1984).

Semi-graphoids became a research topic: (Matus 2002), (Dawid 2001), (Strausz 2004), (Morton, Pachter, Shiu, Sturmfels, Wienand 2006), (Hemmecke, Morton, Shiu, Sturmfels, Wienand 2008).


Multiinformation

However, semi-graphoid properties do not cover probabilistic implication (Studeny 1989). This can be shown easily by information-theoretical tools.

Definition (multiinformation, multiinformation function)

Let P be a discrete probability distribution over N.

The multiinformation of P is the relative entropy of P with respect to the product of its one-dimensional marginal distributions:

$$M(P) = H\Big(P \,\Big|\, \prod_{i \in N} P^{\{i\}}\Big).$$

The multiinformation function induced by P is a function m_P : P(N) → R which ascribes to every subset of N the multiinformation of the corresponding marginal distribution: m_P(A) ≡ M(P^A) for A ⊆ N.

Multiinformation is a measure of joint stochastic dependence (Perez 1977), (Ay, Knauf 2006).


Multiinformation function and conditional independence

Lemma (basic property of the multiinformation function)

Let P be a discrete probability distribution over N and 〈X,Y|Z〉 a triplet of pairwise disjoint subsets of N. Then

m_P(XYZ) + m_P(Z) − m_P(XZ) − m_P(YZ) ≥ 0.

In particular, m_P is a (standardized) supermodular function.

Moreover, the equality occurs iff the corresponding CI statement holds:

m_P(XYZ) + m_P(Z) − m_P(XZ) − m_P(YZ) = 0 ⇔ X ⊥⊥ Y | Z [P].
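The lemma can be tried out numerically. The following is a sketch under stated assumptions: the table representation, the helper names `marginal`, `multiinformation` and `ci_gap`, and the toy Markov chain a → b → c are all illustrative choices, and natural logarithms are used for the relative entropy.

```python
from collections import defaultdict
from itertools import product
from math import log

def marginal(joint, variables, subset):
    """Marginalize a table {configuration tuple: probability} onto `subset`."""
    idx = [variables.index(v) for v in subset]
    out = defaultdict(float)
    for config, p in joint.items():
        out[tuple(config[i] for i in idx)] += p
    return out

def multiinformation(joint, variables, subset):
    """m_P(A): relative entropy of the marginal on A w.r.t. the product of its one-dimensional marginals."""
    subset = list(subset)
    p_a = marginal(joint, variables, subset)
    singles = [marginal(joint, variables, [v]) for v in subset]
    total = 0.0
    for config, p in p_a.items():
        if p > 0:
            prod = 1.0
            for i, single in enumerate(singles):
                prod *= single[(config[i],)]
            total += p * log(p / prod)
    return total

def ci_gap(joint, variables, X, Y, Z):
    """Left-hand side of the lemma: m(XYZ) + m(Z) - m(XZ) - m(YZ)."""
    m = lambda A: multiinformation(joint, variables, A)
    return m(X + Y + Z) + m(Z) - m(X + Z) - m(Y + Z)

# Toy distribution (an assumption for illustration): binary Markov chain a -> b -> c.
variables = ["a", "b", "c"]
joint = {
    (a, b, c): 0.5 * (0.9 if b == a else 0.1) * (0.8 if c == b else 0.2)
    for a, b, c in product([0, 1], repeat=3)
}

gap_given_b = ci_gap(joint, variables, ["a"], ["c"], ["b"])  # vanishes: a _||_ c | b
gap_marginal = ci_gap(joint, variables, ["a"], ["c"], [])    # strictly positive
```

The gap vanishes (up to rounding) exactly for the statement that holds in the chain, and is strictly positive for the one that fails.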


Example: conditional independence implication

The following property is not derivable from the semi-graphoid properties:

A ⊥⊥ B | C & A ⊥⊥ C | D & A ⊥⊥ D | B =⇒ A ⊥⊥ C | B.

Indeed, the assumptions say:

$$
\begin{aligned}
m(ABC) + m(C) - m(AC) - m(BC) &= 0,\\
m(ACD) + m(D) - m(AD) - m(CD) &= 0,\\
m(ABD) + m(B) - m(AB) - m(BD) &= 0,
\end{aligned}
$$

so the sum of the three left-hand sides is 0.

Now, one can re-arrange the expression by exchanging the terms m(B), m(C), m(D) between the rows:

$$
\underbrace{m(ABC) + m(B) - m(AB) - m(BC)}_{\geq 0 \ (AB,\, BC)}
+ \underbrace{m(ACD) + m(C) - m(AC) - m(CD)}_{\geq 0 \ (AC,\, CD)}
+ \underbrace{m(ABD) + m(D) - m(AD) - m(BD)}_{\geq 0 \ (AD,\, BD)} = 0 .
$$

Each of the three terms is non-negative by the supermodularity of m, applied to the indicated pair of sets. Since they sum to 0, each of them equals 0.

But m(ABC) + m(B) − m(AB) − m(BC) = 0 means A ⊥⊥ C | B!


Abstraction of the example

Every CI statement is represented by a formal expression:

A ⊥⊥ B | C −→ δ_ABC + δ_C − δ_AC − δ_BC ≡ u〈A,B|C〉

A ⊥⊥ C | D −→ δ_ACD + δ_D − δ_AD − δ_CD ≡ u〈A,C|D〉

A ⊥⊥ D | B −→ δ_ABD + δ_B − δ_AB − δ_BD ≡ u〈A,D|B〉

A ⊥⊥ C | B −→ δ_ABC + δ_B − δ_AB − δ_BC ≡ u〈A,C|B〉

The core of our consideration was, in fact, this equality:

$$
\begin{aligned}
&(\delta_{ABC} + \delta_{C} - \delta_{AC} - \delta_{BC}) + (\delta_{ACD} + \delta_{D} - \delta_{AD} - \delta_{CD}) + (\delta_{ABD} + \delta_{B} - \delta_{AB} - \delta_{BD})\\
={}& (\delta_{ABC} + \delta_{B} - \delta_{AB} - \delta_{BC}) + (\delta_{ACD} + \delta_{C} - \delta_{AC} - \delta_{CD}) + (\delta_{ABD} + \delta_{D} - \delta_{AD} - \delta_{BD}) .
\end{aligned}
$$

This is an elegant way to show this:

{ A ⊥⊥ B | C & A ⊥⊥ C | D & A ⊥⊥ D | B }

⇐⇒ { A ⊥⊥ C | B & A ⊥⊥ D | C & A ⊥⊥ B | D }.
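The equality behind this equivalence is easy to check mechanically. A minimal sketch, assuming an imset is stored as a dictionary from subsets (frozensets) to non-zero integers; the helper names `combine` and `u` are illustrative, not from the talk.

```python
def combine(*imsets):
    """Sum of imsets stored as {frozenset: int}; zero entries are dropped."""
    total = {}
    for imset in imsets:
        for A, c in imset.items():
            total[A] = total.get(A, 0) + c
    return {A: c for A, c in total.items() if c != 0}

def u(X, Y, Z):
    """Imset of the statement X _||_ Y | Z: delta_XYZ + delta_Z - delta_XZ - delta_YZ."""
    X, Y, Z = frozenset(X), frozenset(Y), frozenset(Z)
    return combine({X | Y | Z: 1}, {Z: 1}, {X | Z: -1}, {Y | Z: -1})

# Left-hand side: the three assumptions; right-hand side: the three conclusions.
lhs = combine(u("A", "B", "C"), u("A", "C", "D"), u("A", "D", "B"))
rhs = combine(u("A", "C", "B"), u("A", "D", "C"), u("A", "B", "D"))
same = (lhs == rhs)
```

Both sides reduce to the same imset, with coefficient +1 on ABC, ACD, ABD, B, C, D and −1 on AB, AC, AD, BC, BD, CD.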


Simple algebraic approach: imset

N ... a finite set of variables

P(N) ≡ {A; A ⊆ N} ... the power set of N

Definition (imset)

An imset u is a function u : P(N) → Z.

imset = an abbreviation for integer-valued multiset

We will regard it as a vector whose components are integers and are indexed by subsets of N.

Actually, any real function m : P(N) → R will be interpreted as a (real) vector in the same way. The symbol 〈m, u〉 will then denote the scalar product of two vectors of this type:

$$\langle m, u \rangle \equiv \sum_{A \subseteq N} m(A) \cdot u(A) .$$


Elementary imsets

Given A ⊆ N, the symbol δ_A will denote a special imset given by:

$$\delta_A(B) = \begin{cases} 1 & \text{if } B = A,\\ 0 & \text{if } B \neq A, \end{cases} \qquad \text{for } B \subseteq N.$$

Definition (translation of a CI statement, an elementary imset)

Given a CI statement X ⊥⊥ Y | Z, the corresponding imset will be

u〈X,Y|Z〉 ≡ δ_XYZ + δ_Z − δ_XZ − δ_YZ.

By an elementary imset is meant an imset of the form

u〈a,b|C〉 = δ_{a,b}∪C + δ_C − δ_{a}∪C − δ_{b}∪C,

where C ⊆ N and a, b ∈ N \ C are distinct.

The class of elementary imsets over N will be denoted by E(N).
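For small N the class E(N) can simply be listed. A minimal sketch, assuming imsets are stored as dictionaries from frozensets to integers (the function name `elementary_imsets` is an illustrative choice): one imset is produced for every unordered pair {a, b} and every C ⊆ N \ {a, b}.

```python
from itertools import combinations

def elementary_imsets(N):
    """List all u<a,b|C> over N as dicts {frozenset: int}."""
    N = sorted(N)
    result = []
    for a, b in combinations(N, 2):
        rest = [x for x in N if x not in (a, b)]
        for r in range(len(rest) + 1):
            for C in combinations(rest, r):
                C = frozenset(C)
                # delta_{a,b}uC + delta_C - delta_{a}uC - delta_{b}uC
                result.append({C | {a, b}: 1, C: 1, C | {a}: -1, C | {b}: -1})
    return result

counts = {n: len(elementary_imsets("abcde"[:n])) for n in (2, 3, 4, 5)}
```

By construction the count is (|N| choose 2) · 2^(|N|−2), that is 1, 6, 24 and 80 elementary imsets for |N| = 2, 3, 4, 5.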


Combinatorial and structural imsets

Definition (combinatorial imset)

A combinatorial imset is an imset u which is a combination of elementary imsets with non-negative integer coefficients (= can be written as their sum):

$$u = \sum_{v \in \mathcal{E}(N)} k_v \cdot v \quad \text{for some } k_v \in \mathbb{Z}_+ .$$

The class of combinatorial imsets over N will be denoted by C(N).

Definition (structural imsets)

A structural imset is an imset u such that its multiple by a positive natural number can be decomposed into elementary imsets:

$$n \cdot u = \sum_{v \in \mathcal{E}(N)} k_v \cdot v \quad \text{for some } n \in \mathbb{N} \text{ and } k_v \in \mathbb{Z}_+ .$$

The class of structural imsets over N will be denoted by S(N).


Independence implication

Every structural imset induces a whole CI structure through a special (linear) algebraic criterion.

Definition

Suppose u, v ∈ S(N) are structural imsets. We say that u independence implies v and write u ⇀ v if

∃ k ∈ ℕ such that k · u − v ∈ S(N).

Specifically, each u ∈ S(N) describes the following formal CI structure:

{X ⊥⊥ Y | Z ; u ⇀ u〈X,Y|Z〉}.

Every probabilistic CI structure can be described in this way, but not conversely. These formal CI structures are called structural semi-graphoids.


Algebraic method for ensuring probabilistic implication

Given an input list L of CI statements, they are “translated” to imsets and the corresponding imset representing L is obtained by summing:

$$u_L = \sum_{t \in L} u_t .$$

Revisiting the example

A ⊥⊥ B | C −→ u〈A,B|C〉

A ⊥⊥ C | D −→ u〈A,C|D〉

A ⊥⊥ D | B −→ u〈A,D|B〉

Σ −→ u_L

A kind of standard technique for (algebraic) inference of a CI statement X ⊥⊥ Y | Z from an input list L is based on the following property:

If u_L ⇀ u〈X,Y|Z〉 then L probabilistically implies X ⊥⊥ Y | Z.

In the example u_L = u〈A,B|C〉 + u〈A,C|D〉 + u〈A,D|B〉, X ⊥⊥ Y | Z ∼ A ⊥⊥ C | B.

We put k ≡ 1; then k · u_L − u〈X,Y|Z〉 = u〈A,B|C〉 + u〈A,C|D〉 + u〈A,D|B〉 − u〈A,C|B〉 = u〈A,D|C〉 + u〈A,B|D〉 is a combinatorial imset, and therefore, a structural imset.

Note this is only a sufficient condition for probabilistic implication.


Geometric point of view I

Let us imagine imsets as the points in the Euclidean space R^{P(N)}.

Definition

Let R(N) denote the cone generated by the set of elementary imsets.

It is a pointed rational cone of the dimension 2^{|N|} − |N| − 1.

We can re-interpret the above concepts in terms of convex geometry (Barvinok 2002):

the set of elementary imsets E(N) is the set of generators of R(N),

the set of combinatorial imsets C(N) is the set of elements of the semi-group generated by E(N),

the set of structural imsets S(N) is the set of lattice points in R(N) (= vectors having integers as components).


Geometric point of view II

CI structures (induced by structural imsets) correspond to faces of the cone R(N).

More specifically, given u ∈ S(N), the corresponding CI structure {X ⊥⊥ Y | Z ; u ⇀ u〈X,Y|Z〉} can be identified with the face F_u generated by u (= the least face of R(N) containing u).

Independence implication u ⇀ v, where u, v ∈ S(N), means then, in the geometric sense, that F_v ⊆ F_u (≡ v ∈ F_u). Thus:

F_u ∩ S(N) = {v ∈ S(N); u ⇀ v} for any u ∈ S(N).

In other words, the poset of structural semi-graphoids is isomorphic to the face lattice of R(N)!

Actually, my original source of inspiration for introducing the independence implication was quite close to this geometric interpretation.


Example: the face lattice of R(N) for |N | = 3.

[Figure: diagram of the face lattice of R(N) for N = {a, b, c}. The labelled nodes are the six CI statements a ⊥⊥ b | ∅, a ⊥⊥ c | ∅, b ⊥⊥ c | ∅, a ⊥⊥ b | c, a ⊥⊥ c | b and b ⊥⊥ c | a.]

Dual geometric view

Observation

The dual cone to the cone R(N) is the cone K(N) of supermodular functions, that is, functions m : P(N) → R such that

m(X ∪ Y) + m(X ∩ Y) − m(X) − m(Y) ≥ 0 for all X, Y ⊆ N.

K(N) contains the linear subspace of modular functions of the dimension |N| + 1.

After the restriction to a suitable complementary linear subspace (= standardization), the resulting class of standardized supermodular functions Kℓ(N) becomes a pointed rational cone of the dimension 2^{|N|} − |N| − 1. The face lattice for R(N) is anti-isomorphic to the face lattice for Kℓ(N).

Thus, CI structures (induced by structural imsets) also correspond (anti-isomorphically) to faces of the cone Kℓ(N). In particular, the co-atomic CI structures correspond to the extreme rays of Kℓ(N)!


Example: the face lattice of Kℓ(N) for |N | = 3.

[Figure: diagram of the face lattice of Kℓ(N) for N = {a, b, c}. The labelled nodes are the five functions δ_N + δ_ab, δ_N + δ_ac, δ_N + δ_bc, δ_N and 2 · δ_N + δ_ab + δ_ac + δ_bc.]


Towards algorithms for testing independence implication

The geometric interpretation is particularly useful when one is trying to design algorithms for computer testing of implications between CI statements.

There are at least three possible methodological approaches to (computer) testing the independence implication:

1 a method based on the characterization of the extreme rays of the cone Kℓ(N), which leads to a falsification algorithm,

2 methods based on the characterization of lattice points in the cone R(N), which lead to verification algorithms,

3 linear programming approaches.

Note that in (Bouckaert, Studeny 2007) a kind of hybrid algorithm for testing u_L ⇀ u〈X,Y|Z〉 was proposed. However, this algorithm is not guaranteed to give a decisive answer in a fixed time.


The first approach: falsification algorithm

Observation

Let K⋄ℓ(N) denote the normalized integral representatives of the extreme rays of Kℓ(N). Then, ∀ u, v ∈ S(N) one has

u ⇀ v iff ∀ m ∈ K⋄ℓ(N) 〈m, v〉 > 0 ⇒ 〈m, u〉 > 0.

The number of elements of K⋄ℓ(N) depends on |N|. In principle, they can be computed as facet-defining inequalities for the cone R(N).

|N|          2    3    4    5
|K⋄ℓ(N)|     1    5    37   117978

The problem is that a general characterization of the extreme rays of Kℓ(N) is not available. Computing them for |N| = 6 looks like a hard problem.

Nevertheless, to disprove u_L ⇀ u〈X,Y|Z〉 it suffices to find a supermodular function m such that 〈m, u_L〉 = 0 and 〈m, u〈X,Y|Z〉〉 > 0.
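For |N| = 3 the criterion can be run by hand or in a few lines of code. The sketch below assumes that the five functions shown in the face-lattice example for Kℓ(N) are the representatives K⋄ℓ(N) for N = {a, b, c}; the dictionary representation and the names `RAYS_3`, `scalar`, `is_supermodular` and `implies` are illustrative choices.

```python
from itertools import combinations

fs = frozenset

# Assumed representatives of the extreme rays of K_l(N) for N = {a, b, c}.
RAYS_3 = [
    {fs("abc"): 1},
    {fs("abc"): 1, fs("ab"): 1},
    {fs("abc"): 1, fs("ac"): 1},
    {fs("abc"): 1, fs("bc"): 1},
    {fs("abc"): 2, fs("ab"): 1, fs("ac"): 1, fs("bc"): 1},
]

def u(X, Y, Z):
    """Imset u<X,Y|Z> as a dict {frozenset: int}."""
    X, Y, Z = fs(X), fs(Y), fs(Z)
    out = {}
    for A, c in ((X | Y | Z, 1), (Z, 1), (X | Z, -1), (Y | Z, -1)):
        out[A] = out.get(A, 0) + c
    return {A: c for A, c in out.items() if c != 0}

def scalar(m, imset):
    """Scalar product <m, u>; missing values of m count as 0."""
    return sum(m.get(A, 0) * c for A, c in imset.items())

def is_supermodular(m, N):
    """Check m(X u Y) + m(X n Y) >= m(X) + m(Y) for all X, Y subsets of N."""
    N = sorted(N)
    subsets = [fs(c) for r in range(len(N) + 1) for c in combinations(N, r)]
    return all(
        m.get(X | Y, 0) + m.get(X & Y, 0) >= m.get(X, 0) + m.get(Y, 0)
        for X in subsets for Y in subsets
    )

def implies(u_imset, v_imset, rays):
    """u -> v iff every ray m with <m, v> > 0 also has <m, u> > 0."""
    return all(scalar(m, u_imset) > 0 for m in rays if scalar(m, v_imset) > 0)

decomposition = implies(u("a", "bc", ""), u("a", "b", ""), RAYS_3)  # a _||_ bc  =>  a _||_ b
unrelated = implies(u("a", "b", ""), u("a", "c", ""), RAYS_3)       # a _||_ b  =/=>  a _||_ c
```

In the second case the function δ_N + δ_ac is a falsifying witness: its scalar product with u〈a,b|∅〉 is 0, while its scalar product with u〈a,c|∅〉 is 1.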


The decomposition approach: verification algorithms

Testing whether an imset u is combinatorial can be done recursively, by checking, for each v ∈ E(N), whether u − v ∈ C(N).

This is because the sum Σ_{v∈E(N)} k_v in u = Σ_{v∈E(N)} k_v · v only depends on the imset u! This invariant is called the degree of (a combinatorial imset) u and can be computed beforehand.

The first verification algorithm for testing u_L ⇀ u〈X,Y|Z〉 was based on testing whether, for some k ∈ ℕ, k · u_L − u〈X,Y|Z〉 can be decomposed into elementary imsets.

Thus, to confirm u_L ⇀ u〈X,Y|Z〉 it suffices to find a decomposition like that. However, this approach does not allow one to disprove the implication (unless one is sure that C(N) = S(N) and has the upper limit for k).
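The recursive test can be sketched as follows. This is an illustration under assumptions, not the algorithm of the cited papers: imsets are dictionaries from frozensets to integers, all names are hypothetical, and the degree is computed as 〈m∗, u〉 with m∗(A) = |A|(|A| − 1)/2, a linear functional that takes the value 1 on every elementary imset and therefore equals Σ k_v on any combinatorial imset. The search is exponential in the degree and is meant for small examples only.

```python
from itertools import combinations
from math import comb

def combine(*terms):
    """Integer combination sum(k * imset) of (k, imset) pairs; zero entries are dropped."""
    out = {}
    for k, imset in terms:
        for A, c in imset.items():
            out[A] = out.get(A, 0) + k * c
    return {A: c for A, c in out.items() if c != 0}

def u(X, Y, Z):
    """Imset u<X,Y|Z> = delta_XYZ + delta_Z - delta_XZ - delta_YZ."""
    X, Y, Z = frozenset(X), frozenset(Y), frozenset(Z)
    return combine((1, {X | Y | Z: 1}), (1, {Z: 1}), (-1, {X | Z: 1}), (-1, {Y | Z: 1}))

def elementary_imsets(N):
    """All u<a,b|C> with distinct a, b in N and C a subset of N without a, b."""
    N = sorted(N)
    out = []
    for a, b in combinations(N, 2):
        rest = [x for x in N if x not in (a, b)]
        for r in range(len(rest) + 1):
            out.extend(u([a], [b], C) for C in combinations(rest, r))
    return out

def degree(imset):
    """<m*, u> with m*(A) = |A| choose 2; equals 1 on every elementary imset."""
    return sum(c * comb(len(A), 2) for A, c in imset.items())

def is_combinatorial(imset, elementary):
    """Recursive test: u is in C(N) iff u = 0, or u - v is in C(N) for some elementary v."""
    if not imset:
        return True
    if degree(imset) <= 0:  # a non-zero combinatorial imset has positive degree
        return False
    return any(is_combinatorial(combine((1, imset), (-1, v)), elementary) for v in elementary)

E = elementary_imsets("ABCD")
u_L = combine((1, u("A", "B", "C")), (1, u("A", "C", "D")), (1, u("A", "D", "B")))

# k = 1: u_L - u<A,C|B> decomposes as u<A,D|C> + u<A,B|D>, which confirms the implication.
confirmed = is_combinatorial(combine((1, u_L), (-1, u("A", "C", "B"))), E)
# u<A,B|0> - u<A,C|0> is non-zero with degree 0, hence not combinatorial.
rejected = is_combinatorial(combine((1, u("A", "B", "")), (-1, u("A", "C", ""))), E)
```

A `False` answer for a single k only says that this particular multiple is not combinatorial, in line with the remark that the approach alone cannot disprove an implication.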


The minimal integral Hilbert basis for R(N)

Thus, for the decomposition approach, an important open question was whether the combinatorial and structural imsets coincide or not. The equality C(N) = S(N) was shown for |N| ≤ 4 (Studeny 1991).

Nevertheless, Hemmecke, Morton, Shiu, Sturmfels and Wienand (2008) have recently shown that C(N) ≠ S(N) for |N| = 5!

Therefore, the following concept became the topic of immediate interest:

Definition (Hilbert basis)

The minimal integral Hilbert basis generating R(N) is the least finite set H(N) of imsets such that an imset u belongs to R(N) [≡ u ∈ S(N)] iff

$$u = \sum_{v \in H(N)} k_v \cdot v \quad \text{with } k_v \in \mathbb{Z}_+ .$$

Note the Hilbert basis exists for any pointed rational cone (Schrijver 1986).


Why is the Hilbert basis H(N) useful?

Of course, it opens the way to potential testing of structural imsets by the decomposition approach.

Moreover, the existence of H(N) implies this:

Observation

There exists the smallest constant n∗ ∈ ℕ depending on |N| such that u ∈ S(N) iff n∗ · u ∈ C(N).

Thus, once one knows n∗ one need not know H(N) and can apply the decomposition approach to testing standard imsets anyway.

Recent achievement: The Hilbert basis for |N| = 5 was computed by Raymond Hemmecke and Matthias Koppe. It has 1255 elements and the value of the constant is only n∗ = 2!


The limit for the other multiplicative factor

Recall that, for u, v ∈ S(N), u ⇀ v iff ∃ k ∈ ℕ with k · u − v ∈ S(N).

Observation

There exists the smallest constant k∗ ∈ ℕ depending on |N| such that ∀ u ∈ S(N) and v ∈ E(N) one has

u ⇀ v iff k∗ · u − v ∈ S(N).

We know that k∗ = 1 if |N| ≤ 4, and k∗ = 7 for |N| = 5.

The exact value of the constant can be deduced from the “matrix” of the values {〈m, v〉 ; m ∈ K⋄ℓ(N), v ∈ E(N)}.

Thus, at the moment, I am not able to compute k∗ without computing the facets of the cone R(N).


Linear programming approach

Another promising idea is to utilize the linear programming (LP) approach.

1 Given u, v ∈ S(N) introduce

M = sup {〈m, v〉 ; m ∈ Kℓ(N), 〈m, u〉 = 0}.

Then M = 0 iff u ⇀ v. This is a classic LP problem because the domain is specified in the form of a polyhedron.

2 Another option (mentioned by R. Hemmecke) is as follows. We put

k_{u⇀v} = inf K_{u⇀v}, where K_{u⇀v} = {k ∈ R; k · u − v ∈ R(N)}.

Then u ⇀ v iff k_{u⇀v} < ∞.

One has either K_{u⇀v} = ∅ or K_{u⇀v} = [k_{u⇀v}, ∞) with 0 ≤ k_{u⇀v} ∈ Q.

There are methods for computing the solution of the second LP problem which do not need the facets of R(N), and are fine with the description of R(N) in the form of its extreme rays.
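One way to spell out the second problem as an explicit linear program, assuming R(N) is described as the cone generated by the elementary imsets E(N). The multipliers λ_w are auxiliary variables introduced here for illustration; they do not appear in the talk.

```latex
\begin{aligned}
k_{u \rightharpoonup v} \;=\; \min_{k,\,\lambda} \quad & k \\
\text{subject to} \quad
  & k \cdot u(A) \;-\; \sum_{w \in \mathcal{E}(N)} \lambda_w \, w(A) \;=\; v(A)
    \qquad \text{for all } A \subseteq N, \\
  & \lambda_w \ge 0 \quad \text{for all } w \in \mathcal{E}(N), \qquad k \in \mathbb{R} .
\end{aligned}
```

The program has 2^{|N|} equality constraints and |E(N)| + 1 variables. It is infeasible exactly when the set of admissible k is empty, that is, when u ⇀ v fails.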

Learning graphical models: Bayesian networks

Another source of motivation for the method of imsets was the idea of its possible application to statistical learning of graphical CI structures (= model selection).

The most popular class of graphical CI structures in the area of probabilistic reasoning is the class of statistical models described by acyclic directed graphs (whose nodes correspond to variables).

These models are known in the literature as Bayesian networks (BN); most working probabilistic expert systems are based on the mathematical theory related to BNs.

The geometric view on the problem of structural learning of BNs leads to the study of a special polytope. This is another important polyhedron related to CI structures. It appears to be closely related to the cone R(N).


Score and search method for learning BNs

One of the common methods for structural learning of Bayesian networks from data is the method of maximization of a quality criterion (= score and search method).

By a quality criterion, also named a score metric or a score, is meant a special real function Q of the BN structure, usually represented by a graph G, and of the database D. The value Q(G,D) should quantify how well the BN structure given by G fits the observed database D.

From the mathematical point of view, it leads to the task to maximize the function G ↦ Q(G,D) (for observed D).

There are two important technical requirements on a quality criterion Q brought in connection with the maximization problem. One of them is that Q should be score equivalent (Bouckaert 1995), the other is that Q should be decomposable (Chickering 2002).


Local search methods

The above assumptions allow one to utilize various local search methods for this purpose. The basic idea is that one introduces a neighborhood structure between BN structure representatives.

The point is that, instead of the global maximum of Q, one is trying to find a local maximum with respect to the chosen neighborhood structure by a greedy search technique.

A kind of standard neighborhood relation is the inclusion neighborhood, which comes from the CI interpretation of BN structures.

A typical example of these techniques is the greedy equivalence search (GES) algorithm, which is based just on the inclusion neighborhood (Chickering 2002).
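The idea of a greedy search with respect to a neighborhood structure can be illustrated by a generic sketch. This is not the GES algorithm: the function `greedy_search`, its parameters and the toy search space (integers standing in for BN structures, with a made-up score table) are assumptions for illustration only.

```python
def greedy_search(start, neighbors, score):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current, best = start, score(start)
    while True:
        candidates = [(score(g), g) for g in neighbors(current)]
        if not candidates:
            return current, best
        top_score, top = max(candidates, key=lambda pair: pair[0])
        if top_score <= best:
            return current, best  # local maximum w.r.t. the chosen neighborhood
        current, best = top, top_score

# Toy search space: states 0..9 on a line, each adjacent to its predecessor and successor.
SCORES = [0, 1, 3, 2, 1, 4, 6, 9, 7, 5]

def line_neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(SCORES)]

def toy_score(i):
    return SCORES[i]

stuck = greedy_search(0, line_neighbors, toy_score)   # stops at a local maximum
lucky = greedy_search(5, line_neighbors, toy_score)   # reaches the global maximum
```

Started from state 0, the search stops at state 2 with score 3 although the global maximum 9 sits at state 7. The result of a greedy search depends on both the starting point and the neighborhood relation.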


Algebraic approach to learning BNs

The basic idea of an algebraic approach to learning BN structures is to represent both the BN structure and the database by a real vector.

The algebraic representative of the BN structure given by an acyclic directed graph G is a certain integral vector u_G, called the standard imset (for G). This imset has “many” zero values and, therefore, the memory demands for its computer representation are polynomial in |N|.

The crucial point is that every score equivalent and decomposable criterion Q is an affine function (= linear function plus a constant) of the standard imset. More specifically, one has

Q(G,D) = s^Q_D − 〈t^Q_D, u_G〉, where s^Q_D ∈ R,

t^Q_D is a real vector of the same dimension as u_G and 〈∗, ∗〉 denotes the scalar product. The vector t^Q_D is named the data vector (relative to Q).
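The talk does not state the formula for the standard imset. The sketch below assumes the definition from (Studeny 2005), u_G = δ_N − δ_∅ + Σ_{i∈N} (δ_pa(i) − δ_{i}∪pa(i)), where pa(i) is the set of parents of node i in G. The function names and the toy graphs are illustrative.

```python
def standard_imset(parents):
    """Standard imset of an acyclic directed graph given as {node: iterable of parents}.

    Assumed formula: u_G = delta_N - delta_0 + sum_i (delta_pa(i) - delta_({i} u pa(i))).
    """
    u = {}

    def bump(A, c):
        u[A] = u.get(A, 0) + c

    bump(frozenset(parents), 1)   # + delta_N
    bump(frozenset(), -1)         # - delta_0
    for node, pa in parents.items():
        pa = frozenset(pa)
        bump(pa, 1)
        bump(pa | {node}, -1)
    return {A: c for A, c in u.items() if c != 0}

def affine_score(s, t, u_G):
    """Q = s - <t, u_G> for a constant s and a data vector t given as {frozenset: float}."""
    return s - sum(t.get(A, 0.0) * c for A, c in u_G.items())

chain = standard_imset({"a": [], "b": ["a"], "c": ["b"]})       # a -> b -> c
reversed_chain = standard_imset({"c": [], "b": ["c"], "a": ["b"]})  # c -> b -> a
collider = standard_imset({"a": [], "c": [], "b": ["a", "c"]})  # a -> b <- c
complete = standard_imset({"a": [], "b": ["a"], "c": ["a", "b"]})
```

The two chains describe the same CI structure and get the same imset u〈a,c|b〉, so any criterion of the affine form above assigns them the same value. The collider gets u〈a,c|∅〉, and the complete graph gets the zero imset.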


Geometric view

The algebraic approach can be enriched by a geometric view. One can imagine the set of all standard imsets over a fixed set of variables N as the set of points in the respective Euclidean space.

Theorem (Studeny, Vomlel 2008)

The set of standard imsets over N is the set of vertices of a rational polytope P ⊆ R^{P(N)}. The dimension of the polytope is 2^{|N|} − |N| − 1.

Thus, once one succeeds in describing the above mentioned polytope in the form of a (bounded) polyhedron, one gets a classic task of linear programming, namely to maximize a linear function over a polyhedron.

The relation of P to R(N) is this: R(N) is the conic hull of P. Specifically, the zero imset u = 0 is a distinguished vertex of P and R(N) is the cone generated by the edges of P coming out of it.


Geometric neighborhood

One of the possible interpretations of the simplex method (Schrijver 1998) is that it is a kind of “search” method in which one moves between the vertices of the polyhedron along its edges.

Definition (geometric neighborhood)

We say that two BN structures are geometric neighbors if the line-segment connecting the corresponding standard imsets in R^{P(N)} is an edge of the polytope P (generated by the set of standard imsets).

In (Studeny, Vomlel, Hemmecke 2009) we

observed that the inclusion neighborhood is always (strictly) contained in the geometric one,

gave an example of the failure of the “classic” GES algorithm (based on the inclusion neighborhood) to find the global maximum of Q,

characterized the geometric neighborhood in the case of at most 5 variables.


Example: the geometric neighborhood for three variables

[Figure: graphs over the three variables A, B, C illustrating the geometric neighborhood. The surviving labels are G∅, G•, G ab, G ac and G bc.]


Conclusions: some open questions

There are many open problems motivated by the method of structural imsets; they were gathered in Chapter 9 of (Studeny 2005).

Some of them have already been solved, the other ones wait for a solution. Moreover, as usual, successful solutions “generate” further questions.

These are new questions concerning independence implication:

Can we ever find the Hilbert basis H(N) for |N| ≥ 6?

What are the multiplicative factors n∗ and k∗ for |N| ≥ 6?

Can we successfully test the independence implication by the LP approach?

The further questions concern a potential future LP procedure for structural learning of BNs (by the score and search method).

Is it possible to characterize either facets or edges of the polytope P (generated by standard imsets)?

What are the formulas for data vectors for Bayesian quality criteria?


Some relevant literature

N. Ay, A. Knauf (2006). Maximizing multi-information. Kybernetika 42, n. 5, pp. 517-538.

A. Barvinok (2002). A Course in Convexity. AMS, Providence.

R.R. Bouckaert (1995). Bayesian belief networks: from construction to evidence. PhD thesis, University of Utrecht.

R.R. Bouckaert, M. Studeny (2007). Racing algorithms for conditional independence inference. International Journal of Approximate Reasoning 45, n. 2, pp. 386-401.

D.M. Chickering (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research 3, pp. 507-554.

A.P. Dawid (1979). Conditional independence in statistical theory. Journal of Royal Statistical Society B 41, pp. 1-31.

A.P. Dawid (2001). Separoids: a mathematical framework for conditional independence. Annals of Mathematics and Artificial Intelligence 32, n. 1-4, pp. 335-372.

R. Hemmecke, J. Morton, A. Shiu, B. Sturmfels, O. Wienand (2008). Three counter-examples on semi-graphoids. Combinatorics, Probability and Computing 17, n. 2, pp. 239-257.

F. Matus (2002). Lengths of semigraphoid inferences. Annals of Mathematics and Artificial Intelligence 35, pp. 287-294.

M. Mouchart, J.-M. Rolin (1984). A note on conditional independence with statistical applications. Statistica 45, n. 4, pp. 557-584.

J. Morton, L. Pachter, A. Shiu, B. Sturmfels, O. Wienand (2006). Geometry of rank tests. In Proceedings of PGM’06, pp. 207-214.

J. Pearl (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo.

A. Perez (1977). ε-admissible simplifications of the dependence structure of a set of random variables. Kybernetika 13, pp. 439-449.

A. Schrijver (1986). Theory of Linear and Integer Programming. John Wiley.

W. Spohn (1980). Stochastic independence, causal independence and shieldability. Journal of Philosophical Logic 9, n. 1, pp. 73-99.

R. Strausz (2004). On separoids. PhD thesis, Universidad Nacional Autonoma de Mexico.

M. Studeny (1989). Multiinformation and the problem of characterization of conditional independence relations. Problems of Control and Information Theory 18, n. 1, pp. 3-16.

M. Studeny (1991). Convex set functions I. and II. Research reports n. 1733 and n. 1734, Institute of Information Theory and Automation, Prague, November 1991.

M. Studeny (2005). Probabilistic Conditional Independence Structures. Springer-Verlag, London.

M. Studeny, J. Vomlel (2008). A geometric approach to learning BN structures. In Proceedings of PGM’08, pp. 281-288.

M. Studeny, J. Vomlel, R. Hemmecke (2009). A geometric view on learning Bayesian network structures. Submitted to International Journal of Approximate Reasoning.

