
An Index of Data Size to Extract Decomposable Structures in LAD

Hirotaka Ono

Mutsunori Yagiura

Toshihide Ibaraki

(Kyoto University)

Overview

1. Overview of LAD

2. Decomposability
   - Importance & motivation

3. An index of decomposability
   - Number of data vectors needed to extract reliable decomposable structures
   - Based on probabilistic analyses

4. Numerical experiments

5. Conclusion

Logical Analysis of Data (LAD)

For a phenomenon:

Input: $(T, F)$ with $T, F \subseteq \{0, 1\}^n$

Output: discriminant function

$$f(x) = \begin{cases} 1 & \text{for } x \in T \\ 0 & \text{for } x \in F \end{cases}$$

T: positive examples (the phenomenon occurs)

F: negative examples (the phenomenon does not occur)

f(x): a logical explanation of the phenomenon

Example: influenza (1 = Yes, 0 = No)

     Fever  Headache  Cough  Snivel  Stomachache
     x1     x2        x3     x4      x5
T    1      1         0      1       1
     1      0         1      1       1
     1      1         1      1       0
F    1      0         0      1       1
     1      1         0      0       0
     0      1         0      1       1

T: Set of patients having influenza

F: Set of patients having common cold

An example of discriminant functions: $f(x) = x_1 x_2 x_4 \lor x_1 x_3 x_4$

The discriminant function f(x) represents the knowledge of “influenza”.

One kind of knowledge acquisition
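As a minimal sketch, the influenza example can be checked mechanically. The snippet assumes the reading $f(x) = x_1 x_2 x_4 \lor x_1 x_3 x_4$ of the example discriminant function and takes the first three rows of the table as T and the last three as F; the names `T`, `F`, `f`, and `is_discriminant` are illustrative and not part of the slides.

```python
# Positive examples (influenza) and negative examples (common cold).
# Each vector is (x1, x2, x3, x4, x5) =
# (fever, headache, cough, snivel, stomachache), with 1 = yes and 0 = no.
T = [(1, 1, 0, 1, 1), (1, 0, 1, 1, 1), (1, 1, 1, 1, 0)]
F = [(1, 0, 0, 1, 1), (1, 1, 0, 0, 0), (0, 1, 0, 1, 1)]


def f(x):
    """Candidate discriminant function f(x) = x1 x2 x4 OR x1 x3 x4."""
    x1, x2, x3, x4, _x5 = x
    return int((x1 and x2 and x4) or (x1 and x3 and x4))


def is_discriminant(func, positives, negatives):
    """True if func is 1 on every positive and 0 on every negative example."""
    return (all(func(x) == 1 for x in positives)
            and all(func(x) == 0 for x in negatives))


print(is_discriminant(f, T, F))
```

Any function that passes this check is a valid discriminant function for (T, F); the guidelines that follow are about choosing among the many functions that do.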

Guideline to find a discriminant function

• Simplicity

• Explain the structure of the phenomenon

Decomposability

     x1  x2  x3  x4  x5   h(x[S1])
T    1   1   0   1   1    1
     1   0   1   1   1    1
     1   1   1   1   0    1
F    1   0   0   1   1    0
     1   1   0   0   0    1
     0   1   0   1   1    1

$S_0 = \{1, 4, 5\}$, $S_1 = \{2, 3\}$

$h(x[S_1]) = x_2 \lor x_3$

$f(x) = x_1 x_2 x_4 \lor x_1 x_3 x_4 = x_1 x_4 \, h(x[S_1])$: decomposable!

f is decomposable $\Leftrightarrow$ $f(x) = g(x[S_0], h(x[S_1]))$

(T, F) is decomposable $\Leftrightarrow$ $\exists$ decomposable discriminant f
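The decomposition in this example can be verified exhaustively. The sketch below assumes $h(x[S_1]) = x_2 \lor x_3$ and $g(x[S_0], h) = x_1 x_4 h$, as read from the slide, and checks that $g(x[S_0], h(x[S_1]))$ agrees with $f$ on all 32 vectors of $\{0,1\}^5$. The helper `project` is a hypothetical name for the restriction $x[S]$.

```python
from itertools import product

S0 = (1, 4, 5)  # attributes (1-based) passed directly to g
S1 = (2, 3)     # attributes (1-based) summarized by the sub-function h


def project(x, S):
    """Restriction x[S]: keep the components of x whose 1-based index is in S."""
    return tuple(x[i - 1] for i in S)


def f(x):
    """f(x) = x1 x2 x4 OR x1 x3 x4."""
    x1, x2, x3, x4, _x5 = x
    return int((x1 and x2 and x4) or (x1 and x3 and x4))


def h(x_s1):
    """Sub-concept h(x[S1]) = x2 OR x3."""
    x2, x3 = x_s1
    return int(x2 or x3)


def g(x_s0, h_value):
    """Outer function g(x[S0], h) = x1 x4 h (x5 is not used)."""
    x1, x4, _x5 = x_s0
    return int(x1 and x4 and h_value)


agree = all(
    f(x) == g(project(x, S0), h(project(x, S1)))
    for x in product((0, 1), repeat=5)
)
print(agree)
```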

Example: concept of “square”

       x1  x2  x3  x4
T  i    1   1   1   0
   ii   1   1   1   1
F  iii  0   1   1   0
   iv   1   0   0   1
   v    1   1   0   1

x1: the lengths of all edges are equal

x2: the number of vertices is 4

x3: contains a right angle

x4: the area is over 100

Example: concept of “square”

Flat description:

Square
- the lengths of all edges are equal
- the number of vertices is 4
- contains a right angle

Hierarchical description:

Square
- contains a right angle
- rhombus
  - the lengths of all edges are equal
  - the number of vertices is 4

Hierarchical structures and decomposable structures

[Figure: a flat structure, in which the concept $f(x)$ is connected directly to all attributes, next to a hierarchical structure, in which the concept $f(x) = g(x[S_0], h(x[S_1]))$ is connected to the attributes in $S_0$ and to a sub-concept $h(x[S_1])$ built from the attributes in $S_1$.]

Previous research on decomposability

• Finding basic decomposable functions (e.g., $g(x[S_0], h(x[S_1]))$) for given $(T, F)$ and attribute sets $(S_0, S_1)$
  • $g(x[S_0], h(x[S_1]))$ case: polynomial time [Boros, et al. 1994]

• Finding other classes (positive, Horn, and their mixtures) of decomposable functions for given $(T, F)$ and attribute sets $(S_0, S_1)$ [Makino, et al. 1995]

• Finding a (positive) decomposable function for given $(T, F)$ ($(S_0, S_1)$ is not given)
  • NP-hard
  • proposing a heuristic algorithm [Ono, et al. 1999]

The number of data and decomposable structures

• Case 1: The size of the given data is small.
  – Advantage: Less computational time is needed to find a decomposable structure.
  – Disadvantage: Decomposable structures easily exist in the data (because there are fewer constraints), so most decomposable structures are deceptive.

• Case 2: The size of the given data is large.
  – Advantage: Deceptive decomposable structures will not be found.
  – Disadvantage: More computational time is needed.

How many data vectors should be prepared to extract real decomposable structures?

Index of decomposability

(T, F) is decomposable $\Leftrightarrow$ the conflict graph of (T, F) is bipartite

Overview of our approach

Assume that (T, F) is a set of l vectors chosen randomly from $\{0, 1\}^n$.

1. Compute the probability that an edge appears in the conflict graph.

2. Regard the conflict graph as a random graph.

Then investigate the probability that the conflict graph is non-bipartite.

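Conflict graph

       S0       S1
T    1 0 0    1 1
     0 1 0    1 0
     1 0 0    1 0
F    1 0 0    0 1
     0 1 0    1 1
     0 1 0    0 1

[Figure: conflict graph whose vertices are the values 00, 01, 10, 11 of $x[S_1]$.]

Suppose $h(11) = 1$ for $h(x[S_1])$. Then $h(01) = 0$, $h(10) = 1$, and $h(11) = 0$ follow in turn, which is a contradiction.

(T, F) is decomposable $\Leftrightarrow$ the conflict graph of (T, F) is bipartite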
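The equivalence above suggests a direct test for a fixed partition $(S_0, S_1)$: build the conflict graph and 2-color it. The following is a minimal sketch, not the authors' implementation. It assumes that the first three vectors of the slide's example form T and the last three form F (an assumption that is consistent with the contradiction chain $h(11), h(01), h(10), h(11)$), and it uses 0-based attribute indices. The function names are illustrative.

```python
from collections import deque
from itertools import product


def conflict_graph(T, F, S0, S1):
    """Edge set of the conflict graph of (T, F) for the partition (S0, S1).

    An edge {a, b} is added when some y has ya on one side of (T, F)
    and yb on the other side.
    """
    def split(x):
        return tuple(x[i] for i in S0), tuple(x[i] for i in S1)

    edges = set()
    for x_pos, x_neg in product(T, F):
        (y_pos, a), (y_neg, b) = split(x_pos), split(x_neg)
        if y_pos == y_neg and a != b:
            edges.add(frozenset((a, b)))
    return edges


def is_bipartite(vertices, edges):
    """Breadth-first 2-coloring; returns False as soon as an odd cycle is hit."""
    adj = {v: set() for v in vertices}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    color = {}
    for start in vertices:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False
    return True


# Example data from the slide; the T/F split is an assumption (see above).
T = [(1, 0, 0, 1, 1), (0, 1, 0, 1, 0), (1, 0, 0, 1, 0)]
F = [(1, 0, 0, 0, 1), (0, 1, 0, 1, 1), (0, 1, 0, 0, 1)]
S0, S1 = (0, 1, 2), (3, 4)  # 0-based attribute indices

vertices = list(product((0, 1), repeat=len(S1)))
edges = conflict_graph(T, F, S0, S1)
print(sorted(sorted(e) for e in edges))
print(is_bipartite(vertices, edges))
```

Under the assumed split, the three edges form a triangle on 01, 10, and 11, so the check reports a non-bipartite graph, matching the contradiction derived on the slide.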

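Probability of an edge to appear in conflict graph

[Figure: two vectors $ya$ and $yb$ that share the same $S_0$ part $y$ and have $S_1$ parts $a$ and $b$, one of them in T and the other in F.]

Here $ya$ denotes the vector whose $S_0$ part is $y$ and whose $S_1$ part is $a$.

A pair of vectors $(ya, yb)$ is called linked if

$$ya \in T \text{ and } yb \in F, \quad \text{or} \quad ya \in F \text{ and } yb \in T.$$

Edge $e = (a, b)$ appears in the conflict graph $\Leftrightarrow$ there exists a linked pair $(ya, yb)$.

Define a random variable $X_e$ by

$$X_e = \sum_{y \in \{0,1\}^{S_0}} X_{ey}, \quad \text{where } X_{ey} = \begin{cases} 1 & (ya, yb) \text{ is linked,} \\ 0 & \text{otherwise.} \end{cases}$$

$X_e \ge 1$ $\Leftrightarrow$ edge $e$ appears in the conflict graph.

We want to compute

$$\Pr(X_e \ge 1) = \Pr\Bigl(\sum_{y \in \{0,1\}^{S_0}} X_{ey} \ge 1\Bigr).$$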

Assumptions

• Generation of (T, F)

- $|T| + |F| = l$ vectors are randomly sampled from $\{0, 1\}^n$ without replacement.

- A sampled vector is in T with probability $p$, and in F with probability $q = 1 - p$.

• $M = 2^n$

• $m = 2^{|S_0|}$

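How to compute $\Pr(X_e \ge 1) = \Pr\bigl(\sum_{y \in \{0,1\}^{S_0}} X_{ey} \ge 1\bigr)$

$\Pr(X_{ey} = 1)$ is easier to compute.

$X_{ey} = 1$ (where $e = (a, b)$) $\Leftrightarrow$

1. Both of $ya$ and $yb$ are chosen in $T \cup F$.

2. They have different values (i.e., 0 and 1).

$$\Pr(X_{ey} = 1) = \underbrace{\frac{l(l-1)}{M(M-1)}}_{1.} \cdot \underbrace{2pq}_{2.}$$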
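The closed form for $\Pr(X_{ey} = 1)$ can be sanity-checked by simulation under the stated sampling model. The sketch below is only an illustration: vectors are encoded as integers in `range(M)`, the integers 0 and 1 play the roles of the fixed pair $ya$ and $yb$, and the function names, trial count, and seed are arbitrary choices rather than anything from the slides.

```python
import random


def linked_pair_probability(l, M, p):
    """Closed form 2pq * l(l-1) / (M(M-1)) for one fixed pair (ya, yb)."""
    q = 1.0 - p
    return 2.0 * p * q * l * (l - 1) / (M * (M - 1))


def simulate_linked_pair(l, M, p, trials=100_000, seed=0):
    """Monte Carlo estimate of the probability that a fixed pair is linked."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Sample l distinct vectors, then put each in T with probability p.
        sample = rng.sample(range(M), l)
        in_T = {v: rng.random() < p for v in sample}
        # Linked: both vectors were sampled and they lie on opposite sides.
        if 0 in in_T and 1 in in_T and in_T[0] != in_T[1]:
            hits += 1
    return hits / trials


exact = linked_pair_probability(l=8, M=32, p=0.5)
estimate = simulate_linked_pair(l=8, M=32, p=0.5)
print(exact, estimate)
```

The first factor of the formula is the chance that both vectors are drawn when sampling without replacement, and the second is the chance that two independently labeled vectors receive different labels.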

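Upper and lower bounds on $\Pr(X_e \ge 1)$

$$\Pr(X_{ey} = 1) = \frac{2pq \, l(l-1)}{M(M-1)}$$

Upper bound: by Markov’s inequality and linearity of expectation,

$$\Pr(X_e \ge 1) \le \mathrm{Ex}(X_e) = \mathrm{Ex}\Bigl(\sum_{y \in \{0,1\}^{S_0}} X_{ey}\Bigr) = \sum_{y \in \{0,1\}^{S_0}} \mathrm{Ex}(X_{ey}) = m \, \mathrm{Ex}(X_{ey}) = m \Pr(X_{ey} = 1) = \frac{2pqm \, l(l-1)}{M(M-1)}$$

Lower bound: by the principle of inclusion and exclusion,

$$\Pr(X_e \ge 1) \ge \sum_{y \in \{0,1\}^{S_0}} \Pr(X_{ey} = 1) - \sum_{\substack{y, y' \in \{0,1\}^{S_0} \\ y \ne y'}} \Pr(X_{ey} = X_{ey'} = 1)$$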

Approximation of $\Pr(X_e \ge 1)$

For small […],

$$\Pr(X_e \ge 1) \approx \frac{2pqm \, l(l-1)}{M(M-1)} \approx \frac{2pqm \, l^2}{M^2}$$

holds.
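To get a feel for how close the simplified expression $2pqm\,l^2/M^2$ is to the Markov upper bound $2pqm\,l(l-1)/(M(M-1))$, the two can be evaluated side by side. This is a sketch with illustrative parameter values ($n = 10$, $|S_0| = 5$, $p = 0.5$) that do not come from the slides; the function names are hypothetical.

```python
def edge_upper_bound(l, n, s0_size, p):
    """Markov upper bound on Pr(X_e >= 1): 2pqm * l(l-1) / (M(M-1))."""
    M, m, q = 2 ** n, 2 ** s0_size, 1.0 - p
    return 2.0 * p * q * m * l * (l - 1) / (M * (M - 1))


def edge_approximation(l, n, s0_size, p):
    """Simplified expression used for the index: 2pqm * l^2 / M^2."""
    M, m, q = 2 ** n, 2 ** s0_size, 1.0 - p
    return 2.0 * p * q * m * l ** 2 / M ** 2


for l in (10, 45, 100):
    ub = edge_upper_bound(l, n=10, s0_size=5, p=0.5)
    approx = edge_approximation(l, n=10, s0_size=5, p=0.5)
    print(l, ub, approx, ub / approx)
```

The ratio of the two expressions is $(1 - 1/l) \cdot M/(M-1)$, so they differ only by lower-order terms once $l$ and $M$ are moderately large.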

Random graph

Random graph G(N, r)

- N: the number of vertices

- Each edge e = (u, v) appears in G(N, r) with probability r independently

[Figure: example random graphs for r = 0, an intermediate value of r, and r = 1.]

In our analysis, r is assumed to be the probability of an edge to appear in the conflict graph.
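A random graph in the sense of G(N, r) can be sampled in a few lines. The sketch below is only an illustration of the definition; `random_graph` is a hypothetical helper name, and vertices are labeled 0 to N - 1.

```python
import random
from itertools import combinations


def random_graph(N, r, rng):
    """Sample G(N, r): keep each of the N(N-1)/2 possible edges with probability r."""
    return [(u, v) for u, v in combinations(range(N), 2) if rng.random() < r]


rng = random.Random(0)
for r in (0.0, 0.3, 1.0):
    edges = random_graph(6, r, rng)
    print(r, len(edges))
```

With r = 0 the graph has no edges and with r = 1 it is the complete graph, which are the two extreme cases of the definition.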

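Probability of a random graph to be non-bipartite

$Y_{\mathrm{odd}}$: Random variable representing the number of odd cycles in G(N, r)

$\Pr(Y_{\mathrm{odd}} \ge 1)$: Probability that G(N, r) is not bipartite

By Markov’s inequality,

$$\Pr(Y_{\mathrm{odd}} \ge 1) \le \mathrm{Ex}(Y_{\mathrm{odd}}) = \sum_{\substack{3 \le k \le N \\ k:\,\mathrm{odd}}} \frac{(N)_k}{2k} \, r^k$$

$(N)_k = N(N-1)\cdots(N-k+1)$: The number of sequences of k vertices

$$\mathrm{Ex}(Y_{\mathrm{odd}}) = \sum_{\substack{3 \le k \le N \\ k:\,\mathrm{odd}}} \frac{(N)_k}{2k} \, r^k \le \sum_{\substack{k \ge 3 \\ k:\,\mathrm{odd}}} \frac{(Nr)^k}{2k} = \frac{1}{2}\left(\frac{1}{2}\ln\frac{1+z}{1-z} - z\right) = U(z)$$

where $z = Nr$ $(0 < z < 1)$, using the Taylor series of $\ln(1 \pm z)$.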

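When does $\mathrm{Ex}(Y_{\mathrm{odd}}) < 1$ hold?

Upper bound:

$$Nr \le 0.9950 \;\Rightarrow\; \mathrm{Ex}(Y_{\mathrm{odd}}) < 1$$

Lower bound when $Nr \ge 1$:

$$\mathrm{Ex}(Y_{\mathrm{odd}}) = \sum_{\substack{3 \le k \le N \\ k:\,\mathrm{odd}}} \frac{(N)_k}{2k} \, r^k \ge \sum_{\substack{3 \le k \le N^{0.5-\varepsilon} \\ k:\,\mathrm{odd}}} \frac{c \, (Nr)^k}{2k} \ge \frac{c}{2} \sum_{\substack{3 \le k \le N^{0.5-\varepsilon} \\ k:\,\mathrm{odd}}} \frac{1}{k} = \frac{c\,(0.5-\varepsilon)}{4} \ln N + O(1)$$

($c \in [0, 1)$ and $\varepsilon \in (0, 0.5)$ are constants)

$\mathrm{Ex}(Y_{\mathrm{odd}}) \to \infty$ as $N \to \infty$ if $Nr \ge 1$

For sufficiently large N, $Nr \ge 1 \;\Rightarrow\; \mathrm{Ex}(Y_{\mathrm{odd}}) \ge 1$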
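The threshold behavior around $Nr = 1$ can be explored numerically. The sketch below evaluates the sum $\sum_{k\ \mathrm{odd},\ 3 \le k \le N} (N)_k r^k / (2k)$ directly and compares it with the closed-form bound $U(z) = \frac{1}{2}\bigl(\frac{1}{2}\ln\frac{1+z}{1-z} - z\bigr)$ for $z = Nr < 1$. The function names and the sample value N = 1000 are illustrative choices, not taken from the slides.

```python
import math


def expected_odd_cycles(N, r):
    """Ex(Y_odd): sum over odd k in [3, N] of (N)_k * r^k / (2k)."""
    total = 0.0
    term = 1.0  # running value of (N)_k * r^k
    for k in range(1, N + 1):
        term *= (N - k + 1) * r
        if k >= 3 and k % 2 == 1:
            total += term / (2 * k)
    return total


def U(z):
    """Upper bound (1/2) * ((1/2) ln((1+z)/(1-z)) - z), valid for 0 < z < 1."""
    return 0.5 * (0.5 * math.log((1 + z) / (1 - z)) - z)


N = 1000
for z in (0.5, 0.9, 0.995):
    print(z, expected_odd_cycles(N, z / N), U(z))
for z in (1.0, 1.5):
    print(z, expected_odd_cycles(N, z / N))
```

Solving $U(z) = 1$ numerically gives a value just above 0.995, which is where the constant 0.9950 in the upper bound comes from.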

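Our index

Assumptions

- probabilities p and q are given by $p : q = |T| : |F|$

- conflict graph is a random graph

Probability of an edge to appear in conflict graph:

$$\Pr(X_e \ge 1) \approx \frac{2pqm \, l^2}{M^2}$$

Threshold for a random graph to be bipartite or not:

$$Nr = 1$$

With $M = 2^n$, $m = 2^{|S_0|}$, $l = |T| + |F|$, $N = 2^{|S_1|}$, $r = \Pr(X_e \ge 1)$, and $|S_0| + |S_1| = n$:

$$Nr = 2^{|S_1|} \Pr(X_e \ge 1) \approx \frac{M}{m} \cdot \frac{2pqm \, l^2}{M^2} = \frac{2pq \, l^2}{2^n} = 1 \;\Rightarrow\; l = \sqrt{2^{n-1}/pq}$$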

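Our index

$$|T| + |F| \quad \text{vs.} \quad \sqrt{2^{n-1}/pq}, \qquad p : q = |T| : |F|, \; p + q = 1$$

• If $|T| + |F| < \sqrt{2^{n-1}/pq}$, $(T, F)$ tends to have many deceptive decomposable structures.

• If $|T| + |F| \ge \sqrt{2^{n-1}/pq}$, $(T, F)$ tends to have no deceptive decomposable structure.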
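As a worked illustration, the index $\sqrt{2^{n-1}/pq}$ can be evaluated for the parameter settings that appear in the experiments, both as a number of vectors and as a sampling ratio relative to $2^n$. The helper names are hypothetical, and the printed values are computed here rather than read off the slides.

```python
import math


def data_size_index(n, p):
    """Proposed index sqrt(2^(n-1) / (p q)) with q = 1 - p."""
    q = 1.0 - p
    return math.sqrt(2 ** (n - 1) / (p * q))


def sampling_ratio_percent(n, p):
    """The index expressed as a percentage of all 2^n vectors."""
    return 100.0 * data_size_index(n, p) / 2 ** n


# (n, p) pairs taken from the experiment settings; q = 1 - p in each case.
for n, p in ((10, 0.5), (10, 0.9), (20, 0.5), (11, 0.730)):
    print(n, p, data_size_index(n, p), sampling_ratio_percent(n, p))
```

Because the index grows like $2^{n/2}$ while the number of possible vectors grows like $2^n$, the sampling ratio at the threshold shrinks as the dimension increases.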

Numerical Experiments

Randomly generated data
- Target functions are not decomposable
- Dimensions of data are n = 10, 20
- Two types of data: p and q are biased and not biased

1. Prepare non-decomposable randomly generated functions and construct 10 $(T, F)$ for each data size ($|T| + |F|$)

2. Check their decomposability
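An experiment of this kind could be sketched as follows. This is a simplified stand-in, not the authors' setup: instead of labeling the sampled vectors with a non-decomposable target function, it labels each vector as positive with probability p (the probabilistic model of the analysis), and it tests decomposability only for one fixed partition $(S_0, S_1)$ through bipartiteness of the conflict graph. All names and parameter values are assumptions made for illustration.

```python
import random
from itertools import product


def is_decomposable(T, F, S0, S1):
    """True if the conflict graph of (T, F) for the fixed (S0, S1) is bipartite."""
    groups = {}  # y -> list of (a, label) with label 1 for T and 0 for F
    for label, vectors in ((1, T), (0, F)):
        for x in vectors:
            y = tuple(x[i] for i in S0)
            a = tuple(x[i] for i in S1)
            groups.setdefault(y, []).append((a, label))
    adj = {}
    for group in groups.values():
        for (a, la), (b, lb) in product(group, repeat=2):
            if la != lb:  # ya and yb lie on opposite sides: edge {a, b}
                adj.setdefault(a, set()).add(b)
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    stack.append(v)
                elif color[v] == color[u]:
                    return False
    return True


def decomposable_ratio(n, s0_size, l, p, trials, rng):
    """Fraction of random (T, F) of size l that are decomposable for (S0, S1)."""
    S0, S1 = tuple(range(s0_size)), tuple(range(s0_size, n))
    count = 0
    for _ in range(trials):
        T, F = [], []
        for code in rng.sample(range(2 ** n), l):  # sampling without replacement
            x = tuple((code >> i) & 1 for i in range(n))
            (T if rng.random() < p else F).append(x)
        count += is_decomposable(T, F, S0, S1)
    return count / trials


rng = random.Random(0)
for l in (10, 30, 45, 60, 100):
    print(l, decomposable_ratio(n=10, s0_size=5, l=l, p=0.5, trials=200, rng=rng))
```

Plotting the returned ratio against the sampling ratio $l / 2^n$ gives curves of the same kind as those described in the figures, although the exact shape depends on the simplifications noted above.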

Randomly generated data

[Figure: $n = 10$, $(p, q) = (0.5, 0.5)$. Horizontal axis: sampling ratio (%). Vertical axis: ratio of decomposable (T, F)s (%). The position of our index is marked.]

Randomly generated data

[Figure, two panels: $n = 10$, $(p, q) = (0.9, 0.1)$ and $n = 20$, $(p, q) = (0.5, 0.5)$. Horizontal axis: sampling ratio (%). Vertical axis: ratio of decomposable (T, F)s (%). The position of our index is marked.]

Real-world data

- Breast Cancer in Wisconsin (a.k.a. BCW)
- Already binarized
- The dimension is n = 11
- Comparison with randomly generated data with the same n, p and q

BCW and randomly generated data

[Figure, two panels: BCW and randomly generated data, both with $n = 11$, $(p, q) = (0.730, 0.270)$. Horizontal axis: sampling ratio (%). Vertical axis: ratio of decomposable (T, F)s (%). The position of our index is marked.]

Discussion and conclusion

An index to extract reliable decomposable structures:

$$|T| + |F| \ge \sqrt{2^{n-1}/pq}, \qquad p : q = |T| : |F|, \; p + q = 1$$

Computational experiments on random & real-world data

- proposed index is a good estimate

- $|S_0| = 1$ or $|S_1| = 2$ $\Rightarrow$ threshold behavior is not clear

Future work

- Analyses on sharpness of the threshold behavior: to know sufficient $|T| + |F|$ to extract reliable decomposable structures

- Apply similar approach to other classes of Boolean functions

[Figure: number of decomposable structures plotted against $|T| + |F|$, marking the proposed index and the data size we want to estimate.]
