An Index of Data Size to Extract Decomposable Structures in LAD. Hirotaka Ono, Mutsunori Yagiura, Toshihide Ibaraki (Kyoto University)


An Index of Data Size to Extract Decomposable Structures in LAD

Hirotaka Ono

Mutsunori Yagiura

Toshihide Ibaraki

(Kyoto University)


Overview

1. Overview of LAD
2. Decomposability
   - Importance & motivation
3. An index of decomposability
   - #data vectors needed to extract reliable decomposable structures
   - Based on probabilistic analyses
4. Numerical experiments
5. Conclusion


Logical Analysis of Data (LAD)

For a phenomenon

Input: $T, F \subseteq \{0, 1\}^n$

- T: positive examples (the phenomenon occurs)
- F: negative examples (the phenomenon does not occur)

Output: discriminant function

$$f(x) = \begin{cases} 1 & \text{for } x \in T \\ 0 & \text{for } x \in F \end{cases}$$

f(x): a logical explanation of the phenomenon


Example: influenza

(1 = Yes, 0 = No)

     x1     x2        x3     x4      x5
     Fever  Headache  Cough  Snivel  Stomachache
T    1      1         0      1       1
     1      0         1      1       1
     1      1         1      1       0
F    1      0         0      1       1
     1      1         0      0       0
     0      1         0      1       1

T: Set of patients having influenza
F: Set of patients having common cold

An example of discriminant functions: $f(x) = x_1 x_2 x_4 \vee x_1 x_3 x_4$

Discriminant function f (x) represents knowledge “influenza”.

One kind of knowledge acquisition
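The influenza example can be checked mechanically. The following is a minimal sketch, assuming the first three rows of the table form T and the last three form F; the helper name `is_discriminant` is hypothetical and not taken from the slides.

```python
# Rows are (x1, ..., x5) = (fever, headache, cough, snivel, stomachache).
# Assumption: the first three table rows are T, the last three are F.
T = [(1, 1, 0, 1, 1), (1, 0, 1, 1, 1), (1, 1, 1, 1, 0)]  # influenza
F = [(1, 0, 0, 1, 1), (1, 1, 0, 0, 0), (0, 1, 0, 1, 1)]  # common cold


def f(x):
    """f(x) = x1 x2 x4 OR x1 x3 x4 (attribute indices are 1-based on the slide)."""
    x1, x2, x3, x4, _x5 = x
    return int((x1 and x2 and x4) or (x1 and x3 and x4))


def is_discriminant(func, positives, negatives):
    """True if func is 1 on every positive example and 0 on every negative one."""
    return (all(func(x) == 1 for x in positives)
            and all(func(x) == 0 for x in negatives))


print(is_discriminant(f, T, F))
```

Any function that agrees with f on these six vectors is also a discriminant function; the data alone does not single out one explanation.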


Guideline to find a discriminant function

• Simplicity
• Explain the structure of the phenomenon


Decomposability

     x1  x2  x3  x4  x5   h(x[S1])
T    1   1   0   1   1    1
     1   0   1   1   1    1
     1   1   1   1   0    1
F    1   0   0   1   1    0
     1   1   0   0   0    1
     0   1   0   1   1    1

$S_0 = \{1, 4, 5\}$, $S_1 = \{2, 3\}$

$h(x[S_1]) = x_2 \vee x_3$

$f(x) = x_1 x_2 x_4 \vee x_1 x_3 x_4 = x_1 x_4 \, h(x[S_1])$: decomposable!

f is decomposable $\Leftrightarrow$ $f(x) = g(x[S_0], h(x[S_1]))$

(T, F) is decomposable $\Leftrightarrow$ there exists a decomposable discriminant f
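As a quick sanity check of the decomposition on this slide, the sketch below verifies f(x) = g(x[S0], h(x[S1])) on all 32 vectors of {0, 1}^5, with h = x2 OR x3 and g = x1 x4 h. The helper `project` is a hypothetical name introduced only for this illustration.

```python
from itertools import product

S0 = (1, 4, 5)  # 1-based attribute indices, as on the slide
S1 = (2, 3)


def project(x, S):
    """Return x[S]: the components of x at the 1-based indices in S."""
    return tuple(x[i - 1] for i in S)


def f(x):
    x1, x2, x3, x4, _x5 = x
    return int((x1 and x2 and x4) or (x1 and x3 and x4))


def h(z):
    """Sub-concept over S1 = {2, 3}."""
    x2, x3 = z
    return int(x2 or x3)


def g(y, h_value):
    """Outer function over S0 = {1, 4, 5} and the value of h."""
    x1, x4, _x5 = y
    return int(x1 and x4 and h_value)


decomposition_holds = all(
    f(x) == g(project(x, S0), h(project(x, S1)))
    for x in product((0, 1), repeat=5)
)
print(decomposition_holds)
```

Note that g ignores x5 here; S0 only has to contain every attribute that g may use, not exactly the ones it does use.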


Example: concept of “square”

       x1  x2  x3  x4
i      1   1   1   0
ii     1   1   1   1
iii    0   1   1   0
iv     1   0   0   1
v      1   1   0   1

x1: the lengths of all edges are equal
x2: the number of vertices is 4
x3: contains a right angle
x4: the area is over 100

T = {i, ii}, F = {iii, iv, v}


Example: concept of “square”

Square
- the lengths of all edges are equal
- the number of vertices is 4
- contains a right angle

Square
- contains a right angle
- rhombus
  - the lengths of all edges are equal
  - the number of vertices is 4


Hierarchical structures and decomposable structures

(Diagram: the concept f(x) is connected directly to all seven attributes.)


Hierarchical structures and decomposable structures

(Diagram: the concept $f(x) = g(x[S_0], h(x[S_1]))$ is connected to the attributes in $S_0$ and to a sub-concept $h(x[S_1])$, which is in turn connected to the attributes in $S_1$.)


Previous research on decomposability

• Finding basic decomposable functions (e.g., $g(x[S_0], h(x[S_1]))$) for given $(T, F)$ and attribute sets
  • $g(x[S_0], h(x[S_1]))$ case: polynomial time [Boros, et al. 1994]
• Finding other classes (positive, Horn, and their mixtures) of decomposable functions for $(T, F)$ and attribute set [Makino, et al. 1995]
• Finding a (positive) decomposable function for given $(T, F)$ ($(S_0, S_1)$ is not given)
  • NP-hard
  • proposing a heuristic algorithm [Ono, et al. 1999]


The number of data and decomposable structures

• Case 1: The size of given data is small.
  – Advantage: Less computational time is needed to find a decomposable structure.
  – Disadvantage: Decomposable structures easily exist in data (because of fewer constraints), i.e., most decomposable structures are deceptive.


The number of data and decomposable structures

• Case 2: The size of given data is large.
  – Advantage: Deceptive decomposable structures will not be found.
  – Disadvantage: More computational time is needed.

How many data vectors should be prepared to extract real decomposable structures?

Index of decomposability


Overview of our approach

(T, F) is decomposable $\Leftrightarrow$ conflict graph of (T, F) is bipartite

Assume that (T, F) is the set of l randomly chosen vectors from $\{0, 1\}^n$.

1. Compute the probability of an edge to appear in the conflict graph
2. Regard the conflict graph as a random graph

Investigate the probability of the conflict graph to be non-bipartite


Conflict graph

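     S0       S1
T    1 0 0    1 1
     0 1 0    1 0
     1 0 0    1 0
F    1 0 0    0 1
     0 1 0    1 1
     0 1 0    0 1

(Figure: conflict graph on the vertices 00, 01, 10, 11, one vertex per assignment to the attributes in $S_1$.)

Suppose $h(11) = 1$ $\Rightarrow$ $h(01) = 0$ $\Rightarrow$ $h(10) = 1$ $\Rightarrow$ $h(11) = 0$ (a contradiction), where $h = h(x[S_1])$.

(T, F) is decomposable $\Leftrightarrow$ conflict graph of (T, F) is bipartite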


Probability of an edge to appear in conflict graph

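(Figure: two vectors $ya$ and $yb$ that share the same part y on $S_0$ and take the values a and b on $S_1$.)

A pair of vectors $(ya, yb)$ is called linked if

$ya \in T$ and $yb \in F$, or $ya \in F$ and $yb \in T$.

Edge $e = (a, b)$ appears in the conflict graph $\Leftrightarrow$ there exists a linked pair $(ya, yb)$.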
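The edge rule and the bipartiteness test can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, attribute indices are 0-based, and the T/F split of the six example vectors (first three in T, last three in F, with S0 the first three attributes) is an assumption chosen to match the contradiction chain h(11) = 1, h(01) = 0, h(10) = 1, h(11) = 0.

```python
from collections import defaultdict, deque
from itertools import product


def conflict_graph(T, F, S0, S1):
    """Adjacency sets over S1-assignments; (a, b) is an edge iff some y gives
    a linked pair (ya in T and yb in F, or the other way round)."""
    def split(x):
        return tuple(x[i] for i in S0), tuple(x[i] for i in S1)

    t_side, f_side = defaultdict(set), defaultdict(set)
    for x in T:
        y, a = split(x)
        t_side[y].add(a)
    for x in F:
        y, b = split(x)
        f_side[y].add(b)

    adj = defaultdict(set)
    for y in t_side.keys() & f_side.keys():
        for a, b in product(t_side[y], f_side[y]):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def is_bipartite(adj):
    """Breadth-first 2-coloring; returns False as soon as an odd cycle shows up."""
    color = {}
    for start in list(adj):
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return False
    return True


# Assumed reading of the conflict-graph example slide.
T = [(1, 0, 0, 1, 1), (0, 1, 0, 1, 0), (1, 0, 0, 1, 0)]
F = [(1, 0, 0, 0, 1), (0, 1, 0, 1, 1), (0, 1, 0, 0, 1)]
example_graph = conflict_graph(T, F, S0=(0, 1, 2), S1=(3, 4))
print(is_bipartite(example_graph))
```

Under this reading the vertices 01, 10 and 11 form a triangle, so the graph is not bipartite and no h over S1 can work for this partition.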


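Edge $e = (a, b)$ appears in the conflict graph $\Leftrightarrow$ there exists a linked pair $(ya, yb)$.

Define a random variable $X_e$ by

$$X_e = \sum_{y \in \{0,1\}^{S_0}} X_{ey}, \quad \text{where } X_{ey} = \begin{cases} 1 & \text{if } (ya, yb) \text{ is linked,} \\ 0 & \text{otherwise.} \end{cases}$$

$X_e \ge 1$ $\Leftrightarrow$ edge e appears in the conflict graph.

We want to compute

$$\Pr(X_e \ge 1) = \Pr\Bigl(\sum_{y \in \{0,1\}^{S_0}} X_{ey} \ge 1\Bigr).$$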


Assumptions

• Generation of (T, F)

  - |T| + |F| = l vectors are randomly sampled from $\{0, 1\}^n$ without replacement.
  - A sampled vector is in T with probability p, and in F with probability q = 1 − p.

• $M = 2^n$

• $m = 2^{|S_0|}$


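How to compute $\Pr(X_e \ge 1) = \Pr\bigl(\sum_{y \in \{0,1\}^{S_0}} X_{ey} \ge 1\bigr)$

$\Pr(X_{ey} = 1)$ is easier to compute.

$X_{ey} = 1$ (with $e = (a, b)$) $\Leftrightarrow$

1. Both of $ya$ and $yb$ are chosen in $T \cup F$.
2. They have different values (i.e., 0 and 1).

$$\Pr(X_{ey} = 1) = \underbrace{2pq}_{2.} \cdot \underbrace{\frac{l(l-1)}{M(M-1)}}_{1.}$$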
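The closed form for Pr(X_ey = 1) can be compared against a small simulation of the sampling model. This is a sketch under the stated assumptions (l of the M vectors drawn without replacement, each put in T with probability p independently); the function names and the choice of vectors 0 and 1 as stand-ins for ya and yb are illustrative.

```python
import random


def pr_linked(l, M, p):
    """2pq * l(l-1) / (M(M-1)): both vectors are sampled and get different labels."""
    q = 1.0 - p
    return 2.0 * p * q * l * (l - 1) / (M * (M - 1))


def simulate_linked(l, M, p, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability for one fixed pair of vectors."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        chosen = rng.sample(range(M), l)                 # without replacement
        in_T = {v: rng.random() < p for v in chosen}     # True -> T, False -> F
        # Vectors 0 and 1 play the roles of ya and yb.
        if 0 in in_T and 1 in in_T and in_T[0] != in_T[1]:
            hits += 1
    return hits / trials


exact = pr_linked(l=8, M=16, p=0.5)
estimate = simulate_linked(l=8, M=16, p=0.5)
print(exact, estimate)
```

With these small parameters the exact value is 7/60, and the estimate should land close to it; the gap shrinks as the number of trials grows.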


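Upper and lower bounds on $\Pr(X_e \ge 1)$

$$\Pr(X_{ey} = 1) = 2pq \, \frac{l(l-1)}{M(M-1)}$$

Upper Bound: by Markov’s inequality and linearity of expectation,

$$\Pr(X_e \ge 1) \le \mathrm{Ex}(X_e) = \mathrm{Ex}\Bigl(\sum_{y \in \{0,1\}^{S_0}} X_{ey}\Bigr) = \sum_{y \in \{0,1\}^{S_0}} \mathrm{Ex}(X_{ey}) = m \, \mathrm{Ex}(X_{ey}) = m \Pr(X_{ey} = 1) = 2pq \, m \, \frac{l(l-1)}{M(M-1)}$$

Lower Bound: by the principle of inclusion and exclusion,

$$\Pr(X_e \ge 1) \ge \sum_{y \in \{0,1\}^{S_0}} \Pr(X_{ey} = 1) - \sum_{y, y' \in \{0,1\}^{S_0},\ y \ne y'} \Pr(X_{ey} = X_{ey'} = 1)$$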


Approximation of $\Pr(X_e \ge 1)$

$$2pq \, m \, \frac{l(l-1)}{M(M-1)} \approx 2pq \, m \, \frac{l^2}{M^2}$$

When this value is small, $\Pr(X_e \ge 1) \approx 2pq \, m \, l^2 / M^2$ holds.
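To get a feel for how close the approximation 2pq m l^2 / M^2 is to the Markov upper bound 2pq m l(l-1) / (M(M-1)), the following sketch evaluates both. The function names and the example parameters (n = 10, |S0| = 5, l = 45, p = 0.5) are illustrative choices, not values taken from the slides.

```python
def edge_prob_upper(n, s0, l, p):
    """Markov upper bound on Pr(X_e >= 1): 2pq * m * l(l-1) / (M(M-1))."""
    q = 1.0 - p
    M, m = 2 ** n, 2 ** s0
    return 2.0 * p * q * m * l * (l - 1) / (M * (M - 1))


def edge_prob_approx(n, s0, l, p):
    """Simplified form used for the index: 2pq * m * l^2 / M^2."""
    q = 1.0 - p
    M, m = 2 ** n, 2 ** s0
    return 2.0 * p * q * m * l * l / (M * M)


upper = edge_prob_upper(n=10, s0=5, l=45, p=0.5)
approx = edge_prob_approx(n=10, s0=5, l=45, p=0.5)
ratio = upper / approx  # equals (l - 1) / l * M / (M - 1)
print(upper, approx, ratio)
```

The ratio depends only on l and M, so the simplification costs a relative error of roughly 1/l once M is large.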


Random graph

Random graph G(N, r)

- N: the number of vertices
- Each edge e = (u, v) appears in G(N, r) with probability r independently

(Figure: example graphs for different values of r.)

In our analysis, r is assumed to be the probability of an edge to appear in the conflict graph.


Probability of a random graph to be non-bipartite

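$Y_{\mathrm{odd}}$: Random variable representing the number of odd cycles in G(N, r)

$\Pr(Y_{\mathrm{odd}} \ge 1)$: Probability that G(N, r) is not bipartite

By Markov’s inequality,

$$\Pr(Y_{\mathrm{odd}} \ge 1) \le \mathrm{Ex}(Y_{\mathrm{odd}}) = \sum_{k \ge 3:\ \mathrm{odd}} \frac{(N)_k}{2k} \, r^k$$

$(N)_k = N(N-1)\cdots(N-k+1)$: the number of sequences of k vertices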


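Upper bound:

$$\mathrm{Ex}(Y_{\mathrm{odd}}) = \sum_{k \ge 3:\ \mathrm{odd}} \frac{(N)_k}{2k} \, r^k \le \sum_{k \ge 3:\ \mathrm{odd}} \frac{(Nr)^k}{2k} = \frac{1}{2}\Bigl(\frac{1}{2} \ln \frac{1+z}{1-z} - z\Bigr) = U(z)$$

where $z = Nr$ ($0 < z < 1$), by the Taylor series of $\ln(1 \pm z)$.

When does $\mathrm{Ex}(Y_{\mathrm{odd}}) < 1$ hold?

$$Nr \le 0.9950 \ \Rightarrow\ \mathrm{Ex}(Y_{\mathrm{odd}}) < 1$$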
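The constant 0.9950 can be reproduced numerically from the closed form U(z) = (1/2)((1/2) ln((1+z)/(1-z)) - z), using the identity (1/2) ln((1+z)/(1-z)) = artanh(z). The sketch below also compares the closed form with a truncated version of the series; the function names and the bisection routine are illustrative.

```python
import math


def U(z):
    """Closed form of the sum over odd k >= 3 of z**k / (2k), for 0 <= z < 1."""
    return 0.5 * (math.atanh(z) - z)


def U_partial(z, max_k=2001):
    """Truncated series, used only as a numerical cross-check of U."""
    return sum(z ** k / (2 * k) for k in range(3, max_k + 1, 2))


def threshold(target=1.0, lo=0.0, hi=1.0 - 1e-12, iterations=100):
    """Bisection for the z with U(z) = target; U is increasing on [0, 1)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if U(mid) < target:
            lo = mid
        else:
            hi = mid
    return lo


z_star = threshold()
print(U(0.9950), U(0.9951), z_star)
```

U(0.9950) comes out just below 1 and U(0.9951) just above, so the crossing point rounds to the 0.9950 quoted on the slide.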


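Lower bound when $Nr \ge 1$:

$$\mathrm{Ex}(Y_{\mathrm{odd}}) = \sum_{k = 3:\ \mathrm{odd}}^{N} \frac{(N)_k}{2k} \, r^k \ \ge \sum_{k = 3:\ \mathrm{odd}}^{N^{0.5 - \varepsilon}} \frac{c \, (Nr)^k}{2k} \ \ge\ \frac{c \, (0.5 - \varepsilon)}{4} \ln N - O(1)$$

($c \in [0, 1)$ and $\varepsilon \in (0, 0.5)$ are constants)

$\mathrm{Ex}(Y_{\mathrm{odd}}) \to \infty$ as $N \to \infty$ if $Nr \ge 1$

When does $\mathrm{Ex}(Y_{\mathrm{odd}}) < 1$ hold?

- $Nr \le 0.9950 \ \Rightarrow\ \mathrm{Ex}(Y_{\mathrm{odd}}) < 1$
- For sufficiently large N, $Nr \ge 1 \ \Rightarrow\ \mathrm{Ex}(Y_{\mathrm{odd}}) \ge 1$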


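Our index

Assumptions

- probabilities p and q are given by p : q = |T| : |F|
- conflict graph is a random graph

Probability of an edge to appear in conflict graph:

$$\Pr(X_e \ge 1) \approx 2pq \, m \, \frac{l^2}{M^2}$$

Threshold for a random graph to be bipartite or not: $Nr = 1$

With $M = 2^n$, $m = 2^{|S_0|}$, $l = |T| + |F|$, $N = 2^{|S_1|}$ and $r = \Pr(X_e \ge 1)$ (where $|S_0| + |S_1| = n$):

$$1 = Nr = 2^{|S_1|} \Pr(X_e \ge 1) \approx \frac{M}{m} \cdot 2pq \, m \, \frac{l^2}{M^2} = \frac{2pq \, l^2}{2^n}$$

$$l = \sqrt{2^{n-1} / pq}$$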
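One consequence of this derivation is that the threshold does not depend on how the n attributes are split between S0 and S1, because m cancels in N r. A minimal numerical check, with hypothetical function names and an arbitrary example (n = 10, p = 0.3):

```python
import math


def n_times_r(n, s0, l, p):
    """N * r with N = 2**|S1|, r = 2pq * m * l**2 / M**2, m = 2**|S0|, M = 2**n."""
    q = 1.0 - p
    M, m, N = 2 ** n, 2 ** s0, 2 ** (n - s0)
    r = 2.0 * p * q * m * l ** 2 / M ** 2
    return N * r


def index_size(n, p):
    """l = sqrt(2**(n-1) / (pq)), the data size at which N * r = 1."""
    q = 1.0 - p
    return math.sqrt(2 ** (n - 1) / (p * q))


l_star = index_size(n=10, p=0.3)
values = [n_times_r(10, s0, l_star, 0.3) for s0 in range(1, 10)]
print(l_star, values)
```

Every entry of `values` is 1 up to floating-point rounding, whichever |S0| is used.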


Our index

$$|T| + |F| = \sqrt{2^{n-1} / pq}, \quad p : q = |T| : |F|, \quad p + q = 1$$

• If $|T| + |F| < \sqrt{2^{n-1} / pq}$, $(T, F)$ tends to have many deceptive decomposable structures.

• If $|T| + |F| > \sqrt{2^{n-1} / pq}$, $(T, F)$ tends to have no deceptive decomposable structure.
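In practice the index is a one-line computation from n, |T| and |F|. The helper below is a sketch with hypothetical names; it estimates p as |T| / (|T| + |F|), following the assumption p : q = |T| : |F|, and also reports the index as a percentage of all 2^n vectors so it can be compared with a sampling ratio.

```python
import math


def decomposability_index(n, p):
    """Index value sqrt(2**(n-1) / (pq)) with q = 1 - p; requires 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    q = 1.0 - p
    return math.sqrt(2 ** (n - 1) / (p * q))


def sampling_ratio_percent(n, p):
    """The index expressed as a percentage of the 2**n possible vectors."""
    return 100.0 * decomposability_index(n, p) / 2 ** n


def likely_deceptive(size_T, size_F, n):
    """True if |T| + |F| is below the index, with p estimated from the data."""
    l = size_T + size_F
    return l < decomposability_index(n, size_T / l)


# The (n, p) settings named on the experiment slides.
for n, p in [(10, 0.5), (10, 0.9), (20, 0.5), (11, 0.730)]:
    print(n, p, decomposability_index(n, p), sampling_ratio_percent(n, p))
```

For example, `likely_deceptive(10, 10, 10)` is true because 20 vectors are well below the index for n = 10, while 200 vectors are above it.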


Numerical Experiments

1. Prepare non-decomposable randomly generated functions and construct 10 $(T, F)$ for each data size ($|T| + |F|$)
2. Check their decomposability

Randomly generated data

- Target functions are not decomposable
- Dimensions of data are n = 10, 20
- Two types of data: p and q are biased and not biased
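A much-simplified version of such an experiment can be sketched as follows. It is not the authors' procedure: it fixes a single partition (S0 = the high bits, S1 = the low `s1_bits` bits) instead of searching over partitions, and it labels each sampled vector T with probability p rather than using a target function verified to be non-decomposable. All names and parameter values are assumptions made for illustration.

```python
import random
from collections import defaultdict


def is_decomposable(T, F, s1_bits):
    """Bipartiteness of the conflict graph; vectors are ints, S1 = low s1_bits bits."""
    mask = (1 << s1_bits) - 1
    t_side, f_side = defaultdict(set), defaultdict(set)
    for v in T:
        t_side[v >> s1_bits].add(v & mask)
    for v in F:
        f_side[v >> s1_bits].add(v & mask)
    adj = defaultdict(set)
    for y in t_side.keys() & f_side.keys():
        for a in t_side[y]:
            for b in f_side[y]:
                adj[a].add(b)
                adj[b].add(a)
    color = {}
    for start in list(adj):
        if start in color:
            continue
        color[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in color:
                    color[w] = 1 - color[u]
                    stack.append(w)
                elif color[w] == color[u]:
                    return False  # odd cycle found
    return True


def ratio_decomposable(n, s1_bits, l, p, repeats=10, seed=0):
    """Fraction of `repeats` random (T, F) of size l that are decomposable."""
    rng = random.Random(seed)
    count = 0
    for _ in range(repeats):
        T, F = [], []
        for v in rng.sample(range(2 ** n), l):  # without replacement
            (T if rng.random() < p else F).append(v)
        count += is_decomposable(T, F, s1_bits)
    return count / repeats


for l in (10, 20, 45, 90, 180):
    print(l, ratio_decomposable(n=10, s1_bits=5, l=l, p=0.5))
```

With ten repetitions per size the ratios are noisy; the intent is only to show the shape of the loop, not to reproduce the plotted results.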


Randomly generated data

(Plot: n = 10, (p, q) = (0.5, 0.5). Horizontal axis: sampling ratio (%). Vertical axis: ratio of decomposable (T, F)s (%). Our index is marked on the plot.)


Randomly generated data

(Two plots: n = 10, (p, q) = (0.9, 0.1) and n = 20, (p, q) = (0.5, 0.5). Horizontal axis: sampling ratio (%). Vertical axis: ratio of decomposable (T, F)s (%). Our index is marked on the plots.)

Real-world data

Breast Cancer in Wisconsin (a.k.a. BCW)

- Already binarized
- The dimension is n = 11
- Comparison with randomly generated data with the same n, p and q


BCW and randomly generated data

(Two plots: BCW and randomly generated data, both with n = 11, (p, q) = (0.730, 0.270). Horizontal axis: sampling ratio (%). Vertical axis: ratio of decomposable (T, F)s (%). Our index is marked on the plots.)


Discussion and conclusion

• An index to extract reliable decomposable structures:

$$|T| + |F| = \sqrt{2^{n-1} / pq}, \quad p : q = |T| : |F|, \quad p + q = 1$$

• Computational experiments on random & real-world data
  - proposed index is a good estimate
  - $|S_0| = 1$ or $|S_1| = 2$ $\Rightarrow$ threshold behavior is not clear


Future work

• Analyses on sharpness of the threshold behavior: to know sufficient |T| + |F| to extract reliable decomposable structures
• Apply similar approach to other classes of Boolean functions

(Figure: number of decomposable structures versus |T| + |F|, marking the proposed index and the value we want to estimate.)