Page 1

Discovering Markov Blankets: Finding Independencies Among Variables

Motivation: Toward Optimal Feature Selection. Koller and Sahami. Proc. 13th ICML, 1996.

Algorithm: Algorithms for Large Scale Markov Blanket Discovery. Tsamardinos, et al. Proc. 16th FLAIRS, 2003.

Applications: HITON, A Novel Markov Blanket Algorithm for Optimal Variable Selection. Aliferis, et al. TR, Vanderbilt University, DSL-03-08, 2003.

Presented by: Nakul Verma, May 3, 2005.

Page 2

Outline

Motivation

Introduction to Bayesian Networks and Markov Blankets

Markov Blanket Discovery algorithms

IAMB algorithm and results

HITON algorithm and results

Page 3

Outline

Motivation

Introduction to Bayesian Networks and Markov Blankets

Markov Blanket Discovery algorithms

IAMB algorithm and results

HITON algorithm and results

Page 4

Selecting Optimal Subsets of Features

Idea: select the most relevant subset of features, that is, a small subset that still provides high classification accuracy.

Feature selection is an effective technique for reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving comprehensibility.

Algorithms for feature selection (FS) fall into two categories:

Filter methods
Wrapper methods

Page 5

Filter Methods

Filter methods select a subset of features without involving any learning algorithm. Therefore,

FS is a preprocessing step before induction. The FS algorithm and the learning algorithm do not interact. Filter methods do not inherit any bias of the learning algorithm.

Example: FOCUS algorithm (exhaustive search on all feature subsets) [Almuallim & Dietterich 1991]

Page 6

Wrapper Methods

Wrapper methods search through the space of feature subsets, using the estimated accuracy of an induction algorithm as the measure of goodness for a particular subset of features. Therefore,

Algorithms using wrapper methods tend to be computationally more expensive than their filter counterparts. A predetermined learning algorithm is needed to measure the performance of a wrapper algorithm. Algorithms using the wrapper method tend to perform better than those using filter methods.

Example: HITON algorithm (discussed in detail later)
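To make the wrapper idea concrete, here is a minimal Python sketch (assuming scikit-learn is available; the estimator, toy data, and the `wrapper_score` name are illustrative, not from the papers): a candidate subset is scored by the cross-validated accuracy of a fixed induction algorithm restricted to those features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def wrapper_score(X, y, subset, estimator=None, cv=5):
    """Score a candidate feature subset (a list of column indices) by the
    mean cross-validated accuracy of the induction algorithm on those columns."""
    estimator = estimator if estimator is not None else GaussianNB()
    return cross_val_score(estimator, X[:, subset], y, cv=cv).mean()

# Toy usage: only features 0 and 3 actually determine the class.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
print(wrapper_score(X, y, [0, 3]))      # high accuracy from the relevant pair
print(wrapper_score(X, y, [5, 6, 7]))   # near-chance from irrelevant features
```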

Page 7

Using Feature Selection for Good Classification

Let F be the complete feature vector taking values f for one example, and let C be the class random variable taking values c. Then

Pr(C = c | F = f)

is the probability that the class is c, given that the feature values are f.

Now consider the reduced feature space. Let G be a subset of F taking values f_G (the projection of f onto G). We want to choose G such that

Pr(C = c | F_G = f_G)

is as close to Pr(C = c | F = f) as possible.

Page 8

Information-Theoretic Measure of Closeness of Distributions

Let µ and σ be two distributions over some probability space Ω. Then the cross-entropy from µ to σ is defined as:

D(µ, σ) = Σ_{x ∈ Ω} µ(x) log( µ(x) / σ(x) )

This is also known as the Kullback-Leibler (KL) distance. Intuitively, it is a distance function from a "true" probability distribution µ to a "guessed" probability distribution σ.

So, in the feature subset selection problem, we want:
Pr(C | F = f) to be µ ("true")
Pr(C | F_G = f_G) to be σ ("guessed")
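As a quick illustration, a direct transcription of this definition into Python (a sketch, assuming both distributions are given as dicts over a finite Ω, with σ nonzero wherever µ is):

```python
import math

def kl_divergence(mu, sigma):
    """D(mu, sigma) = sum over x in Omega of mu(x) * log(mu(x) / sigma(x)).
    Distributions are dicts over a finite Omega; terms with mu(x) = 0
    contribute zero by the usual convention."""
    return sum(p * math.log(p / sigma[x]) for x, p in mu.items() if p > 0)

# Distance from a fair coin ("true") to a biased guess:
print(kl_divergence({"H": 0.5, "T": 0.5}, {"H": 0.9, "T": 0.1}))  # ~0.51
```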

Page 9

Information-Theoretic Approach (Cont.)

Thus, we want to find a feature subset G such that

Δ_G = Σ_f Pr(F = f) · D( Pr(C | F = f), Pr(C | F_G = f_G) )

is close to zero.

Note that this computation requires knowledge of the conditional distributions of C given F and of C given F_G.
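For intuition only, a brute-force sketch of Δ_G on a toy joint distribution (this presumes the full joint Pr(C, F) is available as a dict, which is exactly what is infeasible in practice; `kl_divergence` is the function sketched above):

```python
from collections import defaultdict

def delta_G(joint, G):
    """Brute-force Delta_G for a feature subset G (a tuple of feature indices).
    `joint` maps (c, f) -- a class value and a full feature-value tuple --
    to Pr(C = c, F = f).  Exponential in |F|, so illustration only."""
    classes = {c for (c, _) in joint}
    pr_f = defaultdict(float)       # Pr(F = f)
    pr_fg = defaultdict(float)      # Pr(F_G = f_G)
    pr_cfg = defaultdict(float)     # Pr(C = c, F_G = f_G)
    for (c, f), p in joint.items():
        fg = tuple(f[i] for i in G)
        pr_f[f] += p
        pr_fg[fg] += p
        pr_cfg[(c, fg)] += p
    total = 0.0
    for f, pf in pr_f.items():
        fg = tuple(f[i] for i in G)
        mu = {c: joint.get((c, f), 0.0) / pf for c in classes}     # Pr(C | F = f)
        sigma = {c: pr_cfg[(c, fg)] / pr_fg[fg] for c in classes}  # Pr(C | F_G = f_G)
        total += pf * kl_divergence(mu, sigma)
    return total
```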

Page 10

Difficulties of this Approach

We only get to observe a small sample of examples, which makes it hard to approximate the true distributions needed to calculate the relative error.

It is impractical to compute the error Δ_G exactly: it requires a number of computations exponential in the number of features in the domain.

We need an alternative to computing these conditional distributions.

Page 11

Outline

Motivation

Introduction to Bayesian Networks and Markov Blankets

Markov Blanket Discovery algorithms

IAMB algorithm and results

HITON algorithm and results

Page 12

Bayesian Networks (BNs)

BNs are also known as Belief networks

A BN is a directed acyclic graph where nodes are random variables and edges represent direct dependencies between variables.

BNs are used for inference: given observations of some nodes, one wants to know the probability distribution of other nodes.

Inference methods for BN can be classified into two categories: exact reasoning and sampling.

Page 13

An Example (Pearl ’88)

[BN diagram: Burglary → Alarm ← Earthquake; Alarm → Daughter Calls and Alarm → Watson Calls; Earthquake → News Report. Edges encode causal relationships.]

Each node must encode information exponential in its number of parents, as a Conditional Probability Table.

Pr(Burglary | Alarm, Report)?
Pr(Alarm | ¬Burglary)?

Page 14

Some Properties of BNs

A node is independent of its non-descendants given its parents

A node is independent of all other nodes, given its Markov blanket. So, what is a Markov blanket?

Page 15

Markov Blanket (MB)

The Markov blanket of a node x, written MB(x), is the set consisting of its parents, its children, and its children's other parents (spouses).

More formally: let N be the set of all nodes and let M be a set of nodes not containing x. Then M is a Markov blanket for x if x is conditionally independent of N − M − {x} given M, and M is minimal with this property.
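Reading a blanket off a known DAG is mechanical; a small sketch follows (the encoding of the graph as a parents dict, and the edge directions for Pearl's example, are assumptions made for illustration):

```python
def markov_blanket(parents, x):
    """MB(x) = parents of x, children of x, and the children's other
    parents (spouses).  `parents` maps each node to a list of its parents."""
    children = [n for n, ps in parents.items() if x in ps]
    spouses = {p for ch in children for p in parents[ch] if p != x}
    return set(parents[x]) | set(children) | spouses

# One plausible encoding of Pearl's example from the earlier slide:
parents = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "DaughterCalls": ["Alarm"], "WatsonCalls": ["Alarm"],
    "NewsReport": ["Earthquake"],
}
print(markov_blanket(parents, "Burglary"))  # {'Alarm', 'Earthquake'}
```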

Page 16

Why are MBs interesting?

MBs help in studying how an attribute x “behaves” under the effect of other attributes in the domain, by providing ‘shielding’ information.

MBs can help determine causal relationships among various nodes in a BN.

They can help determine the structure of a BN given just its nodes.

They can be used in finding good feature subsets.

Page 17

Connection between MBs and Feature Selection

[Diagram: feature Fx in a BN over the features, with parents U1…Um, children Y1…Yn, and the children's other parents Z1j…Znj.]

Feature Fx is independent of all other features given MB(Fx) = {U1, …, Um} ∪ {Y1, …, Yn} ∪ {Z1j, …, Znj}.

Fx gives no extra information about the rest of the network given its Markov blanket.

Page 18

Using Markov Blankets for Better Feature Selection

Algorithm idea: if we can find a Markov blanket for a feature Fi within the current feature set, remove Fi. Return the remaining features as the minimal set.

But if we remove feature Fi based on a Markov blanket M, we might later remove some other feature Fj ∈ M. Could the removal of Fj make Fi relevant again?

Theorem: Let G be the current set of features, and assume that Fi ∉ G has a MB within G. Let Fj ∈ G be some feature which we are about to remove. Then Fi also has a MB within G − {Fj}.

[Koller & Sahami, ’96]

Page 19

Implication of the theorem

Using Markov blankets for feature elimination has desirable properties:

We can eliminate a conditionally independent feature Fi without increasing our distance Δ_G from the desired distribution. The Markov blanket criterion removes only attributes that are really unnecessary: attributes that are irrelevant to the target concept, and attributes that are redundant given other attributes.

[Koller & Sahami, '96]

Page 20

Outline

Motivation

Introduction to Bayesian Networks and Markov Blankets

Markov Blanket Discovery algorithms

IAMB algorithm and results

HITON algorithm and results

Page 21

Markov Blanket Discovery

Some early approaches:

KS (Koller-Sahami) algorithm ('96)
For each feature Fi in G, let Mi be the set of K features Fj in G – Fi for which the expected cross-entropy is minimal. Compute the error Δ_G of (Fi | Mi) for each i, and choose the i for which this quantity is minimal.

GS (Grow-Shrink) algorithm (Margaritis and Thrun '99)
Statically orders the variables according to the strength of their association with T. Thus it has the limitations of employing a potentially inefficient heuristic.

PC (Spirtes et al. '00)
A BN learning algorithm, that is, it learns the whole network. It starts with a fully connected BN graph and removes redundant edges until a sound BN remains. The MB can then be read off from the resulting network.

Page 22

Outline

Motivation

Introduction to Bayesian Networks and Markov Blankets

Markov Blanket Discovery algorithms

IAMB algorithm and results

HITON algorithm and results

Page 23

IAMB algorithm (Tsamardinos, et al.)

[Figure: IAMB pseudocode, not reproduced. CMB denotes the current estimate of the Markov blanket.]

A heuristic approach for finding the Markov blanket of a target variable T.

Conditional independence test: X and Y are judged independent given CMB when
Pr(X, Y | CMB) = Pr(X | CMB) Pr(Y | CMB)

Page 24

IAMB algorithm

IAMB is an abbreviation for Incremental Association Markov Blanket. As we can see, it is a two-phase algorithm.

Growing phase: adds variables that are part of MB(T) – plus possibly more, i.e., false positives.

Shrinking phase: removes the false positives.

Result: Markov blanket for a particular variable T
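A minimal Python sketch of the two phases just described (not the authors' exact code): the association measure and the conditional-independence test — e.g., checking Pr(X, Y | CMB) = Pr(X | CMB) Pr(Y | CMB) on the data, as on the previous slide — are passed in as black-box functions, since their implementation is data-dependent.

```python
def iamb(variables, T, assoc, indep):
    """Sketch of IAMB for a target variable T.
    assoc(X, T, CMB): strength of association between X and T given CMB,
                      e.g., the conditional mutual information of the next slide.
    indep(X, T, CMB): True iff the data judge X independent of T given CMB."""
    cmb = set()
    # Growing phase: admit the candidate most associated with T given the
    # current blanket; stop once the best remaining candidate tests independent.
    while True:
        candidates = [X for X in variables if X != T and X not in cmb]
        if not candidates:
            break
        best = max(candidates, key=lambda X: assoc(X, T, cmb))
        if indep(best, T, cmb):
            break
        cmb.add(best)          # may admit false positives
    # Shrinking phase: drop anything independent of T given the rest of CMB,
    # which removes exactly those false positives.
    for X in list(cmb):
        if indep(X, T, cmb - {X}):
            cmb.remove(X)
    return cmb
```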

Page 25

IAMB algorithm

Heuristic to identify potential Markov blanket members: include the variable that maximizes a heuristic function f(X; T | CMB). The function f should be non-zero for every X that is a member of the Markov blanket of T. Typically it is a measure of association between X and T given CMB. The authors use mutual information for f:

f(X; T | CMB) = H(X | CMB) − H(X | T, CMB)

The information T tells us about X is the reduction in uncertainty about X due to knowledge of T, given CMB. Computationally, each pass takes time linear in the number of variables.
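A plug-in estimate of this quantity from discrete sample data might look as follows (a sketch; the function name and input format are illustrative assumptions):

```python
import math
from collections import Counter

def cond_mutual_info(xs, ts, zs):
    """Plug-in estimate of f(X; T | CMB) = H(X | CMB) - H(X | T, CMB),
    i.e., the conditional mutual information I(X; T | CMB), from samples.
    xs, ts: sequences of discrete values; zs: sequence of CMB value tuples.
    (Sketch only: no smoothing; counts must be large enough to trust.)"""
    n = len(xs)
    c_xz, c_tz, c_xtz, c_z = Counter(), Counter(), Counter(), Counter()
    for x, t, z in zip(xs, ts, zs):
        c_xz[(x, z)] += 1
        c_tz[(t, z)] += 1
        c_xtz[(x, t, z)] += 1
        c_z[z] += 1
    # I(X;T|Z) = sum over (x,t,z) of p(x,t,z) * log[ p(z)p(x,t,z) / (p(x,z)p(t,z)) ]
    return sum((c / n) * math.log((c * c_z[z]) / (c_xz[(x, z)] * c_tz[(t, z)]))
               for (x, t, z), c in c_xtz.items())
```

With such an estimator, the growing phase can declare independence when the estimate falls below a threshold, as in the Mutual-Info(X; T | CMB) < threshold criterion used in the evaluation below.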

Page 26

IAMB Variants

The authors also present some variations on the IAMB algorithm.

InterIAMB: interleaves the growing phase of IAMB (phase I) with the shrinking phase (phase II), attempting to keep the size of MB(T) as small as possible during all steps of the algorithm's execution.

IAMBnPC: substitutes the PC algorithm for the shrinking phase (phase II) as implemented in IAMB.

InterIAMBnPC: combines the two approaches above to reduce the size of the conditioning sets.

Page 27

Results

Data-sets:

Real-world Bayesian networks:
ALARM network (Beinlich, et al. '89) – BN used in the medical domain, with 37 variables.
Hailfinder (Abramson, et al. '96) – BN used for modeling weather, with 56 variables.

Randomly generated BNs:
BN with 50 nodes
BN with 200 nodes
BN with 1000 nodes
(0–10 parents chosen randomly for each node)

Page 28

Results

Evaluation metric:

Area under the ROC Curve (explained next). Threshold parameters:

PC: significance level of the G² statistical test.
GS / IAMB variants: Mutual-Info(X; T | CMB) < threshold.
KS: all possible values of the parameter k.

Page 29

Receiver Operating Characteristic (ROC) Curve

An ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) for the different possible thresholds of a diagnostic test.

Page 30

ROC Curve (Cont’d)

A ROC curve demonstrates several things:

It shows the tradeoff between TPR and FPR.

The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.

The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

The area under the curve (AUC) is a measure of test accuracy: the higher, the better.
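To make the metric concrete, a short numpy sketch that traces an ROC curve and computes its AUC from scores and binary labels (illustrative code, not the evaluation scripts used by the authors):

```python
import numpy as np

def roc_auc(scores, labels):
    """Trace the ROC curve by sweeping a threshold down through the scores
    (assumes distinct scores; no tie handling) and return (fpr, tpr, auc)."""
    order = np.argsort(-np.asarray(scores))        # descending by score
    labels = np.asarray(labels)[order]
    tps = np.cumsum(labels)                        # true positives at each cut
    fps = np.cumsum(1 - labels)                    # false positives at each cut
    tpr = np.concatenate(([0.0], tps / labels.sum()))
    fpr = np.concatenate(([0.0], fps / (1 - labels).sum()))
    return fpr, tpr, np.trapz(tpr, fpr)            # area by the trapezoid rule

fpr, tpr, auc = roc_auc([0.9, 0.8, 0.7, 0.4, 0.3], [1, 1, 0, 1, 0])
print(round(auc, 2))   # 0.83 -- well above the 0.5 of the chance diagonal
```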

Page 31

Results


Page 32

Results


Page 33

Outline

Motivation

Introduction to Bayesian Networks and Markov Blankets

Markov Blanket Discovery algorithms

IAMB algorithm and results

HITON algorithm and results

Page 34

HITON algorithm (Aliferis, et al.)

Uses MB discovery technique for feature selection.

Algorithm:
Identify the Markov blanket of the target T given the data D.
Use wrapping to remove variables that are unnecessary for predicting the target T, given algorithm A.
Return the minimal set (for predicting target T using algorithm A).
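In outline, this structure might be sketched as follows (a hedged sketch: `discover_mb` and `wrapper_score` stand in for the MB-discovery step and the cross-validated classifier scoring, both assumed black boxes here; `wrapper_score` could be the cross-validation routine sketched earlier):

```python
def hiton_wrapper(discover_mb, wrapper_score, T, data):
    """Sketch of HITON's overall structure, following the steps listed above:
    find MB(T), then greedily drop variables whose removal does not hurt the
    classifier's estimated performance, and return the reduced set."""
    selected = set(discover_mb(T, data))              # step 1: MB(T) from data D
    best = wrapper_score(selected, T, data)
    improved = True
    while improved:                                   # step 2: backward elimination
        improved = False
        for v in sorted(selected):
            score = wrapper_score(selected - {v}, T, data)
            if score >= best:                         # v unnecessary for predicting T
                selected.remove(v)
                best = score
                improved = True
    return selected                                   # step 3: the minimal set
```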

Page 35

HITON algorithm

Page 36

Page 37

HITON algorithm

Aim: Good variable selection with the given classification algorithm, i.e., HITON employs a wrapper approach.

First identify MB(T), then remove any variables not required for classification by the given classifier.

Page 38

Results (Data-Sets)

A variety of biomedical tasks with different characteristics

Page 39

Results

Evaluation metric: area under the ROC curve.

Results:
HITON consistently produces the smallest variable sets.
It exhibits the best classification performance and the maximum variable reduction.

Page 40

Results

Page 41

Questions / Discussion

Page 42

References

[1] Aliferis, C., Tsamardinos, I., Statnikov, A. (2003) HITON, A Novel Markov Blanket Algorithm for Optimal Variable Selection. Technical Report DSL-03-08, Vanderbilt University.

[2] Bai X., et al. (2004) PCX: Markov Blanket Classification for Large Data Sets with Few Cases. CMU-CALD.

[3] Koller D., Sahami M. (1996) Toward Optimal Feature Selection. International Conference on Machine Learning, pp. 284-292.

[4] Tsamardinos, I., Aliferis, C., Statnikov A. (2003) Algorithms for Large Scale Markov Blanket Discovery. The 16th International FLAIRS Conference, St. Augustine, Florida, USA.

[5] Yaramakala, S. (2004) Fast Markov Blanket Discovery. MS thesis, Iowa State University.

[6] Yu, L., Liu, H. (2003) Feature Selection for High Dimensional Data: A Fast Correlation-Based Filter Solution. ICML 2003.