HAL Id: hal-01153255
https://hal.archives-ouvertes.fr/hal-01153255
Submitted on 19 May 2015
Distributed under a Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International License
A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
Maxime Gasse, Alex Aussem, Haytham Elghazel
To cite this version: Maxime Gasse, Alex Aussem, Haytham Elghazel. A hybrid algorithm for Bayesian network structure learning with application to multi-label learning. Expert Systems with Applications, Elsevier, 2014, 41 (15), pp. 6755-6772. doi:10.1016/j.eswa.2014.04.032. hal-01153255
A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
Maxime Gasse, Alex Aussem∗, Haytham Elghazel
Université de Lyon, CNRS
Université Lyon 1, LIRIS, UMR5205, F-69622, France
Abstract
We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. We conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structure returned by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in terms of goodness of fit to new data and quality of the network structure with respect to the true dependence structure of the data. Second, we investigate H2PC's ability to solve the multi-label learning problem. We provide theoretical results to characterize and identify graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering different application domains. Overall, our experiments support the conclusion that local structure learning with H2PC, in the form of local neighborhood induction, is a theoretically well-motivated and empirically effective learning framework that is well suited to multi-label learning. The source code (in R) of H2PC as well as all data sets used for the empirical tests are publicly available.
Keywords: Bayesian networks, Multi-label learning, Markov boundary, Feature subset selection.
1. Introduction
A Bayesian network (BN) is a probabilistic model formed by a structure and parameters. The structure of a BN is a directed acyclic graph (DAG), whilst its parameters are conditional probability distributions associated with the variables in the model. The problem of finding the DAG that encodes the conditional independencies present in the data has attracted a great deal of interest over the last years (Rodrigues de Morais & Aussem, 2010a; Scutari, 2010; Scutari & Brogini, 2012; Kojima et al., 2010; Perrier et al., 2008; Villanueva & Maciel, 2012; Pena, 2012; Gasse et al., 2012). The inferred DAG is very useful for many applications, including feature selection (Aliferis et al., 2010; Pena et al., 2007; Rodrigues de Morais & Aussem, 2010b), causal relationships inference from observational data (Ellis & Wong, 2008; Aliferis et al., 2010; Aussem et al., 2012, 2010; Prestat et al., 2013; Cawley, 2008; Brown & Tsamardinos, 2008) and, more recently, multi-label learning (Dembczyński et al., 2012; Zhang & Zhang, 2010; Guo & Gu, 2011).

∗Corresponding author. Email address: [email protected] (Alex Aussem)
Ideally the DAG should coincide with the dependence structure of the global distribution, or it should at least identify a distribution as close as possible to the correct one in the probability space. This step, called structure learning, is similar in approaches and terminology to model selection procedures for classical statistical models. Basically, constraint-based (CB) learning methods systematically check the data for conditional independence relationships and use them as constraints to construct a partially oriented graph representative of a BN equivalence class, whilst search-and-score (SS) methods make use of a goodness-of-fit score function for evaluating graphical structures with regard to the data set. Hybrid methods attempt to get the best of both worlds: they learn a skeleton with a CB approach and constrain the DAGs considered during the SS phase.
In this study, we present a novel hybrid algorithm for Bayesian network structure learning, called H2PC¹. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on a divide-and-conquer constraint-based subroutine, called HPC, to learn the local structure around a target variable. HPC may be thought of as a way to compensate for the large number of false negatives at the output of the weak PC learner, by performing extra computations. As this may arise at the expense of the number of false positives, we control the expected proportion of false discoveries (i.e., false positive nodes) among all the discoveries made in PC_T. We use a modification of the Incremental Association Markov Boundary algorithm (IAMB), initially developed by Tsamardinos et al. in (Tsamardinos et al., 2003) and later modified by Jose Pena in (Pena, 2008) to control the FDR of edges when learning Bayesian network models. HPC scales to thousands of variables and can deal with many fewer samples (n < q). To illustrate its performance by means of empirical evidence, we conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for BN structure learning (Tsamardinos et al., 2006), using well-known BN benchmarks with various data sizes, to assess the goodness of fit to new data as well as the quality of the network structure with respect to the true dependence structure of the data.
We then address a real application of H2PC where the true dependence structure is unknown. More specifically, we investigate H2PC's ability to encode the joint distribution of the label set conditioned on the input features in the multi-label classification (MLC) problem. Many challenging applications, such as photo and video annotation and web page categorization, can benefit from being formulated as MLC tasks with a large number of categories (Dembczyński et al., 2012; Read et al., 2009; Madjarov et al., 2012; Kocev et al., 2007; Tsoumakas et al., 2010b). Recent research in MLC focuses on the exploitation of the label conditional dependency in order to better predict the label combination for each example. We show that local BN structure discovery methods offer an elegant and powerful approach to solve this problem. We establish two theorems (Theorems 6 and 7) linking the concepts of marginal Markov boundaries, joint Markov boundaries and so-called label powersets under the faithfulness assumption. These theorems offer a simple guideline to characterize graphically: i) the minimal label powerset decomposition, i.e., into minimal subsets Y_LP ⊆ Y such that Y_LP ⊥⊥ Y \ Y_LP | X, and ii) the minimal subset of features, w.r.t. an information-theoretic criterion, needed to predict each label powerset, thereby reducing the input space and the computational burden of the multi-label classification. To solve the MLC problem with BNs, the DAG obtained from the data plays a pivotal role. So in this second series of experiments, we assess the comparative ability of H2PC and MMHC to encode the label dependency structure by means of an indirect goodness-of-fit indicator, namely the 0/1 loss function, which makes sense in the MLC context.

¹A first version of H2PC without FDR control has been discussed in a paper that appeared in the Proceedings of ECML-PKDD, pages 58-73, 2012.
The rest of the paper is organized as follows: In Section 2, we review the theory of BNs and discuss the main BN structure learning strategies. We then present the H2PC algorithm in detail in Section 3. Section 4 evaluates our proposed method and shows results for several tasks involving artificial data sampled from known BNs. Then we report, in Section 5, on our experiments on real-world data sets in a multi-label learning context, so as to provide empirical support for the proposed methodology. The main theoretical results appear formally as two theorems (Theorems 6 and 7) in Section 5. Their proofs are established in the Appendix. Finally, Section 6 raises several issues for future work, and we conclude in Section 7 with a summary of our contribution.
2. Preliminaries
We define next some key concepts used along the paper and state some results that will support our analysis. In this paper, upper-case letters in italics denote random variables (e.g., X, Y) and lower-case letters in italics denote their values (e.g., x, y). Upper-case bold letters denote random variable sets (e.g., X, Y, Z) and lower-case bold letters denote their values (e.g., x, y). We denote by X ⊥⊥ Y | Z the conditional independence between X and Y given the set of variables Z. To keep the notation uncluttered, we use p(y | x) to denote p(Y = y | X = x).
2.1. Bayesian networks
Formally, a BN is a tuple <G, P>, where G = <U, E> is a directed acyclic graph (DAG) with nodes representing the random variables U, and P a joint probability distribution on U. In addition, G and P must satisfy the Markov condition: every variable X ∈ U is independent of any subset of its non-descendant variables conditioned on the set of its parents, denoted by Pa_i^G. From the Markov condition, it is easy to prove (Neapolitan, 2004) that the joint probability distribution P on the variables in U can be factored as follows:

P(U) = P(X_1, . . . , X_n) = ∏_{i=1}^{n} P(X_i | Pa_i^G)    (1)
Equation 1 allows a parsimonious decomposition of the joint distribution P. It enables us to reduce the problem of determining a huge number of probability values to that of determining relatively few.
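As a concrete illustration of Equation 1, the sketch below factorizes a hypothetical three-variable network A → B ← C. The CPT values are invented for illustration only; they do not come from the paper or its benchmarks.

```python
# Hypothetical CPTs for the DAG A -> B <- C (values are made up).
p_A = {0: 0.6, 1: 0.4}
p_C = {0: 0.7, 1: 0.3}
# p(B | A, C), indexed by the parent configuration (a, c)
p_B_given = {(0, 0): {0: 0.9, 1: 0.1},
             (0, 1): {0: 0.5, 1: 0.5},
             (1, 0): {0: 0.4, 1: 0.6},
             (1, 1): {0: 0.2, 1: 0.8}}

def joint(a, b, c):
    # Equation 1: P(A, B, C) = P(A) * P(C) * P(B | A, C),
    # since A and C have no parents and B's parents are {A, C}.
    return p_A[a] * p_C[c] * p_B_given[(a, c)][b]

# The factored distribution sums to 1 over the 8 configurations,
# even though we only specified 2 + 2 + 4 local distributions.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Only 8 independent parameters are stored here instead of the 2³ − 1 entries of an unconstrained joint table; the saving grows exponentially with the number of variables.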
A BN structure G entails a set of conditional independence assumptions. They can all be identified by the d-separation criterion (Pearl, 1988). We use X ⊥G Y | Z to denote the assertion that X is d-separated from Y given Z in G. Formally, X ⊥G Y | Z is true when for every undirected path in G between X and Y, there exists a node W in the path such that either (1) W does not have two parents in the path and W ∈ Z, or (2) W has two parents in the path and neither W nor its descendants is in Z. If <G, P> is a BN, X ⊥P Y | Z if X ⊥G Y | Z. The converse does not necessarily hold. We say that <G, P> satisfies the faithfulness condition if the d-separations in G identify all and only the conditional independencies in P, i.e., X ⊥P Y | Z iff X ⊥G Y | Z.
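The d-separation criterion can be tested mechanically. The sketch below uses the standard ancestral moral graph construction (equivalent to the path-based definition above, though not the formulation the paper uses); the graph encoding {child: set of parents} is our own convention.

```python
def ancestors(dag, nodes):
    """Return `nodes` together with all their ancestors in a DAG
    encoded as {child: set_of_parents}."""
    result, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in result:
            result.add(n)
            stack.extend(dag.get(n, ()))
    return result

def d_separated(dag, x, y, z):
    """Test X d-separated from Y given Z: keep only ancestors of {x, y} ∪ z,
    moralize (marry co-parents, drop directions), then check that every
    path from x to y passes through z."""
    z = set(z)
    keep = ancestors(dag, {x, y} | z)
    adj = {n: set() for n in keep}
    for child in keep:
        ps = [p for p in dag.get(child, ()) if p in keep]
        for p in ps:                      # undirected parent-child edges
            adj[child].add(p)
            adj[p].add(child)
        for a in ps:                      # moralization: marry co-parents
            for b in ps:
                if a != b:
                    adj[a].add(b)
    seen, stack = {x}, [x]                # search that never enters z
    while stack:
        n = stack.pop()
        if n == y:
            return False                  # x and y still connected
        for m in adj[n] - seen - z:
            seen.add(m)
            stack.append(m)
    return True
```

For instance, in the collider X → Y ← Z, X and Z are d-separated by the empty set but not once Y is observed, matching condition (2) above; in the chain X → Y → Z the situation is reversed.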
A Markov blanket M_T of T is any set of variables such that T is conditionally independent of all the remaining variables given M_T. By extension, a Markov blanket of T in V guarantees that M_T ⊆ V, and that T is conditionally independent of the remaining variables in V, given M_T. A Markov boundary, MB_T, of T is any Markov blanket such that none of its proper subsets is a Markov blanket of T.
We denote by PC_T^G the set of parents and children of T in G, and by SP_T^G the set of spouses of T in G. The spouses of T are the variables that have common children with T. These sets are unique for all G such that <G, P> satisfies the faithfulness condition, and so we will drop the superscript G. We denote by dSep(X) the set that d-separates X from the (implicit) target T.
Theorem 1. Suppose <G, P> satisfies the faithfulness condition. Then X and Y are not adjacent in G iff ∃Z ⊆ U \ {X, Y} such that X ⊥⊥ Y | Z. Moreover, MB_X = PC_X ∪ SP_X.
A proof can be found, for instance, in (Neapolitan, 2004).
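Under faithfulness, Theorem 1 lets us read MB_T directly off the graph as parents ∪ children ∪ spouses. A minimal sketch, using the same {child: set of parents} encoding as our other illustrations (our convention, not the paper's):

```python
def markov_boundary(dag, t):
    """MB_T = PC_T ∪ SP_T: parents, children and spouses of t,
    read off a DAG encoded as {child: set_of_parents}."""
    parents = set(dag.get(t, set()))
    children = {c for c, ps in dag.items() if t in ps}
    spouses = {p for c in children for p in dag[c]} - {t}
    return parents | children | spouses
```

For example, in the DAG X → T → Y ← S, the Markov boundary of T is {X, Y, S}: the spouse S enters MB_T even though it is marginally independent of T.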
Two graphs are said to be equivalent iff they encode the same set of conditional independencies via the d-separation criterion. The equivalence class of a DAG G is the set of DAGs that are equivalent to G. The next result, established by Pearl (1988), shows that equivalent graphs have the same undirected graph but might disagree on the direction of some of the arcs.
Theorem 2. Two DAGs are equivalent iff they have the same underlying undirected graph and the same set of v-structures (i.e., converging edges into the same node, such as X → Y ← Z).
Moreover, an equivalence class of network structures can be uniquely represented by a partially directed DAG (PDAG), also called a DAG pattern. The DAG pattern is defined as the graph that has the same links as the DAGs in the equivalence class and has oriented all and only the edges common to the DAGs in the equivalence class. A structure learning algorithm from data is said to be correct (or sound) if it returns the correct DAG pattern (or a DAG in the correct equivalence class) under the assumptions that the independence tests are reliable and that the learning database is a sample from a distribution P faithful to a DAG G. The (ideal) assumption that the independence tests are reliable means that they decide (in)dependence iff the (in)dependence holds in P.
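Theorem 2 yields a direct procedure for testing Markov equivalence: compare skeletons and unshielded colliders. A sketch under the same {child: set of parents} encoding used in our earlier illustrations (our convention, not the paper's):

```python
def skeleton(dag):
    """Undirected edge set of a DAG encoded as {child: set_of_parents}."""
    return {frozenset((p, c)) for c, ps in dag.items() for p in ps}

def v_structures(dag):
    """Unshielded colliders A -> C <- B with A and B non-adjacent."""
    skel = skeleton(dag)
    vs = set()
    for child, parents in dag.items():
        for a in parents:
            for b in parents:
                if a < b and frozenset((a, b)) not in skel:
                    vs.add((a, child, b))
    return vs

def markov_equivalent(g1, g2):
    # Theorem 2: equivalent iff same skeleton and same v-structures.
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)
```

The chains X → Y → Z and Z → Y → X are equivalent (same skeleton, no v-structure), while the collider X → Y ← Z shares their skeleton but not their v-structures, so it lies in a different equivalence class.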
2.2. Conditional independence properties
The following three theorems, borrowed from Pena et al. (2007), are proven in Pearl (1988):
Theorem 3. Let X, Y, Z and W denote four mutually disjoint subsets of U. Any probability distribution p satisfies the following four properties: Symmetry, X ⊥⊥ Y | Z ⇒ Y ⊥⊥ X | Z; Decomposition, X ⊥⊥ (Y ∪ W) | Z ⇒ X ⊥⊥ Y | Z; Weak Union, X ⊥⊥ (Y ∪ W) | Z ⇒ X ⊥⊥ Y | (Z ∪ W); and Contraction, X ⊥⊥ Y | (Z ∪ W) ∧ X ⊥⊥ W | Z ⇒ X ⊥⊥ (Y ∪ W) | Z.
Theorem 4. If p is strictly positive, then p satisfies the previous four properties plus the Intersection property: X ⊥⊥ Y | (Z ∪ W) ∧ X ⊥⊥ W | (Z ∪ Y) ⇒ X ⊥⊥ (Y ∪ W) | Z. Additionally, each X ∈ U has a unique Markov boundary MB_X.
Theorem 5. If p is faithful to a DAG G, then p satisfies the previous five properties plus the Composition property, X ⊥⊥ Y | Z ∧ X ⊥⊥ W | Z ⇒ X ⊥⊥ (Y ∪ W) | Z, and the local Markov property, X ⊥⊥ (ND_X \ Pa_X) | Pa_X for each X ∈ U, where ND_X denotes the non-descendants of X in G.
2.3. Structure learning strategies
The number of DAGs G is super-exponential in the number of random variables in the domain, and the problem of learning the most probable a posteriori BN from data is worst-case NP-hard (Chickering et al., 2004). One needs to resort to heuristic methods in order to be able to solve very large problems effectively.
Both CB and SS heuristic approaches have advantages and disadvantages. CB approaches are relatively quick, deterministic, and have a well-defined stopping criterion; however, they rely on an arbitrary significance level to test for independence, and they can be unstable in the sense that an error early on in the search can have a cascading effect that causes many errors to be present in the final graph. SS approaches have the advantage of being able to flexibly incorporate users' background knowledge in the form of prior probabilities over the structures, and are also capable of dealing with incomplete records in the database (e.g., EM techniques). Although SS methods are favored in practice when dealing with small dimensional data sets, they are slow to converge and the computational complexity often prevents us from finding optimal BN structures (Perrier et al., 2008; Kojima et al., 2010). With currently available exact algorithms (Koivisto & Sood, 2004; Silander & Myllymaki, 2006; Cussens & Bartlett, 2013; Studeny & Haws, 2014) and a decomposable score like BDeu, the computational complexity remains exponential, and therefore such algorithms are intractable for BNs with more than around 30 vertices on current workstations. For larger sets of variables the computational burden becomes prohibitive and restrictions on the structure have to be imposed, such as a limit on the size of the parent sets. With this in mind, the ability to restrict the search locally around the target variable is a key advantage of CB methods over SS methods. They are able to construct a local graph around the target node without having to construct the whole BN first, hence their scalability (Pena et al., 2007; Rodrigues de Morais & Aussem, 2010b,a; Tsamardinos et al., 2006; Pena, 2008).
With a view to balancing the computation cost with the desired accuracy of the estimates, several hybrid methods have been proposed recently. Tsamardinos et al. proposed in (Tsamardinos et al., 2006) the Max-Min Hill-Climbing (MMHC) algorithm and conducted one of the most extensive empirical comparisons performed in recent years, showing that MMHC was the fastest and the most accurate method in terms of structural error based on the structural Hamming distance. More specifically, MMHC outperformed, both in terms of time efficiency and quality of reconstruction, the PC algorithm (Spirtes et al., 2000), the Sparse Candidate (Friedman et al., 1999b), the Three Phase Dependency Analysis (Cheng et al., 2002), the Optimal Reinsertion (Moore & Wong, 2003), the Greedy Equivalence Search (Chickering, 2002), and the Greedy Hill-Climbing Search, on a variety of networks, sample sizes, and parameter values. Although MMHC is rather heuristic by nature (it returns a local optimum of the score function), it is currently considered as the most powerful state-of-the-art algorithm for BN structure learning capable of dealing with thousands of nodes in reasonable time.
In order to enhance its performance on small dimensional data sets, Perrier et al. proposed in (Perrier et al., 2008) a hybrid algorithm that can learn an optimal BN (i.e., it converges to the true model in the sample limit) when an undirected graph is given as a structural constraint. They defined this undirected graph as a super-structure (i.e., every DAG considered in the SS phase is compelled to be a subgraph of the super-structure). This algorithm can learn optimal BNs containing up to 50 vertices when the average degree of the super-structure is around two, that is, when a sparse structural constraint is assumed. To extend its feasibility to BNs with a few hundred vertices and an average degree up to four, Kojima et al. proposed in (Kojima et al., 2010) to divide the super-structure into several clusters and perform an optimal search on each of them in order to scale up to larger networks. Despite interesting improvements in terms of score and structural Hamming distance on several benchmark BNs, they report running times about 10³ times longer than MMHC on average, which is still prohibitive.
Therefore, there is a great deal of interest in hybrid methods capable of improving the structural accuracy of both CB and SS methods on graphs containing up to thousands of vertices. However, they make the strong assumption that the skeleton (also called super-structure) contains at least the edges of the true network and as few extra edges as possible. While controlling the false discovery rate (i.e., false extra edges) in BN learning has attracted some attention recently (Armen & Tsamardinos, 2011; Pena, 2008; Tsamardinos & Brown, 2008), to our knowledge, there is no work on actively controlling the rate of false-negative errors (i.e., false missing edges).
2.4. Constraint-based structure learning
Before introducing the H2PC algorithm, we discuss the general idea behind CB methods. The induction of local or global BN structures is handled by CB methods through the identification of local neighborhoods (i.e., PC_X), hence their scalability to very high dimensional data sets. CB methods systematically check the data for conditional independence relationships in order to infer a target's neighborhood. Typically, the algorithms run either a G² or a χ² independence test when the data set is discrete, and a Fisher's Z test when it is continuous, in order to decide on dependence or independence, that is, upon the rejection or acceptance of the null hypothesis of conditional independence. Since we are limiting ourselves to discrete data, both the global and the local distributions are assumed to be multinomial, and the latter are represented as conditional probability tables. Conditional independence tests and network scores for discrete data are functions of these conditional probability tables through the observed frequencies {n_ijk; i = 1, . . . , R; j = 1, . . . , C; k = 1, . . . , L} for the random variables X and Y and all the configurations of the levels of the conditioning variables Z. We use n_i+k as shorthand for the marginal Σ_j n_ijk, and similarly for n_+jk, n_++k and n_+++ = n. We use a classic conditional independence test based on the mutual information, an information-theoretic distance measure defined as

MI(X, Y | Z) = Σ_{i=1}^{R} Σ_{j=1}^{C} Σ_{k=1}^{L} (n_ijk / n) log( n_ijk n_++k / (n_i+k n_+jk) )
It is proportional to the log-likelihood ratio test statistic G² (they differ by a factor of 2n, where n is the sample size). The asymptotic null distribution is χ² with (R − 1)(C − 1)L degrees of freedom. For a detailed analysis of their properties we refer the reader to (Agresti, 2002). The main limitation of this test is the rate of convergence to its limiting distribution, which is particularly problematic when dealing with small samples and sparse contingency tables. The decision of accepting or rejecting the null hypothesis depends implicitly upon the degrees of freedom, which increase exponentially with the number of variables in the conditioning set. Several heuristic solutions have emerged in the literature (Spirtes et al., 2000; Rodrigues de Morais & Aussem, 2010a; Tsamardinos et al., 2006; Tsamardinos & Borboudakis, 2010) to overcome some shortcomings of the asymptotic tests. In this study we use the following two heuristics, which are also used in MMHC. First, we do not perform the test MI(X, Y | Z), and assume independence, if there are not enough samples to achieve large enough power: we require that the average sample count per cell is above a user-defined parameter, set to 5 as in (Tsamardinos et al., 2006). This heuristic is called the power rule. Second, we treat as a structural zero any case where n_+jk = 0 or n_i+k = 0. For example, if n_+jk = 0, we consider y as a structurally forbidden value for Y when Z = z and we reduce C by 1 (as if we had one column less in the contingency table where Z = z). This is known as the degrees of freedom adjustment heuristic.
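The statistic and the two heuristics can be sketched in plain Python as follows. The function names and the bookkeeping are ours for illustration; this is not the bnlearn implementation used in the paper.

```python
from collections import Counter
from math import log

def g2_statistic(xs, ys, zs):
    """G² = 2n · MI(X, Y | Z) from three parallel lists of discrete
    samples, with the degrees-of-freedom adjustment heuristic: only
    levels actually observed within each stratum Z = z are counted."""
    n = len(xs)
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    mi = 0.0
    for (x, y, z), nxyz in n_xyz.items():
        mi += (nxyz / n) * log(nxyz * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)]))
    df = 0
    for z in n_z:
        r = len({x for (x, zz) in n_xz if zz == z})   # observed X levels
        c = len({y for (y, zz) in n_yz if zz == z})   # observed Y levels
        df += (r - 1) * (c - 1)                        # adjusted per stratum
    return 2.0 * n * mi, df

def power_rule_ok(n, r, c, l, hpar=5):
    """Power rule: skip the test (assume independence) unless the
    average sample count per cell reaches the threshold (5 here)."""
    return n / (r * c * l) >= hpar
```

With perfectly balanced independent counts the statistic is exactly 0; in general, comparing it against a χ² quantile with the adjusted degrees of freedom gives the accept/reject decision.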
3. The H2PC algorithm
In this section, we present our hybrid algorithm for Bayesian network structure learning, called Hybrid HPC (H2PC). It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to filter and orient the edges. It is based on a CB subroutine called HPC to learn the parents and children of a single variable. So, we shall discuss HPC first and then move to H2PC.
3.1. Parents and Children Discovery
HPC (Algorithm 1) can be viewed as an ensemble method for combining many weak PC learners in an attempt to produce a stronger PC learner. HPC is based on three subroutines: Data-Efficient Parents and Children Superset (DE-PCS), Data-Efficient Spouses Superset (DE-SPS), and Incremental Association Parents and Children with false discovery rate control (FDR-IAPC), a weak PC learner based on FDR-IAMB (Pena, 2008) that requires little computation. HPC receives a target node T, a data set D and a set of variables U as input and returns an estimation of PC_T. It is hybrid in that it combines the benefits of incremental and divide-and-conquer methods. The procedure starts by extracting a superset PCS_T of PC_T (line 1) and a superset SPS_T of SP_T (line 2) with a severe restriction on the maximum conditioning size (|Z| ≤ 2) in order to significantly increase the reliability of the tests. A first candidate PC set is then obtained by running the weak PC learner FDR-IAPC on PCS_T ∪ SPS_T (line 3). The key idea is the decentralized search at lines 4-8, which includes in the candidate PC set all variables in the superset PCS_T \ PC_T that have T in their vicinity. So, HPC may be thought of as a way to compensate for the large number of false negatives at the output of the weak PC learner, by performing extra computations. Note that, in theory, X is in the output of FDR-IAPC(Y) if and only if Y is in the output of FDR-IAPC(X). However, in practice, this may not always be true, particularly when working in high-dimensional domains (Pena, 2008). By loosening the criteria by which two nodes are said to be adjacent, the effective restrictions on the size of the neighborhood are now far less severe. The decentralized search has a significant impact on the accuracy of HPC, as we shall see in the experiments. We proved in (Rodrigues de Morais & Aussem, 2010a) that the original HPC(T) is consistent, i.e., its output converges in probability to PC_T, if the hypothesis tests are consistent. The proof also applies to the modified version presented here.
We now discuss the subroutines in more detail. FDR-IAPC (Algorithm 2) is a fast incremental method that receives a data set D and a target node T as its input and promptly returns a rough estimation of PC_T, hence the term "weak" PC learner. In this study, we use FDR-IAPC because it aims at controlling the expected proportion of false discoveries (i.e., false positive nodes in PC_T) among all the discoveries made. FDR-IAPC is a straightforward extension of the algorithm IAMBFDR developed by Jose Pena in (Pena, 2008), which is itself a modification of the incremental association Markov boundary algorithm (IAMB) (Tsamardinos et al., 2003), to control the expected proportion of false discoveries (i.e., false positive nodes) in the estimated Markov boundary. FDR-IAPC simply removes, at lines 3-6, the spouses SP_T from the estimated Markov boundary MB_T output by IAMBFDR at line 1, and returns PC_T assuming the faithfulness condition.
The subroutines DE-PCS (Algorithm 3) and DE-SPS (Algorithm 4) search a superset of PC_T and SP_T respectively, with a severe restriction on the maximum conditioning size (|Z| ≤ 1 in DE-PCS and |Z| ≤ 2 in DE-SPS) in order to significantly increase the reliability of the tests. The variable filtering has two advantages: i) it allows HPC to scale to hundreds of thousands of variables by restricting the search to a subset of relevant variables, and ii) it eliminates many true or approximate deterministic relationships that produce many false negative errors in the output of the algorithm, as explained in (Rodrigues de Morais & Aussem, 2010b,a). DE-SPS works in two steps. First, a growing phase (lines 4-8) adds the variables that are d-separated from the target but still remain associated with the target when conditioned on another variable from PCS_T. The shrinking phase (lines 9-16) discards irrelevant variables that are ancestors or descendants of a target's spouse. Pruning such irrelevant variables speeds up HPC.
3.2. Hybrid HPC (H2PC)
In this section, we discuss the SS phase. The following discussion draws strongly on (Tsamardinos et al., 2006), as the SS phase in Hybrid HPC and MMHC are exactly the same. The idea of constraining the search to improve time-efficiency first appeared in the Sparse Candidate algorithm (Friedman et al., 1999b). It results in efficiency improvements over the (unconstrained) greedy search. All recent hybrid algorithms
Algorithm 1 HPC
Require: T: target; D: data set; U: set of variables
Ensure: PC_T: parents and children of T
1: [PCS_T, dSep] ← DE-PCS(T, D, U)
2: SPS_T ← DE-SPS(T, D, U, PCS_T, dSep)
3: PC_T ← FDR-IAPC(T, D, (T ∪ PCS_T ∪ SPS_T))
4: for all X ∈ PCS_T \ PC_T do
5:   if T ∈ FDR-IAPC(X, D, (T ∪ PCS_T ∪ SPS_T)) then
6:     PC_T ← PC_T ∪ X
7:   end if
8: end for
Algorithm 2 FDR-IAPC
Require: T: target; D: data set; U: set of variables
Ensure: PC_T: parents and children of T
   * Learn the Markov boundary of T
1: MB_T ← IAMBFDR(T, D, U)
   * Remove spouses of T from MB_T
2: PC_T ← MB_T
3: for all X ∈ MB_T do
4:   if ∃Z ⊆ (MB_T \ X) such that T ⊥⊥ X | Z then
5:     PC_T ← PC_T \ X
6:   end if
7: end for
Algorithm 3 DE-PCS
Require: T: target; D: data set; U: set of variables
Ensure: PCS_T: parents and children superset of T; dSep: d-separating sets
   Phase I: remove X if T ⊥⊥ X
1: PCS_T ← U \ T
2: for all X ∈ PCS_T do
3:   if (T ⊥⊥ X) then
4:     PCS_T ← PCS_T \ X
5:     dSep(X) ← ∅
6:   end if
7: end for
   Phase II: remove X if T ⊥⊥ X | Y
8: for all X ∈ PCS_T do
9:   for all Y ∈ PCS_T \ X do
10:    if (T ⊥⊥ X | Y) then
11:      PCS_T ← PCS_T \ X
12:      dSep(X) ← Y
13:      break loop FOR
14:    end if
15:  end for
16: end for
Algorithm 4 DE-SPS
Require: T: target; D: data set; U: set of variables; PCS_T: parents and children superset of T; dSep: d-separating sets
Ensure: SPS_T: superset of the spouses of T
1: SPS_T ← ∅
2: for all X ∈ PCS_T do
3:   SPS_T^X ← ∅
4:   for all Y ∈ U \ {T ∪ PCS_T} do
5:     if ¬(T ⊥⊥ Y | dSep(Y) ∪ X) then
6:       SPS_T^X ← SPS_T^X ∪ Y
7:     end if
8:   end for
9:   for all Y ∈ SPS_T^X do
10:    for all Z ∈ SPS_T^X \ Y do
11:      if (T ⊥⊥ Y | X ∪ Z) then
12:        SPS_T^X ← SPS_T^X \ Y
13:        break loop FOR
14:      end if
15:    end for
16:  end for
17:  SPS_T ← SPS_T ∪ SPS_T^X
18: end for
Algorithm 5 Hybrid HPC
Require: D: data set; U: set of variables
Ensure: a DAG G on the variables U
1: for all pairs of nodes X, Y ∈ U do
2:   add X to PC_Y and Y to PC_X if X ∈ HPC(Y) and Y ∈ HPC(X)
3: end for
4: starting from an empty graph, perform greedy hill-climbing with operators add-edge, delete-edge, reverse-edge; only try operator add-edge X → Y if Y ∈ PC_X
build on this idea, but employ a sound algorithm for identifying the candidate parent sets. The Hybrid HPC first identifies the parents and children set of each variable, then performs a greedy hill-climbing search in the space of BNs. The search begins with an empty graph. The edge addition, deletion, or direction reversal that leads to the largest increase in score (the BDeu score with uniform prior was used) is taken, and the search continues in a similar fashion recursively. The important difference from standard greedy search is that the search is constrained to only consider adding an edge if it was discovered by HPC in the first phase. We extend the greedy search with a TABU list (Friedman et al., 1999b). The list keeps the last 100 structures explored. Instead of applying the best local change, the best local change that results in a structure not on the list is performed, in an attempt to escape local maxima. When 15 changes occur without an increase in the maximum score ever encountered during search, the algorithm terminates. The overall best scoring structure is then returned. Clearly, the more false positives the heuristic allows to enter the candidate PC sets, the more computational burden is imposed in the SS phase.
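The constrained search loop described above can be sketched as follows. The score function is left abstract (the paper uses BDeu; any score works for the sketch), and the data structures and function names are our own simplifications, not the bnlearn implementation.

```python
def edges_of(dag):
    """Directed edge set of a DAG encoded as {child: set_of_parents}."""
    return {(p, c) for c, ps in dag.items() for p in ps}

def is_ancestor(dag, a, b):
    """True if there is a directed path a -> ... -> b."""
    stack, seen = [b], set()
    while stack:
        n = stack.pop()
        for p in dag.get(n, ()):
            if p == a:
                return True
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return False

def apply_move(dag, move):
    """Apply an add/delete/reverse move; return None if it creates a cycle."""
    op, x, y = move
    new = {n: set(ps) for n, ps in dag.items()}
    if op == 'add':
        if is_ancestor(new, y, x):
            return None
        new[y].add(x)
    elif op == 'delete':
        new[y].discard(x)
    else:  # reverse x -> y into y -> x
        new[y].discard(x)
        if is_ancestor(new, x, y):
            return None
        new[x].add(y)
    return new

def moves(dag, nodes, pc):
    """Candidate moves; add-edge X -> Y is only tried if Y ∈ PC_X."""
    e = edges_of(dag)
    for x in nodes:
        for y in nodes:
            if x != y and (x, y) not in e and y in pc.get(x, set()):
                yield ('add', x, y)
    for (x, y) in e:
        yield ('delete', x, y)
        yield ('reverse', x, y)

def hill_climb(nodes, pc, score, patience=15, tabu_len=100):
    """Greedy search with a TABU list of the last `tabu_len` structures;
    stops after `patience` moves without improving the best score."""
    dag = {n: set() for n in nodes}
    tabu = [frozenset(edges_of(dag))]
    best = {n: set() for n in nodes}
    best_score = score(dag)
    stall = 0
    while stall < patience:
        cand = None
        for m in moves(dag, nodes, pc):
            new = apply_move(dag, m)
            if new is None or frozenset(edges_of(new)) in tabu:
                continue
            s = score(new)
            if cand is None or s > cand[0]:
                cand = (s, new)
        if cand is None:
            break
        cur, dag = cand
        tabu = (tabu + [frozenset(edges_of(dag))])[-tabu_len:]
        if cur > best_score:
            best_score = cur
            best = {n: set(ps) for n, ps in dag.items()}
            stall = 0
        else:
            stall += 1
    return best
```

With a toy score that rewards agreement with a known edge set, the search recovers that structure, provided the candidate PC sets from the CB phase allow the true edges to be added.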
4. Experimental validation on synthetic data
Before we proceed to the experiments on real-world multi-label data with H2PC, we first conduct an experimental comparison of H2PC against MMHC on synthetic data sets sampled from eight well-known benchmarks with various data sizes, in order to gauge the practical relevance of H2PC. These BNs have been previously used as benchmarks for BN learning algorithms (see Table 1 for details). Results are reported in terms of various performance indicators to investigate how well the network generalizes to new data and how well the learned dependence structure matches the true structure of the benchmark network. We implemented H2PC in R (R Core Team, 2013) and integrated the code into the bnlearn package from (Scutari, 2010). MMHC was implemented by Marco Scutari in bnlearn. The source code of H2PC as well as all data sets used for the empirical tests are publicly available². The threshold considered for the type I error of the test is 0.05. Our experiments were carried out on a PC with an Intel(R) Core(TM) i5-3470M CPU @ 3.20 GHz and 4 GB RAM running under 64-bit Linux.
We do not claim that those data sets resemble real-world problems; however, they make it possible to compare the outputs of the algorithms with the known structure. All BN benchmarks (structure and probability tables) were downloaded from the bnlearn repository3. Ten sample sizes have been considered: 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000 and 50000. All experiments are repeated 10 times for each sample size and each BN. We investigate the behavior of both algorithms using the same parametric tests as a reference.

2 https://github.com/madbix/bnlearn-clone-3.4

Table 1: Description of the BN benchmarks used in the experiments.

network      # var.  # edges  max. degree  domain   PC set size
                              in/out       range    min/med/max
child          20       25     2/7          2-6      1/2/8
insurance      27       52     3/7          2-5      1/3/9
mildew         35       46     3/3          3-100    1/2/5
alarm          37       46     4/5          2-4      1/2/6
hailfinder     56       66     4/16         2-11     1/1.5/17
munin1        186      273     3/15         2-21     1/3/15
pigs          441      592     2/39         3-3      1/2/41
link          724     1125     3/14         2-4      0/2/17
4.1. Performance indicators
We first investigate the quality of the skeleton returned by H2PC during the CB phase. To this end, we measure the false positive edge ratio, the precision (i.e., the number of true positive edges in the output divided by the number of edges in the output), the recall (i.e., the number of true positive edges divided by the true number of edges), and a combination of precision and recall defined as \sqrt{(1 - \text{precision})^2 + (1 - \text{recall})^2}, which measures the Euclidean distance from perfect precision and recall, as proposed in (Pena et al., 2007). Second, to assess the quality of the final DAG output at the end of the SS phase, we report the five performance indicators (Scutari & Brogini, 2012) described below:
• the posterior density of the network for the data it was learned from, as a measure of goodness of fit. It is known as the Bayesian Dirichlet equivalent score (BDeu) from (Heckerman et al., 1995; Buntine, 1991) and has a single parameter, the equivalent sample size, which can be thought of as the size of an imaginary sample supporting the prior distribution. The equivalent sample size was set to 10 as suggested in (Koller & Friedman, 2009);

• the BIC score (Schwarz, 1978) of the network for the data it was learned from, again as a measure of goodness of fit;

• the posterior density of the network for a new data set, as a measure of how well the network generalizes to new data;

• the BIC score of the network for a new data set, again as a measure of how well the network generalizes to new data;

• the Structural Hamming Distance (SHD) between the learned and the true structure of the network, as a measure of the quality of the learned dependence structure. The SHD between two PDAGs is defined as the number of the following operators required to make the PDAGs match: add or delete an undirected edge, and add, remove, or reverse the orientation of an edge.

3 http://www.bnlearn.com/bnrepository
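The skeleton-quality indicators (precision, recall, and their Euclidean distance from the ideal point) and a simplified SHD count can be sketched in Python. This is our own illustration, not the bnlearn implementation; in particular, the `shd` shown simply charges one operator per differing adjacency or orientation, which matches the operator count described above.

```python
from math import hypot

def skeleton_quality(learned, true):
    """Precision, recall, and Euclidean distance from perfect precision/recall.
    `learned` and `true` are sets of undirected edges frozenset({u, v})."""
    tp = len(learned & true)                       # true positive edges
    precision = tp / len(learned) if learned else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall, hypot(1 - precision, 1 - recall)

def shd(pdag_a, pdag_b):
    """Simplified Structural Hamming Distance between two PDAGs, each given as
    a dict mapping frozenset({u, v}) to (u, v) for a directed edge u -> v, or
    to None for an undirected edge. Every differing entry (missing edge or
    different orientation) costs exactly one operator."""
    pairs = set(pdag_a) | set(pdag_b)
    return sum(1 for p in pairs
               if pdag_a.get(p, "absent") != pdag_b.get(p, "absent"))
```

For example, a learned skeleton with one correct and one spurious edge against a single-edge true skeleton yields precision 0.5, recall 1.0, and distance 0.5.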
For each data set sampled from the true probability distribution of the benchmark, we first learn a network structure with H2PC and MMHC, and then we compute the relevant performance indicators for each pair of network structures. The data set used to assess how well the network generalizes to new data is generated again from the true probability structure of the benchmark networks and contains 50000 observations.

Notice that using the BDeu score as a metric of reconstruction quality has the following two problems. First, the score corresponds to the a posteriori probability of a network only under certain conditions (e.g., a Dirichlet distribution of the hyperparameters); it is unknown to what degree these assumptions hold in distributions encountered in practice. Second, the score is highly sensitive to the equivalent sample size (set to 10 in our experiments) and depends on the network priors used. Since, typically, the same arbitrary value of this parameter is used both during learning and for scoring the learned network, the metric favors algorithms that use the BDeu score for learning. In fact, the BDeu score does not rely on the structure of the original, gold standard network at all; instead it employs several assumptions to score the networks. For those reasons, in addition to the score we also report the BIC score and the SHD metric.
4.2. Results
In Figure 1, we report the quality of the skeleton obtained with HPC over that obtained with MMPC (before the SS phase) as a function of the sample size. Results for each benchmark are not shown here in detail due to space restrictions. For the sake of conciseness, the performance values are averaged over the 8 benchmarks depicted in Table 1. The increase factor for a given performance indicator is expressed as the ratio of the performance value obtained with HPC over that obtained with MMPC (the gold standard). Note that for some indicators, an increase is actually not an improvement but is worse (e.g., false positive rate, Euclidean distance). For clarity, we mention explicitly on the subplots whether an increase factor > 1 should be interpreted as an improvement or not. Regarding the quality of the superstructure, the advantages of HPC against MMPC are noticeable. As observed, HPC consistently increases the recall and reduces the rate of false negative edges. As expected, this benefit comes at a little expense in terms of false positive edges. HPC also improves the Euclidean distance from perfect precision and recall on all benchmarks, while increasing the number of independence tests and thus the running time in the CB phase (see number of statistical tests). It is worth noting that HPC is capable of reducing by 50% the Euclidean distance with 50000 samples (lower left plot). These results are very much in line with other experiments presented in (Rodrigues de Morais & Aussem, 2010a; Villanueva & Maciel, 2012).

Figure 1: Quality of the skeleton obtained with HPC over that obtained with MMPC (after the CB phase). The first 2 subplots present mean values aggregated over the 8 benchmarks (false negative rate and false positive rate; lower is better). The last 4 subplots present increase factors of HPC/MMPC as a function of the sample size (recall and precision: higher is better; Euclidean distance: lower is better; number of statistical tests), with the median, quartile, and most extreme values (green boxplots), along with the mean value (black line).
In Figure 2, we report the quality of the final DAG obtained with H2PC over that obtained with MMHC (after the SS phase) as a function of the sample size. Regarding BDeu and BIC on both training and test data, the improvements are noteworthy. The results in terms of goodness of fit to training data and new data using H2PC clearly dominate those obtained using MMHC, whatever the sample size considered, hence its ability to generalize better. Regarding the quality of the network structure itself (i.e., how close the DAG is to the true dependence structure of the data), it is pretty much a dead heat between the 2 algorithms on small sample sizes (i.e., 50 and 100); however, we found H2PC to perform significantly better on larger sample sizes. The SHD increase factor decays rapidly (lower is better) as the sample size increases (lower left plot). For 50000 samples, the SHD is on average only 50% that of MMHC. Regarding the computational burden involved, we may observe from Table 2 that H2PC incurs a computational overhead compared to MMHC. The running time increase factor grows somewhat linearly with the sample size. With 50000 samples, H2PC is approximately 10 times slower on average than MMHC. This is mainly due to the computational expense incurred in obtaining larger PC sets with HPC, compared to MMPC.
5. Application to multi-label learning
In this section, we address the problem of multi-label learning with H2PC. MLC is a challenging problem in many real-world application domains, where each instance can be assigned simultaneously to multiple binary labels (Dembczyński et al., 2012; Read et al., 2009; Madjarov et al., 2012; Kocev et al., 2007; Tsoumakas et al., 2010b). Formally, learning from multi-label examples amounts to finding a mapping from a space of features to a space of labels. We shall assume throughout that X (a random vector in R^d) is the feature set, Y (a random vector in {0, 1}^n) is the label set, U = X ∪ Y, and p is a probability distribution defined over U. Given a multi-label training set D, the goal of multi-label learning is to find a function which is able to map any unseen example to its proper set of labels. From the Bayesian point of view, this problem amounts to modeling the conditional joint distribution p(Y | X).
5.1. Related work
This MLC problem may be tackled in various ways (Luaces et al., 2012; Alvares-Cherman et al., 2011; Read et al., 2009; Blockeel et al., 1998; Kocev et al., 2007). Each of these approaches is supposed to capture, to some extent, the relationships between labels. The two most straightforward meta-learning methods (Madjarov et al., 2012) are Binary Relevance (BR) (Luaces et al., 2012) and Label Powerset (LP) (Tsoumakas & Vlahavas, 2007; Tsoumakas et al., 2010b). Both methods can be regarded as opposites in the sense that BR considers each label independently, while LP considers the whole label set at once (one multi-class problem). An important question remains: what exactly shall we capture from the statistical relationships between labels to solve the multi-label classification problem? The problem has attracted a great deal of interest (Dembczyński et al., 2012; Zhang & Zhang, 2010). It is well beyond the scope and purpose of this paper to delve deeper into these approaches; we point the reader to (Dembczyński et al., 2012; Tsoumakas et al., 2010b) for a review. The second fundamental problem that we wish to address involves finding an optimal feature subset selection of a label set, w.r.t. an Information Theory criterion (Koller & Sahami, 1996). As in the single-label case, multi-label feature selection has been studied recently and has encountered some success (Gu et al., 2011; Spolaor et al., 2013).
5.2. Label powerset decomposition
We shall first introduce the concept of label powerset that will play a pivotal role in the factorization of the conditional distribution p(Y | X).
Definition 1. YLP ⊆ Y is called a label powerset iff YLP ⊥⊥ Y \ YLP | X. Additionally, YLP is said to be minimal if it is non-empty and has no label powerset as a proper subset.

Figure 2: Quality of the final DAG obtained with H2PC over that obtained with MMHC (after the SS phase). The 6 subplots present increase factors of H2PC/MMHC as a function of the sample size: log(BDeu posterior) on train data, BIC on train data, log(BDeu posterior) on test data, BIC on test data, and Structural Hamming Distance (lower is better for all five), plus the number of scores; with the median, quartile, and most extreme values (green boxplots), along with the mean value (black line).

Table 2: Total running time ratio R (H2PC/MMHC), mean ± standard deviation. A ratio R < 1 is in favor of H2PC; a ratio R > 1 is in favor of MMHC.

Network      50        100       200       500       1000      2000      5000       10000      20000      50000
child        0.94±0.1  0.87±0.1  1.14±0.1  1.99±0.2  2.26±0.1  2.12±0.2   2.36±0.4   2.58±0.3   1.75±0.6   1.78±0.5
insurance    0.96±0.1  1.09±0.1  1.56±0.1  2.93±0.2  3.06±0.3  3.48±0.4   3.69±0.3   4.10±0.4   3.76±0.6   3.75±0.5
mildew       0.77±0.1  0.80±0.1  0.79±0.1  0.94±0.1  1.01±0.1  1.23±0.1   1.74±0.2   2.14±0.2   3.26±0.6   6.20±1.0
alarm        0.88±0.1  1.11±0.1  1.75±0.1  2.43±0.1  2.55±0.1  2.71±0.1   2.65±0.2   2.80±0.2   2.49±0.3   2.18±0.6
hailfinder   0.85±0.1  0.85±0.1  1.40±0.1  1.69±0.1  1.83±0.1  2.06±0.1   2.13±0.1   2.12±0.2   1.95±0.2   1.96±0.6
munin1       0.77±0.0  0.85±0.0  0.93±0.0  1.35±0.0  2.11±0.0  4.30±0.2  12.92±0.7  23.32±2.6  24.95±5.1  24.76±6.7
pigs         0.80±0.0  0.80±0.0  4.55±0.1  4.71±0.1  5.00±0.2  5.62±0.2   7.63±0.3  11.10±0.6  14.02±1.7  11.74±3.2
link         1.16±0.0  1.93±0.0  2.76±0.0  5.55±0.1  7.04±0.2  8.19±0.2  10.00±0.3  13.87±0.4  15.32±2.5  24.74±4.2
all          0.89±0.1  1.04±0.1  1.86±1.2  2.70±1.5  3.11±1.9  3.71±2.2   5.39±4.0   7.75±7.3   8.44±8.4   9.64±9.7
Let LP denote the set of all powersets defined over U, and minLP the set of all minimal label powersets. It is easily shown that {Y, ∅} ⊆ LP. The key idea behind label powersets is the decomposition of the conditional distribution of the labels into a product of factors

p(Y | X) = \prod_{j=1}^{L} p(Y_{LP_j} | X) = \prod_{j=1}^{L} p(Y_{LP_j} | M_{LP_j})

where {Y_{LP_1}, ..., Y_{LP_L}} is a partition of label powersets and M_{LP_j} is a Markov blanket of Y_{LP_j} in X. From the above definition, we have Y_{LP_i} ⊥⊥ Y_{LP_j} | X, ∀ i ≠ j.
In the framework of MLC, one can consider a multitude of loss functions. In this study, we focus on a non label-wise decomposable loss function called the subset 0/1 loss, which generalizes the well-known 0/1 loss from the conventional to the multi-label setting. The risk-minimizing prediction for subset 0/1 loss is simply given by the mode of the distribution (Dembczyński et al., 2012). Therefore, we seek a factorization into a product of minimal factors in order to facilitate the estimation of the mode of the conditional distribution (also called the most probable explanation (MPE)):

\max_y p(y | x) = \prod_{j=1}^{L} \max_{y_{LP_j}} p(y_{LP_j} | x) = \prod_{j=1}^{L} \max_{y_{LP_j}} p(y_{LP_j} | m_{LP_j})
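Because the minimal powersets are conditionally independent given x, the maximization above decomposes: the joint mode is just the concatenation of the per-factor modes. A small Python sketch, using our own toy representation of the per-powerset conditionals p(y_{LP_j} | x) as dictionaries:

```python
def mpe(factors):
    """Most probable explanation of a factorized conditional distribution.
    `factors` is a list of (label_names, table) pairs, where `table` maps a
    tuple of label values to its conditional probability given x. Since the
    powersets are conditionally independent given x, the joint mode is the
    concatenation of the per-factor modes."""
    assignment, prob = {}, 1.0
    for labels, table in factors:
        best = max(table, key=table.get)   # per-factor mode
        prob *= table[best]
        assignment.update(dict(zip(labels, best)))
    return assignment, prob
```

For instance, with factors p(y1, y2 | x) and p(y3 | x), the MPE is obtained by maximizing each table separately and multiplying the two maxima.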
The next section aims to obtain theoretical results for the characterization of the minimal label powersets Y_{LP_j} and their Markov boundaries M_{LP_j} from a DAG, in order to be able to estimate the MPE more effectively.
5.3. Label powerset characterization
In this section, we show that minimal label powersets can be depicted graphically when p is faithful to a DAG.

Theorem 6. Suppose p is faithful to a DAG G. Then, Yi and Yj belong to the same minimal label powerset if and only if there exists an undirected path in G between nodes Yi and Yj in Y such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e., a collider of the form Yp → X ← Yq).
The proof is given in the Appendix. We shall now address the following question: what is the Markov boundary in X of a minimal label powerset? Answering this question for general distributions is not trivial. We establish a useful relation between a label powerset and its Markov boundary in X when p is faithful to a DAG.

Theorem 7. Suppose p is faithful to a DAG G. Let YLP = {Y1, Y2, ..., Yn} be a label powerset. Then, its Markov boundary M in U is also its Markov boundary in X, and is given in G by M = \bigcup_{j=1}^{n} (PC_{Y_j} ∪ SP_{Y_j}) \ Y.
The proof is given in the Appendix.
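Assuming the DAG is available as a dict of parent sets, the two graphical characterizations above can be sketched as follows. This is our own Python illustration (all names are ours, not bnlearn's): `minimal_label_powersets` implements the connectivity criterion of Theorem 6 as connected components of an auxiliary graph over Y, and `markov_boundary` reads off M = ⋃(PC ∪ SP) \ Y as in Theorem 7.

```python
def minimal_label_powersets(parents, labels):
    """Theorem 6: Yi and Yj fall in the same minimal powerset iff they are
    connected through label nodes or through a feature collider Yp -> X <- Yq.
    `parents` maps every node of G to its parent set; `labels` is the set Y."""
    adj = {y: set() for y in labels}       # auxiliary undirected graph over Y
    for node, pa in parents.items():
        if node in labels:
            for p in pa:                   # direct edge between two labels
                if p in labels:
                    adj[node].add(p); adj[p].add(node)
        else:                              # feature with >= 2 label parents
            lab_pa = [p for p in pa if p in labels]
            for i, a in enumerate(lab_pa):
                for b in lab_pa[i + 1:]:
                    adj[a].add(b); adj[b].add(a)
    seen, powersets = set(), []            # connected components = powersets
    for y in labels:
        if y in seen:
            continue
        comp, stack = set(), [y]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n])
        seen |= comp
        powersets.append(frozenset(comp))
    return powersets

def markov_boundary(parents, powerset, labels):
    """Theorem 7: M = union over Yj in the powerset of (PC_Yj ∪ SP_Yj) \\ Y."""
    children = {n: set() for n in parents}
    for node, pa in parents.items():
        for p in pa:
            children[p].add(node)
    m = set()
    for y in powerset:
        m |= parents[y] | children[y]      # PC: parents and children of Yj
        for c in children[y]:              # SP: spouses, i.e. the other
            m |= parents[c]                #     parents of Yj's children
    return m - set(labels)
```

The auxiliary-graph construction is valid because any qualifying path decomposes into elementary steps of the two kinds (i) and (ii) named in Theorem 6.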
5.4. Experimental setup
The MLC problem is decomposed into a series of multi-class classification problems, where each multi-class variable encodes one label powerset, with as many classes as the number of possible combinations of labels, or those present in the training data. At this point, it should be noted that the LP method is a special case of this framework, since the whole label set is a particular label powerset (not necessarily minimal, though). The above procedure can be summarized as follows:
1. Learn the BN local graph G around the label set;
2. Read off G the minimal label powersets and their respective Markov boundaries;
3. Train an independent multi-class classifier on each minimal LP, with the input space restricted to its Markov boundary in G;
4. Aggregate the predictions of each classifier to output the most probable explanation, i.e., arg max_y p(y | x).
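Steps 3 and 4 above can be sketched end-to-end. In the paper the base learner is a Random Forest in R, so the trivial majority-vote classifier below is only a self-contained stand-in, and all names (`train_mlp`, `predict_mlp`, `MajorityClassifier`) are ours:

```python
from collections import Counter, defaultdict

class MajorityClassifier:
    """Stand-in multi-class learner: predicts the class most often seen with a
    given (restricted) feature vector, falling back to the global mode."""
    def fit(self, X, y):
        self.by_x = defaultdict(Counter)
        for xi, yi in zip(X, y):
            self.by_x[tuple(xi)][yi] += 1
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, xi):
        counts = self.by_x.get(tuple(xi))
        return counts.most_common(1)[0][0] if counts else self.default

def train_mlp(X, Y, powersets, boundaries, label_names):
    """Step 3: one multi-class classifier per minimal label powerset, with the
    input space restricted to the powerset's Markov boundary (column indices)."""
    models = []
    for ps, mb in zip(powersets, boundaries):
        idx = [label_names.index(l) for l in ps]
        Xr = [[x[j] for j in mb] for x in X]           # restricted input space
        yr = [tuple(y[i] for i in idx) for y in Y]     # powerset -> one class
        models.append((idx, mb, MajorityClassifier().fit(Xr, yr)))
    return models

def predict_mlp(models, x, n_labels):
    """Step 4: aggregate the per-powerset modes into the joint MPE prediction."""
    y = [0] * n_labels
    for idx, mb, clf in models:
        for i, v in zip(idx, clf.predict([x[j] for j in mb])):
            y[i] = v
    return y
```

Any multi-class learner with `fit`/`predict` could replace the stand-in; the essential points are the class encoding of each powerset and the restriction of each classifier's inputs to its Markov boundary.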
To assess separately the quality of the minimal label powerset decomposition and the feature subset selection with Markov boundaries, we investigate 4 scenarios:

• BR without feature selection (denoted BR): a classifier is trained on each single label with all features as input. This is the simplest approach as it does not exploit any label dependency. It serves as a baseline learner for comparison purposes.

• BR with feature selection (denoted BR+MB): a classifier is trained on each single label with the input space restricted to its Markov boundary in G. Compared to the previous strategy, we evaluate here the effectiveness of the feature selection task.

• Minimal label powerset method without feature selection (denoted MLP): the minimal label powersets are obtained from the DAG. All features are used as inputs.

• Minimal label powerset method with feature selection (denoted MLP+MB): the minimal label powersets are obtained from the DAG. The input space is restricted to the Markov boundary of the labels in that powerset.
We use the same base learner in each meta-learning method: the well-known Random Forest classifier (Breiman, 2001). RF achieves good performance in standard classification as well as in multi-label problems (Kocev et al., 2007; Madjarov et al., 2012), and is able to handle both continuous and discrete data easily, which is much appreciated. The standard RF implementation in R (Liaw & Wiener, 2002)4 was used. For practical purposes, we restricted the forest size of RF to 100 trees, and left the other parameters at their default values.
5.4.1. Data and performance indicators
A total of 10 multi-label data sets are collected for the experiments in this section; their characteristics are summarized in Table 3. These data sets come from different problem domains including text, biology, and music. They can be found in the Mulan5 repository, except for image, which comes from Zhou6 (Maron & Ratan, 1998). Let D be the multi-label data set; we use |D|, dim(D), L(D), F(D) to represent the number of examples, number of features, number of possible labels, and feature type, respectively. DL(D) = |{Y | ∃x : (x, Y) ∈ D}| counts the number of distinct label combinations appearing in the data set. Continuous values are binarized during the BN structure learning phase.
Table 3: Data sets characteristics

data set   domain    |D|    dim(D)  F(D)   L(D)  DL(D)
emotions   music      593     72    cont.    6     27
yeast      biology   2417    103    cont.   14    198
image      images    2000    135    cont.    5     20
scene      images    2407    294    cont.    6     15
slashdot   text      3782   1079    disc.   22    156
genbase    biology    662   1186    disc.   27     32
medical    text       978   1449    disc.   45     94
enron      text      1702   1001    disc.   53    753
bibtex     text      7395   1836    disc.  159   2856
corel5k    images    5000    499    disc.  374   3175
The performance of a multi-label classifier can be assessed by several evaluation measures (Tsoumakas et al., 2010a). We focus on maximizing a non-decomposable score function: the global accuracy (also termed subset accuracy, the complement of the 0/1 loss), which measures the correct classification rate of the whole label set (exact match of all labels required).
4 http://cran.r-project.org/web/packages/randomForest
5 http://mulan.sourceforge.net/datasets.html
6 http://lamda.nju.edu.cn/data_MIMLimage.ashx
Note that the global accuracy implicitly takes into account the label correlations. It is therefore a very strict evaluation measure, as it requires an exact match of the predicted and the true set of labels. It was recently proved in (Dembczyński et al., 2012) that BR is optimal for decomposable loss functions (e.g., the Hamming loss), while non-decomposable loss functions (e.g., the subset loss) inherently require knowledge of the label conditional distribution. 10-fold cross-validation was performed for the evaluation of the MLC methods.
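The global (subset) accuracy used throughout the evaluation reduces to an exact-match rate over whole label vectors; a minimal Python sketch:

```python
def global_accuracy(Y_true, Y_pred):
    """Subset 0/1 accuracy: a prediction scores only on an exact match of the
    whole label vector (one row per example)."""
    assert len(Y_true) == len(Y_pred)
    hits = sum(1 for t, p in zip(Y_true, Y_pred) if list(t) == list(p))
    return hits / len(Y_true)
```

For example, predicting [1, 0] and [1, 1] against true label sets [1, 0] and [0, 1] gives a global accuracy of 0.5, even though 3 of the 4 individual labels are correct.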
5.4.2. Results
Table 4 reports the outputs of H2PC and MMHC; Table 5 shows the global accuracy of each method on the 10 data sets. Table 6 reports the running time and the average node degree of the labels in the DAGs obtained with both methods. Figures 3 up to 7 display graphically the local DAG structures around the labels obtained with H2PC and MMHC, for illustration purposes.
Several conclusions may be drawn from these experiments. First, we may observe by inspection of the average degree of the label nodes in Table 6 that several DAGs are densely connected, like scene or bibtex, while others are rather sparse, like genbase, medical, and corel5k. The DAGs displayed in Figures 3 up to 7 lend themselves to interpretation. They can be used for encoding as well as portraying the conditional independencies, and the d-separation criterion can be used to read them off the graph. Many label powersets that are reported in Table 4 can be identified by graphical inspection of the DAGs. For instance, the two label powersets in yeast (Figure 3, bottom plot) are clearly noticeable in both DAGs. Clearly, BNs have a number of advantages over alternative methods. They lay bare useful information about the label dependencies, which is crucial if one is interested in gaining an understanding of the underlying domain. It is, however, well beyond the scope of this paper to delve deeper into the DAG interpretation. Overall, it appears that the structures recovered by H2PC are significantly denser compared to MMHC. On bibtex and enron, the increase of the average label degree is the most spectacular. On bibtex (resp. enron), the average label degree has risen from 2.6 to 6.4 (resp. from 1.2 to 3.3). This result is in nice agreement with the experiments in the previous section, as H2PC was shown to consistently reduce the rate of false negative edges with respect to MMHC (at the cost of a slightly higher false discovery rate).
Second, Table 4 is very instructive as it reveals that on emotions, image, and scene, the MLP approach boils down to the LP scheme. This is easily seen as there is only one minimal label powerset extracted on average.
In contrast, on genbase the MLP approach boils down to the simple BR scheme. This is confirmed by inspecting Table 5: the performances of MMHC and H2PC (without feature selection) are equal. An interesting observation upon looking at the distribution of the label powerset sizes shows that, for the remaining data sets, MLP mostly decomposes the label set in two parts: one on which it performs BR and the other on which it performs LP. Take for instance enron: it can be seen from Table 4 that there are approximately 23 label singletons and a powerset of 30 labels with H2PC, for a total of 53 labels. The gain in performance with MLP over BR, our baseline learner, can be ascribed to the quality of the label powerset decomposition, as BR ignores label dependencies. As expected, the results using MLP clearly dominate those obtained using BR on all data sets except genbase. The DAGs display the minimal label powersets and their relevant features.
Third, H2PC compares favorably to MMHC. On scene for instance, the accuracy of MLP+MB has risen from 20% (with MMHC) to 56% (with H2PC). On yeast, it has risen from 7% to 23%. Without feature selection, the difference in global accuracy is less pronounced, but still in favor of H2PC. A Wilcoxon signed rank paired test reveals statistically significant improvements of H2PC over MMHC in the MLP approach without feature selection (p < 0.02). This trend is more pronounced when the feature selection is used (p < 0.001), using MLP+MB. Note, however, that the LP decompositions for H2PC and MMHC on yeast and image are strictly identical, hence the same accuracy values in Table 5 in the MLP column.
Fourth, regarding the utility of the feature selection, it is difficult to reach any conclusion. Whether BR or MLP is used, the use of the selected features as inputs to the classification model is not shown to be greatly beneficial in terms of global accuracy on average. The performance of BR and our MLP with all the features outperforms that with the selected features on 6 data sets, but the feature selection leads to actual improvements on 3 data sets. The difference in accuracy with and without the feature selection was not shown to be statistically significant (p > 0.20 with a Wilcoxon signed rank paired test). Surprisingly, the feature selection did exceptionally well on genbase. On this data set, the increase in accuracy is the most impressive: it rose from 7% to 98%, which is atypical. The dramatic increase in accuracy on genbase is due solely to the restricted feature set as input to the classification model. This is also observed on medical, to a lesser extent though. Interestingly, on large and densely connected networks (e.g., bibtex, slashdot and corel5k), the feature selection performed very well in terms of global accuracy and significantly reduced the input space, which is noticeable. On emotions, yeast, image, scene and genbase, the method reduced the feature set down to nearly 1/100 of its original size. The feature selection should be evaluated in view of its effectiveness at balancing the increasing error and the decreasing computational burden by drastically reducing the feature space. We should also keep in mind that the feature relevance cannot be defined independently of the learner and the model-performance metric (e.g., the loss function used). Admittedly, our feature selection based on the Markov boundaries is not necessarily optimal for the base MLC learner used here, namely the Random Forest model.
As far as the overall running time performance is concerned, we see from Table 6 that, for both methods, the running time grows somewhat exponentially with the size of the Markov boundary and the number of features, hence the considerable rise in total running time on bibtex. H2PC takes almost 200 times longer on bibtex (1826 variables) and enron (1001 variables), which is quite considerable but still affordable (44 hours of single-CPU time on bibtex and 13 hours on enron). We also observe that the size of the parent set with H2PC is 2.5 (on bibtex) and 3.6 (on enron) times larger than that of MMHC (which may hurt interpretability). In fact, the running time is known to increase exponentially with the parent set size of the true underlying network. This is mainly due to the computational overhead of the greedy search-and-score procedure with larger parent sets, which is the most promising part to optimize in terms of computational gains, as we discuss in the Conclusion.
6. Discussion & practical applications
Our prime conclusion is that H2PC is a promising approach to constructing BN global or local structures around specific nodes of interest, with potentially thousands of variables. Concentrating on higher recall values while keeping the false positive rate as low as possible pays off in terms of goodness of fit and structure accuracy. Historically, the main practical difficulty in the application of BN structure discovery approaches has been a lack of suitable computing resources and relevant accessible software. The number of variables which can be included in exact BN analysis is still limited. As a guide, this might be less than about 40 variables for exact structural search techniques (Perrier et al., 2008; Kojima et al., 2010). In contrast, constraint-based heuristics like the one presented in this study are capable of processing many thousands of features within hours on a
Table 4: Distribution of the number and the size of the minimal label powersets output by H2PC (top) and MMHC (bottom). On each data set: mean number of powersets, minimum/median/maximum number of labels per powerset, and minimum/median/maximum number of distinct classes per powerset. The total number of labels L(D) and of distinct label combinations DL(D) is recalled in parentheses for convenience.

H2PC
data set    # LPs         # labels / LP         # classes / LP
                          min/med/max (L(D))    min/med/max (DL(D))
emotions     1.0 ± 0.0    6 / 6 / 6    (6)      26 / 27 / 27   (27)
yeast        2.0 ± 0.0    2 / 7 / 12   (14)     3 / 64 / 135   (198)
image        1.0 ± 0.0    5 / 5 / 5    (5)      19 / 20 / 20   (20)
scene        1.0 ± 0.0    6 / 6 / 6    (6)      14 / 15 / 15   (15)
slashdot     9.5 ± 0.8    1 / 1 / 14   (22)     1 / 2 / 111    (156)
genbase     26.9 ± 0.3    1 / 1 / 2    (27)     1 / 2 / 2      (32)
medical     38.6 ± 1.8    1 / 1 / 6    (45)     1 / 2 / 10     (94)
enron       24.5 ± 1.9    1 / 1 / 30   (53)     1 / 2 / 334    (753)
bibtex      39.6 ± 4.3    1 / 1 / 112  (159)    2 / 2 / 1588   (2856)
corel5k     97.7 ± 3.9    1 / 1 / 263  (374)    1 / 2 / 2749   (3175)

MMHC
data set    # LPs         # labels / LP         # classes / LP
                          min/med/max (L(D))    min/med/max (DL(D))
emotions     1.2 ± 0.4    1 / 6 / 6    (6)      2 / 26 / 27    (27)
yeast        2.0 ± 0.0    1 / 7 / 13   (14)     2 / 89 / 186   (198)
image        1.0 ± 0.0    5 / 5 / 5    (5)      19 / 20 / 20   (20)
scene        1.0 ± 0.0    6 / 6 / 6    (6)      14 / 15 / 15   (15)
slashdot    15.1 ± 0.6    1 / 1 / 9    (22)     1 / 2 / 46     (156)
genbase     27.0 ± 0.0    1 / 1 / 1    (27)     1 / 2 / 2      (32)
medical     41.7 ± 1.4    1 / 1 / 4    (45)     1 / 2 / 7      (94)
enron       36.8 ± 2.7    1 / 1 / 12   (53)     1 / 2 / 119    (753)
bibtex      77.2 ± 3.1    1 / 1 / 28   (159)    2 / 2 / 233    (2856)
corel5k    165.3 ± 5.5    1 / 1 / 177  (374)    1 / 2 / 2294   (3175)
Table 5: Global classification accuracies using 4 learning methods (BR, MLP, BR+MB, MLP+MB) based on the DAG obtained with H2PC and MMHC.

data set    BR      MLP             BR+MB           MLP+MB
                    MMHC   H2PC     MMHC   H2PC     MMHC   H2PC
emotions   0.327   0.371  0.391    0.147  0.266    0.189  0.285
yeast      0.169   0.273  0.271    0.062  0.155    0.073  0.233
image      0.342   0.504  0.504    0.227  0.252    0.308  0.360
scene      0.580   0.733  0.733    0.123  0.361    0.199  0.565
slashdot   0.355   0.419  0.474    0.274  0.373    0.278  0.439
genbase    0.069   0.069  0.069    0.974  0.976    0.974  0.976
medical    0.454   0.499  0.502    0.630  0.651    0.636  0.669
enron      0.134   0.165  0.180    0.024  0.079    0.079  0.157
bibtex     0.108   0.115  0.167    0.120  0.133    0.121  0.177
corel5k    0.002   0.030  0.043    0.000  0.002    0.010  0.034
Table 6: DAG learning time (in seconds) and average degree of the label nodes.

data set     MMHC                    H2PC
             time   label degree     time       label degree
emotions      1.5    2.6 ± 1.1         16.2      4.7 ± 1.8
yeast         9.0    3.0 ± 1.6         90.9      5.0 ± 2.9
image         3.1    4.0 ± 0.8         28.4      4.6 ± 1.0
scene         9.9    3.1 ± 1.0        662.9      6.6 ± 1.8
slashdot     44.8    2.7 ± 1.4        822.3      4.0 ± 5.2
genbase      12.5    1.0 ± 0.3         12.4      1.1 ± 0.3
medical      43.8    1.7 ± 1.1        334.8      2.9 ± 2.4
enron       199.8    1.2 ± 1.5      47863.0      4.3 ± 5.2
bibtex      960.8    2.6 ± 1.3     159469.6      6.4 ± 10.3
corel5k     169.6    1.5 ± 1.4       2513.0      2.9 ± 2.8
personal computer, while maintaining a very high structural accuracy. H2PC and MMHC could potentially handle up to 10,000 labels in a few days of single-CPU time, and far less by parallelizing the skeleton identification algorithm as discussed in (Tsamardinos et al., 2006; Villanueva & Maciel, 2014).
The advantages in terms of structure accuracy and the ability to scale to thousands of variables open up many avenues of future applications of H2PC in various domains, as we shall discuss next. For example, BNs have proven to be especially useful abstractions in computational biology (Nagarajan et al., 2013; Scutari & Nagarajan, 2013; Prestat et al., 2013; Aussem et al., 2012, 2010; Pena, 2008; Pena et al., 2005). Identifying the gene network is crucial for understanding the behavior of the cell, which, in turn, can lead to better diagnosis and treatment of diseases. This is also of great importance for characterizing the function of genes and the proteins they encode in determining traits, psychology, or the development of an organism. Genome sequencing uses high-throughput techniques like DNA microarrays, proteomics, metabolomics and mutation analysis to describe the function and interactions of thousands of genes (Zhang & Zhou, 2006). Learning BN models of gene networks from these huge data sets is still a difficult task (Badea, 2004; Bernard & Hartemink, 2005; Friedman et al., 1999a; Ott et al., 2004; Peer et al., 2001). In these studies, the authors had to decide in advance which genes were included in the learning process (in all cases fewer than 1000) and which genes were excluded from it (Pena, 2008). H2PC overcomes this problem by focusing the search around a targeted gene: the key step is the identification of the vicinity of a node X (Pena et al., 2005).
Our second objective in this study was to demonstrate the potential utility of hybrid BN structure discovery for multi-label learning. In multi-label data, where many inter-dependencies between the labels may be present, explicitly modeling all relationships between the labels is intuitively far more reasonable (as demonstrated in our experiments). BNs explicitly account for such inter-dependencies, and the DAG allows us to identify an optimal set of predictors for every label powerset. The experiments presented here support the conclusion that local structural learning in the form of local neighborhood induction and Markov blanket discovery is a theoretically well-motivated approach that can serve as a powerful learning framework for label dependency analysis geared toward multi-label learning. Multi-label scenarios are found in many application domains, such as multimedia annotation (Snoek et al., 2006; Trohidis et al., 2008), tag recommendation, text categorization (McCallum, 1999; Zhang & Zhou, 2006), protein function classification (Roth & Fischer, 2007), and antiretroviral drug categorization (Borchani et al., 2013).
7. Conclusion & avenues for future research
We first discussed a hybrid algorithm for global or local (around target nodes) BN structure learning called Hybrid HPC (H2PC). Our extensive experiments showed that H2PC outperforms the state-of-the-art MMHC by a significant margin in terms of edge recall without sacrificing the number of extra edges, which is crucial for the soundness of the super-structure used during the second stage of hybrid methods, like the ones proposed in (Perrier et al., 2008; Kojima et al., 2010). The code of H2PC is open-source and publicly available online at https://github.com/madbix/bnlearn-clone-3.4. Second, we discussed an application of H2PC to the multi-label learning problem, which is a challenging problem in many real-world application domains. We established theoretical results, under the faithfulness condition, in order to characterize graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution, and their respective Markov boundaries. As far as we know, this is the first investigation of Markov boundary principles for the optimal variable/feature selection problem in multi-label learning. These formal results offer a simple guideline to characterize graphically: i) the minimal label powerset decomposition, i.e., into minimal subsets YLP ⊆ Y such that YLP ⊥⊥ Y \ YLP | X; and ii) the minimal subset of features, w.r.t. an Information Theory criterion, needed to predict each label powerset, thereby reducing the input space and the computational burden of multi-label classification. The theoretical analysis laid the foundation for a practical multi-label classification procedure. Another set of experiments was carried out on a broad range of multi-label data sets from different problem domains to demonstrate its effectiveness. H2PC was shown to outperform MMHC by a significant margin in terms of global accuracy.
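The decomposition described above can be turned into code along the following lines. This is only a sketch of our reading of the procedure: `powersets`, `mb_features` and `make_clf` are hypothetical inputs standing for the minimal label powersets, their Markov-boundary feature subsets, and any base multi-class learner with a fit/predict interface.

```python
def multilabel_fit_predict(X_train, Y_train, X_test, powersets, mb_features, make_clf):
    """Label-powerset decomposition classifier (sketch).

    powersets:   list of minimal label powersets (disjoint groups of label names)
    mb_features: for each powerset, the feature names in its Markov boundary
    make_clf:    factory for a multi-class classifier exposing fit/predict
    Each powerset is encoded as a single multi-class variable (the joint label
    configuration), predicted from its own reduced feature set, then decoded.
    """
    preds = {}
    for group, feats in zip(powersets, mb_features):
        # encode the joint configuration of this label group as one class
        y = [tuple(row[l] for l in group) for row in Y_train]
        clf = make_clf()
        clf.fit([[row[f] for f in feats] for row in X_train], y)
        yhat = clf.predict([[row[f] for f in feats] for row in X_test])
        for j, l in enumerate(group):  # decode back to individual labels
            preds[l] = [t[j] for t in yhat]
    return preds
```

Restricting each multi-class problem to the Markov-boundary features of its powerset is what reduces the input space, per point ii) above.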
We suggest several avenues for future research. As far as BN structure learning is concerned, future work will aim at: 1) ascertaining which independence test (e.g., tests targeting specific distributions, employing parametric assumptions, etc.) is most suited to the data at hand (Tsamardinos & Borboudakis, 2010; Scutari, 2011); 2) controlling the false discovery rate of the edges in the graph output by H2PC (Pena, 2008), especially when dealing with more nodes than samples, e.g., learning gene networks from gene expression data. In this study, H2PC was run independently on each node without keeping track of the dependencies found previously. This led to some loss of efficiency due to redundant calculations. The optimization of the H2PC code is currently being undertaken to lower the computational cost while maintaining its performance. These optimizations will include the use of a cache to store the (in)dependencies and the use of a global structure. Other interesting research avenues to explore are extensions and modifications to the greedy search-and-score procedure, which is the most promising part to optimize in terms of computational gains. Regarding the multi-label learning problem, experiments with several thousands of labels are currently being conducted and will be reported in due course. We also intend to work on relaxing the faithfulness assumption and derive a practical multi-label classification algorithm based on H2PC that is correct under milder assumptions on the joint distribution (e.g., Composition, Intersection). This needs further substantiation through more analysis.
Acknowledgments
The authors thank Marco Scutari for sharing his bnlearn package in R. The research was also supported by grants from the European ENIAC Joint Undertaking (INTEGRATE project) and from the French Rhone-Alpes Complex Systems Institute (IXXI).
Appendix
Lemma 8. ∀ YLP_i, YLP_j ∈ LP, YLP_i ∩ YLP_j ∈ LP.

Proof. To keep the notation uncluttered and for the sake of simplicity, consider a partition {Y1, Y2, Y3, Y4} of Y such that YLP_i = Y1 ∪ Y2 and YLP_j = Y2 ∪ Y3. From the label powerset assumption for YLP_i and YLP_j, we have (Y1 ∪ Y2) ⊥⊥ (Y3 ∪ Y4) | X and (Y2 ∪ Y3) ⊥⊥ (Y1 ∪ Y4) | X. From the Decomposition property, we have that Y2 ⊥⊥ (Y3 ∪ Y4) | X. Using the Weak union property, we obtain that Y2 ⊥⊥ Y1 | (Y3 ∪ Y4 ∪ X). From these two facts, we can use the Contraction property to show that Y2 ⊥⊥ (Y1 ∪ Y3 ∪ Y4) | X. Therefore, Y2 = YLP_i ∩ YLP_j is a label powerset by definition.
Lemma 9. Let Yi and Yj denote two distinct labels in Y, and denote by YLP_i and YLP_j their respective minimal label powersets. Then we have

∃ Z ⊆ Y \ {Yi, Yj}, {Yi} ⊥̸⊥ {Yj} | (X ∪ Z) ⇒ YLP_i = YLP_j.

Proof. Let us suppose YLP_i ≠ YLP_j. By the label powerset definition for YLP_i, we have YLP_i ⊥⊥ Y \ YLP_i | X. As YLP_i ∩ YLP_j = ∅ owing to Lemma 8, we have that Yj ∉ YLP_i. Any Z ⊆ Y \ {Yi, Yj} can be decomposed as Zi ∪ Zj such that Zi = Z ∩ (YLP_i \ {Yi}) and Zj = Z \ Zi. Using the Decomposition property, we have that ({Yi} ∪ Zi) ⊥⊥ ({Yj} ∪ Zj) | X. Using the Weak union property, we have that {Yi} ⊥⊥ {Yj} | (X ∪ Z). As this is true ∀ Z ⊆ Y \ {Yi, Yj}, there exists no Z ⊆ Y \ {Yi, Yj} such that {Yi} ⊥̸⊥ {Yj} | (X ∪ Z), which completes the proof by contraposition.
Lemma 10. Consider YLP a minimal label powerset. Then, if p satisfies the Composition property, there exists no nonempty proper subset Z ⊂ YLP such that Z ⊥⊥ YLP \ Z | X.

Proof. By contradiction, suppose such a nonempty Z exists, with Z ⊥⊥ YLP \ Z | X. From the label powerset assumption for YLP, we have that YLP ⊥⊥ Y \ YLP | X. From these two facts, we can use the Composition property to show that Z ⊥⊥ Y \ Z | X, which contradicts the minimality of the label powerset YLP. This concludes the proof.
Theorem 7. Suppose p is faithful to a DAG G. Then, Yi and Yj belong to the same minimal label powerset if and only if there exists an undirected path in G between nodes Yi and Yj such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e., a collider of the form Yp → X ← Yq).

Proof. Suppose such a path exists. By conditioning on all the intermediate colliders Yk in Y of the form Yp → Yk ← Yq along this path, we ensure that ∃ Z ⊆ Y \ {Yi, Yj} such that Yi and Yj are d-connected given X ∪ Z. Due to the faithfulness, this is equivalent to {Yi} ⊥̸⊥ {Yj} | (X ∪ Z). From Lemma 9, we conclude that Yi and Yj belong to the same minimal label powerset.

To show the converse, note that owing to Lemma 10, there exists no partition {Zi, Zj} of Z such that ({Yi} ∪ Zi) ⊥⊥ ({Yj} ∪ Zj) | X. Due to the faithfulness, there exists at least one edge between ({Yi} ∪ Zi) and ({Yj} ∪ Zj) in the DAG; hence, by recursion, there exists a path between Yi and Yj such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e., a collider of the form Yp → X ← Yq).
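Theorem 7 suggests a simple graphical procedure for recovering the minimal label powersets from a learned DAG: link two labels whenever they are adjacent in G or share a common child in X, and take the connected components (any admissible path decomposes into such one- or two-step segments between consecutive labels). The sketch below illustrates this reading of the theorem; the dict-of-children representation and function names are our own, not the authors' code.

```python
from itertools import combinations

def minimal_label_powersets(children, labels):
    """Connected components of the auxiliary undirected graph over the labels.

    children: dict mapping each node to the set of its children in the DAG G.
    labels:   set of label nodes Y; every other node is treated as a feature.

    Per Theorem 7, two labels Yi, Yj are linked iff they are adjacent in G or
    have a common child in X (a collider of the form Yi -> X <- Yj).
    """
    def linked(yi, yj):
        if yj in children.get(yi, set()) or yi in children.get(yj, set()):
            return True  # adjacent in G
        common = children.get(yi, set()) & children.get(yj, set())
        return any(x not in labels for x in common)  # common collider in X

    # Union-find over the labels
    parent = {y: y for y in labels}
    def find(y):
        while parent[y] != y:
            parent[y] = parent[parent[y]]
            y = parent[y]
        return y
    for yi, yj in combinations(labels, 2):
        if linked(yi, yj):
            parent[find(yi)] = find(yj)

    comps = {}
    for y in labels:
        comps.setdefault(find(y), set()).add(y)
    return list(comps.values())
```

A common child that is itself a label needs no special case: it appears as an intermediate node of the component via its own adjacency edges.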
Theorem 8. Suppose p is faithful to a DAG G. Let YLP = {Y1, Y2, . . . , Yn} be a label powerset. Then, its Markov boundary M in U is also its Markov boundary in X, and is given in G by M = ⋃_{j=1}^{n} {PC_{Yj} ∪ SP_{Yj}} \ Y.

Proof. First, we prove that M is a Markov boundary of YLP in U. Define Mi as the Markov boundary of Yi in U, and M′i = Mi \ Y. From Theorem 5, Mi is given in G by Mi = PC_{Yi} ∪ SP_{Yi}. We may now prove that M′1 ∪ · · · ∪ M′n is a Markov boundary of {Y1, Y2, . . . , Yn} in U. We show first that the statement holds for n = 2 and then conclude that it holds for all n by induction. Let W denote U \ (Y1 ∪ Y2 ∪ M′1 ∪ M′2), and define Y1 = {Y1} and Y2 = {Y2}. From the Markov blanket assumption for Y1 we have Y1 ⊥⊥ U \ (Y1 ∪ M1) | M1. Using the Weak union property, we obtain that Y1 ⊥⊥ W | M′1 ∪ M′2 ∪ Y2. Similarly we can derive Y2 ⊥⊥ W | M′2 ∪ M′1 ∪ Y1. Combining these two statements yields Y1 ∪ Y2 ⊥⊥ W | M′1 ∪ M′2 due to the Intersection property. Letting M = M′1 ∪ M′2, the last expression can be formulated as Y1 ∪ Y2 ⊥⊥ U \ (Y1 ∪ Y2 ∪ M) | M, which is the definition of a Markov blanket of Y1 ∪ Y2 in U. We shall now prove that M is minimal. Let us suppose that it is not the case, i.e., ∃ Z ⊂ M such that Y1 ∪ Y2 ⊥⊥ Z ∪ (U \ (Y1 ∪ Y2 ∪ M)) | M \ Z. Define Z1 = Z ∩ M1 and Z2 = Z ∩ M2; we may apply the Weak union property to get Y1 ⊥⊥ Z1 | (M \ Z) ∪ Y2 ∪ Z2 ∪ (U \ (Y ∪ M)), which can be rewritten more compactly as Y1 ⊥⊥ Z1 | (M1 \ Z1) ∪ (U \ (Y1 ∪ M1)). From the Markov blanket assumption, we have Y1 ⊥⊥ U \ (Y1 ∪ M1) | M1. We may now apply the Intersection property on these two statements to obtain Y1 ⊥⊥ Z1 ∪ (U \ (Y1 ∪ M1)) | M1 \ Z1. Similarly, we can derive Y2 ⊥⊥ Z2 ∪ (U \ (Y2 ∪ M2)) | M2 \ Z2. From the Markov boundary assumption of M1 and M2, we necessarily have Z1 = ∅ and Z2 = ∅, which in turn yields Z = ∅. To conclude for any n > 2, it suffices to set Y1 = ⋃_{j=1}^{n−1} {Yj} and Y2 = {Yn} to conclude by induction.

Second, we prove that M is a Markov blanket of YLP in X. Define Z = M ∩ Y. From the label powerset definition, we have YLP ⊥⊥ Y \ YLP | X. Using the Weak union property, we obtain YLP ⊥⊥ Z | U \ (YLP ∪ Z), which can be reformulated as YLP ⊥⊥ Z | (U \ (YLP ∪ M)) ∪ (M \ Z). Now, the Markov blanket assumption for YLP in U yields YLP ⊥⊥ U \ (YLP ∪ M) | M, which can be rewritten as YLP ⊥⊥ U \ (YLP ∪ M) | (M \ Z) ∪ Z. From the Intersection property, we get YLP ⊥⊥ Z ∪ (U \ (YLP ∪ M)) | M \ Z. From the Markov boundary assumption of M in U, we know that there exists no proper subset of M which satisfies this statement, and therefore Z = M ∩ Y = ∅. From the Markov blanket assumption of M in U, we have YLP ⊥⊥ U \ (YLP ∪ M) | M. Using the Decomposition property, we obtain YLP ⊥⊥ X \ M | M, which, together with the fact M ∩ Y = ∅, is the definition of a Markov blanket of YLP in X.

Finally, we prove that M is a Markov boundary of YLP in X. Let us suppose that it is not the case, i.e., ∃ Z ⊂ M such that YLP ⊥⊥ Z ∪ (X \ M) | M \ Z. From the label powerset assumption for YLP, we have YLP ⊥⊥ Y \ YLP | X, which can be rewritten as YLP ⊥⊥ Y \ YLP | (X \ M) ∪ (M \ Z) ∪ Z. Due to the Contraction property, combining these two statements yields YLP ⊥⊥ Z ∪ (U \ (M ∪ YLP)) | M \ Z. From the Markov boundary assumption of M in U, we necessarily have Z = ∅, which suffices to prove that M is a Markov boundary of YLP in X.
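Theorem 8 can be read directly off the DAG: the Markov boundary of a label powerset is the union, over its labels, of parents, children and spouses (co-parents of common children), minus the labels themselves. A minimal sketch under that reading (the dict-of-children representation and names are ours, not the authors' code):

```python
def markov_boundary(children, powerset):
    """M = union over Yj in the powerset of PC_{Yj} ∪ SP_{Yj}, minus the labels.

    children: dict mapping each node to the set of its children in the DAG G.
    powerset: the label powerset YLP whose Markov boundary is sought.
    """
    def parents(v):
        return {u for u, ch in children.items() if v in ch}

    m = set()
    for y in powerset:
        pc = parents(y) | children.get(y, set())  # parents and children of y
        sp = set()
        for c in children.get(y, set()):          # spouses: the other parents
            sp |= parents(c) - {y}                # of y's children
        m |= pc | sp
    return m - set(powerset)
```

Note that labels of the powerset that happen to be parents, children or spouses of one another are removed by the final set difference, matching the "\ Y" term in the theorem.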
Figure 3: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Emotions and Yeast.
Figure 4: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Image and Scene.
Figure 5: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Slashdot and Genbase.
Figure 6: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Medical and Enron.
movement
fishfighting
betta
neural
psychology
2004
behavior
display
male
calculations
molecular
energy
functional
density
could
science
carbon
employed
second
self
series
approximate
calculate
spin
acid
yield
species
coupled
gaussian
experiment
contributions
comparison
predicted
chem
set
x
treatment
relative
studied
interactions
site
level
active
structures
consistent
been
obtained
transfer
carried
frequencies
ion
geometry
compared
approximation
theoretical
found
activation
gas
sets
optimized
potentials catalytic
perturbation
complexes
study
accurate
computed
transition
effects
correlation6
computational
electron
molecule
agreement
charge
structure
experimental
exchange
calculation
electrostatic
methods
combined
barrier
reactions
atoms
method
hydrogen
mol
basis
potential
american
electronic
calculated solvent
reaction
molecules
c
bond
mechanical
dft
hybrid
qm
energies
quantum
journal
chemical
chemistry
initio
networks
physical
children
child
numbers
classification
text
categories
clustering
constraints
cognitive
collaborative
collaboration
shared
accuracy
community communities
entities
link
enzyme
competition
labeled
complex
complexity
compounds
proceedings
interpretation
target
similarity
semantic
words
language
word
computer
school
computing
la
concept
conceptual
context
aware
contexts
cortex
input
critical
data
genome
automation
consumption
power
average
technique
reduction
multi
policy
date
memory
reducing
dynamic
performance
embedded
design
creating
development
diffusion
ontology
hierarchical
assessment
identity
2006
basic
developmental
dynamics
physics
phys
exhibits
education
educational
learning
und
e
electrochemical
electrode
limits
polymer
current
direct
signal
vs
analytical
modified
range
oxidation
sensor
immobilized
immunoassay
antibody
linked
empirical
software
arise
engineering
proteins
identification
distinct
located
specific
fragments
expression
equation
evaluation
evaluating
a
evolution
requirements
evolutionary
changes
change
evolving
maintenance
web
share
tagging
formal
games
game
velocity
homogeneous
media
graph
interfaces
addition
separation
imaging
brain
medical
directions
visualization
image
multiple
images
matter
procedure
4
min
developed
assays
sensitivity
ml
antibodies
detection
assay
response
information
entropy
retrieval
mathematics
computers
teaching
constant
kinetic
antigen
knowledge
management
the
experiences
personal
languages
teachers
membrane
technology
was
access
devices
health
logic
description
reasoning
mapping
n
u
der
mathematical
references
culture
perspective
practices
environments
designing
methodology
metrics
model
models
modeling
international
modelling
proc
generation
06 network
equilibrium
extraction
documents
object
oriented
and
ontologies
to
pattern
patterns
phase
surface
process
programming
programs
hypermedia
course
standards
adaptation
adaptive
were
argued
novel
experiments
combination
principle
random
book
rdf
graphs
representation
representations
illustrated
re
specification
research
review
reviewed
synthesis
cause
enzymes
scientific
search
manner
semantics
sequence
partial
sequences
simulation
simulations
social
latent
probabilistic
source
statistics
xxiii
survey
system
systems
library
users
communication
adapted
theory
biological
dna
cells
financial
optimization
driven
formation
surfaces
temporal
growth
liquid
gap
glass
visual
views
exploration
graphical
w
0
workshop
background
first
p
02
associated
objects
vision
enabling
meaning
machine
automatically
discovery
training
situations
tools
experience
practice
tasks
activities
learned
learn
students
relevant
98
institute
statistical
letters
degrees
curves
chosen
conducted
prototype
fully
smaller
uniform
automated
similar
benefits
possibility
goal
discrete
expected
region
equal
connected
expansion
depend
considered
leads
content
strongly
short
origin
linear
introduced
corresponding
particular
shown
chains
induced
wave
bulk
forces
motion
stationary
force
local
coupling
observed
square
parameters
generalized
interacting
relaxation
also
symmetry
dependence
states
exact
particle
thermodynamic
equations
point
mean
are
scaling
field
numerical
finite
techniques
on
vector
we
proposed
problem
efficient
based
bioinformatics
produced
determination
this
developers
group
mediated
differential
an
flow
substrate
by
comparative
analyze
revealed
analyzing
analyses
is
of
that
combiningstability
improved
elements
domainsinfluence
components
importance
characteristics
highly
main
processes
four
further
significant
defined
among
including
applications
under
several
three
properties
s
its
other
different
both
between
for
daily
age
conclusion
clinical
creation
anti
directly
determine
capture
mixture
features
greater
detect
free
50
using
against
j
concentrations
bound
concentration
affinity
binding
recognition
directed
cell
previously
expressed
role
thermal
non
action
law
annual
rate
values
overall
identical
contrast
20
normal
amino
assessed
low
respectively
res
v
view
close
g
1noise
abstract
2005
lettour
matching
approaches
coefficients
coefficient
variations
national
genetic
genes
than
results
interact
fraction
charged
fluid
sample
as
subject
risk
individuals
major
beta
run
reference
those
l
factors
interaction
activity
extent
biol
profile
40
levels
total
amounts
composition
very
rich
contained
enhanced
modification
had
decreased
recognized
increased
receptor
paper
built
execution
platform
infrastructure
reveals
turn
kinds
namely
characterization
relatively
light
complete
across
lack
little
involved
various
space
out
terms
large
about
present
or
who
his
1998
least
european
instance
transactions
procedures
allows
show
control
behavioursiamese
splendens
processing
rev
behaviors
differ
24
pairs
trans
performed
base
hard
shift
weight
small
structural
atomic
mechanics
water
minimumground
state
improvement
mixed
iii
dependent
whereas
double
gradient
be
notes
progress
order
two
organized
distribution
magnetic
25
side
decreasepathway
acids
ions
cluster
functions
contribution
their
predict
prediction
predictions
fit
y 2
10
alpha
theoretically
effect
attractive
sites
lattice
residues
high
at
mechanism
with
has
have
implemented
identified
far
past
however
recently
frequency
configuration
modeled
intelligence
framework
no
quality
formed
case
qualitative
character
findings
intermediate
occurs
transitions
temperature
r
strong
parameter
good
correlated
correlations
5
7
8
18
11
12
3
restricted
computation
cost
excellent
experimentally
full
strength
boltzmann
applicable
reliable
best
reverse
kinetics
realistic
one
extendedconventional
estimate
applied
measurement
estimated
society
solution
combines
chain
indicates
intrinsic
involving
path
step
reserved
h
2003
2000
1997
o
d
systematic
mm
environment
give
body
treated
fock
classical
studies
obtain
tested
ab
nodes
emergence
links
degree
scale
topology
young
early
year
skills
years
play
care
developing
number
probabilityopportunities
document
articles
clusters
composed
parallel
controlled
selected
actual
line
positive
resulting
during
product
standard
availablefinally
variety
presented
single
within
function
simple
when
into
new
time
strategies
arguetheories
user
virtual
impact
technical
functionality
decisions
goals
building
project
designed
principles
implementation
co
supporting
online
success
members
errors
participants
relationships
relations
ph
determined
reduce
symposium
2002
joint
conditions
acm www
definition
higher
measure
measures
not
natural
defining
acquisition
written
query
xml
decision
sense
supported
reports
ieee
internet
mobile
de
des
descriptions
concepts
rapidly
situation
there
left
area
period
layer
responses
maps
primary
orientation
output
arbitrary
exponents
from
prior
database
sources
collected
databases
whole
achieve
strategy
off
i
sizes
error
distance
agent
rather
test
comprehensive
storage
static
task
precision
better
efficiency
created
challenges
create
successful
develop
volume
mobility
domain
enable
assess
reliability
choice
examined
ability
dynamicalsimulated
fluctuations
regime
2001
1999
while
settings
die
dem
auch
sich
eine
sind
f
ist
mit
zur
als
den
auf
im
zu
von
limit
some
future
art
trends
currently
technologies
30
wide
long
broad monitoring
contact
measurements
resonance
onto
advantages
sensitive
detected
evidence
identify
industrial
reuse
program
managing
projects
code
conf
testing
changing
supports
needs
discuss
section
integrated
concernssupport
each
derived
begin
evaluated
evaluate
criteria
researchers
every
biology
selection
events
world
pages
topic
http
usage
intelligent
tool
sharing navigation
resources
services
they
generated
popular
look
video
market
others
heterogeneous
multimedia
medium
interface
specificity
resolution
areas
regions
since
interactive
seenonlinear
17
used
protocoldevice
quantitative
cross
solid
100
achieved
rapid
described
format
being
types
enables
existing
organizational
provides
people
provide
production
maximum
indexing
life
historical
ideas
solving
apparent
offers
measuring
amount
maintaining
integrating
making
construction
representing
providing
know
service
organizations
business
al
general
over
regardingmight
technological
innovation
what
digital
demonstrated
indicated
investigated
did
showed
conclusions
describing
already
parts
roles
map
such
sci
m
cultural
distributions
these
unified
int
next
generating
automatic
presentation
resource
distributed
connections
global
aspect
applying
family
establish
extend
consequences
construct
entire
facilitate
meaningful
allowing
secondary
comparable
completely
center
easy
perform
generate
constructed
interesting
generally
extensive
itself
required
appear
leading
furthermore
established
additional
upon
points
groups
limited
alternative
rates
likely
requires
increasing
examine
following
mechanismsuses
provided
initial
improve
together
key
individual
become
investigate
respect
lead
value
increase
must
allow
need
therefore
made
type
understand
according
effective
application
make
without
thus
up
able
way
them
most
related
alldue
use
but
more
it
which
many
will
you
clear
right
work
shows
like
find
phenomena
diagramphases
continuous
etc
included
day
60
suggesting
subsequent
can
analyzed
affected
15
five
seven
either
differences
less
having
special
answer
issues
foundation
academic
needed
papers
focuses
topics
discusses
efforts
emerging
questions
overviewunderstanding
how
developments
literature
occur
engine
relevance
queries
now
explicit
define
length
peptide
monte
behavioral
public
kind
political
organization
sciences
class
advances
libraries
communications
discussion
bases
markets
spatial
relation
growing
crystal
materials
attention
exploring
explore
clearly
13
where
lower
t
9
towards
then
(a) Bibtex
(b) Corel5k
Figure 7: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Bibtex and Corel5k.
References
Agresti, A. (2002). Categorical Data Analysis. Wiley, 2nd edition.Aliferis, C. F., Statnikov, A. R., Tsamardinos, I., Mani, S., & Kout-
soukos, X. D. (2010). Local causal and markov blanket inductionfor causal discovery and feature selection for classification part i:Algorithms and empirical evaluation. Journal of Machine Learn-
ing Research, JMLR, 11, 171–234.Alvares-Cherman, E., Metz, J., & Monard, M. (2011). Incorporating
label dependency into the binary relevance framework for multi-label classification. Expert Systems With Applications, ESWA, 39,1647–1655.
Armen, A. P., & Tsamardinos, I. (2011). A unified approach to esti-mation and control of the false discovery rate in bayesian networkskeleton identification. In European Symposium on Artificial Neu-
ral Networks, ESANN.Aussem, A., Rodrigues de Morais, S., & Corbex, M. (2012). Analysis
of nasopharyngeal carcinoma risk factors with bayesian networks.Artificial Intelligence in Medicine, 54.
Aussem, A., Tchernof, A., Rodrigues de Morais, S., & Rome, S.(2010). Analysis of lifestyle and metabolic predictors of visceralobesity with bayesian networks. BMC Bioinformatics, 11, 487.
Badea, A. (2004). Determining the direction of causal influence inlarge probabilistic networks: A constraint-based approach. In Pro-
ceedings of the Sixteenth European Conference on Artificial Intel-
ligence (pp. 263–267).Bernard, A., & Hartemink, A. (2005). Informative structure priors:
Joint learning of dynamic regulatory networks from multiple typesof data. In Proceedings of the Pacific Symposium on Biocomputing
(pp. 459–470).Blockeel, H., Raedt, L. D., & Ramon, J. (1998). Top-down induction
of clustering trees. In J. W. Shavlik (Ed.), International Conference
on Machine Learning, ICML (pp. 55–63). Morgan Kaufmann.Borchani, H., Bielza, C., Toro, C., & Larranaga, P. (2013). Predicting
human immunodeficiency virus inhibitors using multi-dimensionalbayesian network classifiers. Artificial Intelligence in Medicine,57, 219–229.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.Brown, L. E., & Tsamardinos, I. (2008). A strategy for making predic-
tions under manipulation. Journal of Machine Learning Research,
JMLR, 3, 35–52.Buntine, W. (1991). Theory refinement on Bayesian networks. In
B. D’Ambrosio, P. Smets, & P. Bonissone (Eds.), Uncertainty in
Artificial Intelligence, UAI (pp. 52–60). San Mateo, CA, USA:Morgan Kaufmann Publishers.
Cawley, G. (2008). Causal and non-causal feature selection for ridgeregression. JMLR: Workshop and Conference Proceedings, 3,107–128.
Cheng, J., Greiner, R., Kelly, J., Bell, D. A., & Liu, W. (2002). Learn-ing Bayesian networks from data: An information-theory basedapproach. Artificial Intelligence, 137, 43–90.
Chickering, D., Heckerman, D., & Meek, C. (2004). Large-samplelearning of Bayesian networks is NP-hard. Journal of Machine
Learning Research, JMLR, 5, 1287–1330.Chickering, D. M. (2002). Optimal structure identification with
greedy search. Journal of Machine Learning Research, JMLR, 3,507–554.
Cussens, J., & Bartlett, M. (2013). Advances in Bayesian NetworkLearning using Integer Programming. In Uncertainty in Artificial
Intelligence (pp. 182–191).DembczyÅski, K., Waegeman, W., Cheng, W., & Hullermeier, E.
(2012). On label dependence and loss minimization in multi-labelclassification. Machine Learning, 88, 5–45.
Ellis, B., & Wong, W. H. (2008). Learning causal bayesian network
structures from experimental data. Journal of the American Statis-
tical Association, 103, 778–789.Friedman, N., Nachman, I., , & Peer, D. (1999a). Learning bayesian
network structure from massive datasets: The sparse candidate al-gorithm. In Proceedings of the Fifteenth Conference on Uncer-
tainty in Artificial Intelligence (pp. 206–215).Friedman, N., Nachman, I., & Pe’er, D. (1999b). Learning bayesian
network structure from massive datasets: the ”sparse candidate”algorithm. In K. B. Laskey, & H. Prade (Eds.), Uncertainty in
Artificial Intelligence, UAI (pp. 21–30). Morgan Kaufmann Pub-lishers.
Gasse, M., Aussem, A., & Elghazel, H. (2012). Comparison of hybridalgorithms for bayesian network structure learning. In European
Conference on Machine Learning and Principles and Practice of
Knowledge Discovery in Databases, ECML-PKDD (pp. 58–73).Springer volume 7523 of Lecture Notes in Computer Science.
Gu, Q., Li, Z., & Han, J. (2011). Correlated multi-label feature selec-tion. In C. Macdonald, I. Ounis, & I. Ruthven (Eds.), CIKM (pp.1087–1096). ACM.
Guo, Y., & Gu, S. (2011). Multi-label classification using conditionaldependency networks. In International Joint Conference on Artifi-
cial Intelligence, IJCAI (pp. 1300–1305). AAAI Press.Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning
bayesian networks: The combination of knowledge and statisticaldata. Machine Learning, 20, 197–243.
Kocev, D., Vens, C., Struyf, J., & Dzeroski, S. (2007). Ensemblesof multi-objective decision trees. In J. N. Kok (Ed.), European
Conference on Machine Learning, ECML (pp. 624–631). Springervolume 4701 of Lecture Notes in Artificial Intelligence.
Koivisto, M., & Sood, K. (2004). Exact bayesian structure discov-ery in bayesian networks. Journal of Machine Learning Research,
JMLR, 5, 549–573.Kojima, K., Perrier, E., Imoto, S., & Miyano, S. (2010). Optimal
search on clustered structural constraint for learning bayesian net-work structure. Journal of Machine Learning Research, JMLR, 11,285–310.
Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models:
Principles and Techniques. MIT Press.Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In
International Conference on Machine Learning, ICML (pp. 284–292).
Liaw, A., & Wiener, M. (2002). Classification and regression by ran-domforest. R News, 2, 18–22.
Luaces, O., Dıez, J., Barranquero, J., del Coz, J. J., & Bahamonde,A. (2012). Binary relevance efficacy for multilabel classification.Progress in AI, 1, 303–313.
Madjarov, G., Kocev, D., Gjorgjevikj, D., & Dzeroski, S. (2012).An extensive experimental comparison of methods for multi-labellearning. Pattern Recognition, 45, 3084–3104.
Maron, O., & Ratan, A. L. (1998). Multiple-Instance Learning forNatural Scene Classification. In International Conference on Ma-
chine Learning, ICML (pp. 341–349). Citeseer volume 7.McCallum, A. (1999). Multi-label text classification with a mixture
model trained by em. In AAAI Workshop on Text Learning.Moore, A., & Wong, W. (2003). Optimal reinsertion: A new
search operator for accelerated and more accurate Bayesian net-work structure learning. In T. Fawcett, & N. Mishra (Eds.), Inter-
national Conference on Machine Learning, ICML.Nagarajan, R., Scutari, M., & Lbre, S. (2013). Bayesian Networks in
R: with Applications in Systems Biology (Use R!). Springer.Neapolitan, R. E. (2004). Learning Bayesian Networks. Upper Saddle
River, NJ: Pearson Prentice Hall.Ott, S., Imoto, S., & Miyano, S. (2004). Finding optimal models for
small gene networks. In Proceedings of the Pacific Symposium on
Biocomputing (pp. 557–567).
24
Pena, J., Nilsson, R., Bjorkegren, J., & Tegner, J. (2007). Towardsscalable and data efficient learning of Markov boundaries. Inter-
national Journal of Approximate Reasoning, 45, 211–232.Pena, J. M., Bjorkegren, J., & Tegner, J. (2005). Growing Bayesian
network models of gene networks from seed genes. Bioinformat-
ics, 40, 224–229.Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Net-
works of Plausible Inference.. San Francisco, CA, USA: MorganKaufmann.
Pena, J. M. (2008). Learning gaussian graphical models of gene net-works with false discovery rate control. In European Conference
on Evolutionary Computation, Machine Learning and Data Min-
ing in Bioinformatics 6 (pp. 165–176).Pena, J. M. (2012). Finding consensus bayesian network structures.
Journal of Artificial Intelligence Research, 42, 661–687.Perrier, E., Imoto, S., & Miyano, S. (2008). Finding optimal bayesian
network given a super-structure. Journal of Machine Learning Re-
search, JMLR, 9, 2251–2286.Peer, D., Regev, A., Elidan, G., & Friedman, N. (2001). Inferring
subnetworks from perturbed expression profiles. Bioinformatics,17, 215–224.
Prestat, E., Rodrigues de Morais, S., Vendrell, J., Thollet, A., Gautier,C., Cohen, P., & Aussem, A. (2013). Learning the local Bayesiannetwork structure around the ZNF217 oncogene in breast tumours.Computers in Biology and Medicine, 4, 334–341.
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org.

Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD (pp. 254–269). Springer, volume 5782 of Lecture Notes in Computer Science.

Rodrigues de Morais, S., & Aussem, A. (2010a). An efficient learning algorithm for local Bayesian network structure discovery. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD (pp. 164–169).

Rodrigues de Morais, S., & Aussem, A. (2010b). A novel Markov boundary based feature subset selection algorithm. Neurocomputing, 73, 578–584.

Roth, V., & Fischer, B. (2007). Improved functional prediction of proteins by learning kernel combinations in multilabel settings. BMC Bioinformatics, 8, S12+.

Schwarz, G. E. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.

Scutari, M. (2010). Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35, 1–22.

Scutari, M. (2011). Measures of Variability for Graphical Models. Ph.D. thesis, School in Statistical Sciences, University of Padova.

Scutari, M., & Brogini, A. (2012). Bayesian network structure learning with permutation tests. Communications in Statistics - Theory and Methods, 41, 3233–3243.
Scutari, M., & Nagarajan, R. (2013). Identifying significant edges in graphical models of molecular networks. Artificial Intelligence in Medicine, 57, 207–217.

Silander, T., & Myllymäki, P. (2006). A Simple Approach for Finding the Globally Optimal Bayesian Network Structure. In Uncertainty in Artificial Intelligence, UAI (pp. 445–452).

Snoek, C., Worring, M., Gemert, J. V., Geusebroek, J., & Smeulders, A. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In ACM International Conference on Multimedia (pp. 421–430). ACM Press.

Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search (2nd ed.). The MIT Press.
Spolaôr, N., Cherman, E. A., Monard, M. C., & Lee, H. D. (2013). A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science, 292, 135–151.

Studený, M., & Haws, D. (2014). Learning Bayesian network structure: Towards the essential graph by integer linear programming tools. International Journal of Approximate Reasoning, 55, 1043–1071.
Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multi-label classification of music into emotions. In ISMIR (pp. 325–330).

Tsamardinos, I., Aliferis, C., & Statnikov, A. (2003). Algorithms for large scale Markov blanket discovery. In Florida Artificial Intelligence Research Society Conference, FLAIRS'03 (pp. 376–381).

Tsamardinos, I., & Borboudakis, G. (2010). Permutation testing improves Bayesian network learning. In European Conference on Machine Learning and Knowledge Discovery in Databases, ECML-PKDD (pp. 322–337).
Tsamardinos, I., Brown, L., & Aliferis, C. (2006). The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65, 31–78.

Tsamardinos, I., & Brown, L. E. (2008). Bounding the false discovery rate in local Bayesian network learning. In AAAI Conference on Artificial Intelligence (pp. 1100–1105).

Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010a). Mining multi-label data. In Data Mining and Knowledge Discovery Handbook (pp. 667–685). Springer.

Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010b). Random k-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, TKDE, 23, 1–12.

Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multilabel classification. In Proceedings of the 18th European Conference on Machine Learning (Vol. 4701, pp. 406–417).

Villanueva, E., & Maciel, C. (2012). Optimized algorithm for learning Bayesian network superstructures. In International Conference on Pattern Recognition Applications and Methods, ICPRAM.

Villanueva, E., & Maciel, C. D. (2014). Efficient methods for learning Bayesian network super-structures. Neurocomputing, (pp. 3–12).

Zhang, M. L., & Zhang, K. (2010). Multi-label learning by exploiting label dependency. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD (pp. 999–1008). ACM Press.
Zhang, M.-L., & Zhou, Z.-H. (2006). Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18, 1338–1351.