HAL Id: hal-01153255
https://hal.archives-ouvertes.fr/hal-01153255
Submitted on 19 May 2015
Distributed under a Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International License
A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
Maxime Gasse, Alex Aussem, Haytham Elghazel
To cite this version: Maxime Gasse, Alex Aussem, Haytham Elghazel. A hybrid algorithm for Bayesian network structure learning with application to multi-label learning. Expert Systems with Applications, Elsevier, 2014, 41 (15), pp. 6755-6772. doi:10.1016/j.eswa.2014.04.032. hal-01153255
A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
Maxime Gasse, Alex Aussem∗, Haytham Elghazel
Université de Lyon, CNRS
Université Lyon 1, LIRIS, UMR5205, F-69622, France
Abstract
We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. We conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structure returned by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in terms of goodness of fit to new data and quality of the network structure with respect to the true dependence structure of the data. Second, we investigate H2PC's ability to solve the multi-label learning problem. We provide theoretical results to characterize and identify graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering different application domains. Overall, our experiments support the conclusion that local structure learning with H2PC, in the form of local neighborhood induction, is a theoretically well-motivated and empirically effective learning framework that is well suited to multi-label learning. The source code (in R) of H2PC as well as all data sets used for the empirical tests are publicly available.
Keywords: Bayesian networks, Multi-label learning, Markov boundary, Feature subset selection.
1. Introduction
A Bayesian network (BN) is a probabilistic model formed by a structure and parameters. The structure of a BN is a directed acyclic graph (DAG), whilst its parameters are conditional probability distributions associated with the variables in the model. The problem of finding the DAG that encodes the conditional independencies present in the data has attracted a great deal of interest over the last years (Rodrigues de Morais & Aussem, 2010a; Scutari, 2010; Scutari & Brogini, 2012; Kojima et al., 2010; Perrier et al., 2008; Villanueva & Maciel, 2012; Pena, 2012; Gasse et al., 2012). The inferred DAG is very useful for many applications, including feature selection (Aliferis et al., 2010; Pena et al., 2007; Rodrigues de Morais & Aussem, 2010b), causal relationships inference from observational data (Ellis & Wong, 2008; Aliferis et al., 2010; Aussem et al., 2012, 2010; Prestat et al., 2013; Cawley, 2008; Brown & Tsamardinos, 2008) and, more recently, multi-label learning (Dembczyński et al., 2012; Zhang & Zhang, 2010; Guo & Gu, 2011).

∗Corresponding author. Email address: [email protected] (Alex Aussem)
Ideally the DAG should coincide with the dependence structure of the global distribution, or it should at least identify a distribution as close as possible to the correct one in the probability space. This step, called structure learning, is similar in approaches and terminology to model selection procedures for classical statistical models. Basically, constraint-based (CB) learning methods systematically check the data for conditional independence relationships and use them as constraints to construct a partially oriented graph representative of a BN equivalence class, whilst search-and-score (SS) methods make use of a goodness-of-fit score function for evaluating graphical structures with regard to the data set. Hybrid methods attempt to get the best of both worlds: they learn a skeleton with a CB approach and constrain the DAGs considered during the SS phase.
In this study, we present a novel hybrid algorithm for Bayesian network structure learning, called H2PC¹. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on a divide-and-conquer constraint-based subroutine, called HPC, to learn the local structure around a target variable. HPC may be thought of as a way to compensate for the large number of false negatives at the output of the weak PC learner, by performing extra computations. As this may arise at the expense of the number of false positives, we control the expected proportion of false discoveries (i.e., false positive nodes) among all the discoveries made in PC_T. We use a modification of the Incremental Association Markov Boundary algorithm (IAMB), initially developed by Tsamardinos et al. in (Tsamardinos et al., 2003) and later modified by Jose Pena in (Pena, 2008) to control the FDR of edges when learning Bayesian network models. HPC scales to thousands of variables and can deal with many fewer samples (n < q). To illustrate its performance by means of empirical evidence, we conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for BN structure learning (Tsamardinos et al., 2006), using well-known BN benchmarks with various data sizes, to assess the goodness of fit to new data as well as the quality of the network structure with respect to the true dependence structure of the data.
We then address a real application of H2PC where the true dependence structure is unknown. More specifically, we investigate H2PC's ability to encode the joint distribution of the label set conditioned on the input features in the multi-label classification (MLC) problem. Many challenging applications, such as photo and video annotation and web page categorization, can benefit from being formulated as MLC tasks with a large number of categories (Dembczyński et al., 2012; Read et al., 2009; Madjarov et al., 2012; Kocev et al., 2007; Tsoumakas et al., 2010b). Recent research in MLC focuses on the exploitation of the label conditional dependency in order to better predict the label combination for each example. We show that local BN structure discovery methods offer an elegant and powerful approach to solve this problem. We establish two theorems (Theorems 6 and 7) linking the concepts of marginal Markov boundaries, joint Markov boundaries and so-called label powersets under the faithfulness assumption. These theorems offer a simple guideline to characterize graphically: i) the minimal label powerset decomposition, i.e., into minimal subsets Y_LP ⊆ Y such that Y_LP ⊥⊥ Y \ Y_LP | X, and ii) the minimal subset of features, w.r.t. an information-theoretic criterion, needed to predict each label powerset, thereby reducing the input space and the computational burden of the multi-label classification. To solve the MLC problem with BNs, the DAG obtained from the data plays a pivotal role. So in this second series of experiments, we assess the comparative ability of H2PC and MMHC to encode the label dependency structure by means of an indirect goodness-of-fit indicator, namely the 0/1 loss function, which makes sense in the MLC context.

¹A first version of H2PC without FDR control has been discussed in a paper that appeared in the Proceedings of ECML-PKDD, pages 58-73, 2012.
The rest of the paper is organized as follows: In Section 2, we review the theory of BNs and discuss the main BN structure learning strategies. We then present the H2PC algorithm in detail in Section 3. Section 4 evaluates our proposed method and shows results for several tasks involving artificial data sampled from known BNs. Then we report, in Section 5, on our experiments on real-world data sets in a multi-label learning context, so as to provide empirical support for the proposed methodology. The main theoretical results appear formally as two theorems (Theorems 6 and 7) in Section 5. Their proofs are established in the Appendix. Finally, Section 6 raises several issues for future work, and we conclude in Section 7 with a summary of our contribution.
2. Preliminaries
We define next some key concepts used along the paper and state some results that will support our analysis. In this paper, upper-case letters in italics denote random variables (e.g., X, Y) and lower-case letters in italics denote their values (e.g., x, y). Upper-case bold letters denote random variable sets (e.g., X, Y, Z) and lower-case bold letters denote their values (e.g., x, y). We denote by X ⊥⊥ Y | Z the conditional independence between X and Y given the set of variables Z. To keep the notation uncluttered, we use p(y | x) to denote p(Y = y | X = x).
2.1. Bayesian networks
Formally, a BN is a tuple <G, P>, where G = <U, E> is a directed acyclic graph (DAG) with nodes representing the random variables U, and P a joint probability distribution on U. In addition, G and P must satisfy the Markov condition: every variable X ∈ U is independent of any subset of its non-descendant variables conditioned on the set of its parents, denoted by Pa_i^G. From the Markov condition, it is easy to prove (Neapolitan, 2004) that the joint probability distribution P on the variables in U can be factored as follows:

P(U) = P(X_1, . . . , X_n) = ∏_{i=1}^{n} P(X_i | Pa_i^G)    (1)
Equation 1 allows a parsimonious decomposition of the joint distribution P. It enables us to reduce the problem of determining a huge number of probability values to that of determining relatively few.
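As a concrete illustration of Equation 1, the sketch below factorizes a hypothetical three-variable network A → B ← C. The CPT values are invented for illustration only; they do not come from the paper or its benchmarks.

```python
# Hypothetical CPTs for the DAG A -> B <- C (values are made up).
p_A = {0: 0.6, 1: 0.4}
p_C = {0: 0.7, 1: 0.3}
# p(B | A, C), indexed by the parent configuration (a, c)
p_B_given = {(0, 0): {0: 0.9, 1: 0.1},
             (0, 1): {0: 0.5, 1: 0.5},
             (1, 0): {0: 0.4, 1: 0.6},
             (1, 1): {0: 0.2, 1: 0.8}}

def joint(a, b, c):
    # Equation 1: P(A, B, C) = P(A) * P(C) * P(B | A, C),
    # since A and C have no parents and B's parents are {A, C}.
    return p_A[a] * p_C[c] * p_B_given[(a, c)][b]

# The factored distribution sums to 1 over the 8 configurations,
# even though we only specified 2 + 2 + 4 local distributions.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Only 8 independent parameters are stored here instead of the 2³ − 1 entries of an unconstrained joint table; the saving grows exponentially with the number of variables.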
A BN structure G entails a set of conditional independence assumptions. They can all be identified by the d-separation criterion (Pearl, 1988). We use X ⊥G Y | Z to denote the assertion that X is d-separated from Y given Z in G. Formally, X ⊥G Y | Z is true when for every undirected path in G between X and Y, there exists a node W in the path such that either (1) W does not have two parents in the path and W ∈ Z, or (2) W has two parents in the path and neither W nor its descendants is in Z. If <G, P> is a BN, X ⊥P Y | Z if X ⊥G Y | Z. The converse does not necessarily hold. We say that <G, P> satisfies the faithfulness condition if the d-separations in G identify all and only the conditional independencies in P, i.e., X ⊥P Y | Z iff X ⊥G Y | Z.
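The d-separation criterion can be tested mechanically. The sketch below uses the standard ancestral moral graph construction (equivalent to the path-based definition above, though not the formulation the paper uses); the graph encoding {child: set of parents} is our own convention.

```python
def ancestors(dag, nodes):
    """Return `nodes` together with all their ancestors in a DAG
    encoded as {child: set_of_parents}."""
    result, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in result:
            result.add(n)
            stack.extend(dag.get(n, ()))
    return result

def d_separated(dag, x, y, z):
    """Test X d-separated from Y given Z: keep only ancestors of {x, y} ∪ z,
    moralize (marry co-parents, drop directions), then check that every
    path from x to y passes through z."""
    z = set(z)
    keep = ancestors(dag, {x, y} | z)
    adj = {n: set() for n in keep}
    for child in keep:
        ps = [p for p in dag.get(child, ()) if p in keep]
        for p in ps:                      # undirected parent-child edges
            adj[child].add(p)
            adj[p].add(child)
        for a in ps:                      # moralization: marry co-parents
            for b in ps:
                if a != b:
                    adj[a].add(b)
    seen, stack = {x}, [x]                # search that never enters z
    while stack:
        n = stack.pop()
        if n == y:
            return False                  # x and y still connected
        for m in adj[n] - seen - z:
            seen.add(m)
            stack.append(m)
    return True
```

For instance, in the collider X → Y ← Z, X and Z are d-separated by the empty set but not once Y is observed, matching condition (2) above; in the chain X → Y → Z the situation is reversed.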
A Markov blanket M_T of T is any set of variables such that T is conditionally independent of all the remaining variables given M_T. By extension, a Markov blanket of T in V guarantees that M_T ⊆ V, and that T is conditionally independent of the remaining variables in V, given M_T. A Markov boundary, MB_T, of T is any Markov blanket such that none of its proper subsets is a Markov blanket of T.
We denote by PC_T^G the set of parents and children of T in G, and by SP_T^G the set of spouses of T in G. The spouses of T are the variables that have common children with T. These sets are unique for all G such that <G, P> satisfies the faithfulness condition, and so we will drop the superscript G. We denote by dSep(X) the set that d-separates X from the (implicit) target T.
Theorem 1. Suppose <G, P> satisfies the faithfulness condition. Then X and Y are not adjacent in G iff ∃Z ⊆ U \ {X, Y} such that X ⊥⊥ Y | Z. Moreover, MB_X = PC_X ∪ SP_X.
A proof can be found, for instance, in (Neapolitan, 2004).
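Under faithfulness, Theorem 1 lets us read MB_T directly off the graph as parents ∪ children ∪ spouses. A minimal sketch, using the same {child: set of parents} encoding as our other illustrations (our convention, not the paper's):

```python
def markov_boundary(dag, t):
    """MB_T = PC_T ∪ SP_T: parents, children and spouses of t,
    read off a DAG encoded as {child: set_of_parents}."""
    parents = set(dag.get(t, set()))
    children = {c for c, ps in dag.items() if t in ps}
    spouses = {p for c in children for p in dag[c]} - {t}
    return parents | children | spouses
```

For example, in the DAG X → T → Y ← S, the Markov boundary of T is {X, Y, S}: the spouse S enters MB_T even though it is marginally independent of T.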
Two graphs are said to be equivalent iff they encode the same set of conditional independencies via the d-separation criterion. The equivalence class of a DAG G is the set of DAGs that are equivalent to G. The next result, established by Pearl (1988), shows that equivalent graphs have the same undirected graph but might disagree on the direction of some of the arcs.
Theorem 2. Two DAGs are equivalent iff they have the same underlying undirected graph and the same set of v-structures (i.e., converging edges into the same node, such as X → Y ← Z).
Moreover, an equivalence class of network structures can be uniquely represented by a partially directed DAG (PDAG), also called a DAG pattern. The DAG pattern is defined as the graph that has the same links as the DAGs in the equivalence class and has oriented all and only the edges common to the DAGs in the equivalence class. A structure learning algorithm from data is said to be correct (or sound) if it returns the correct DAG pattern (or a DAG in the correct equivalence class) under the assumptions that the independence tests are reliable and that the learning database is a sample from a distribution P faithful to a DAG G. The (ideal) assumption that the independence tests are reliable means that they decide (in)dependence iff the (in)dependence holds in P.
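Theorem 2 yields a direct procedure for testing Markov equivalence: compare skeletons and unshielded colliders. A sketch under the same {child: set of parents} encoding used in our earlier illustrations (our convention, not the paper's):

```python
def skeleton(dag):
    """Undirected edge set of a DAG encoded as {child: set_of_parents}."""
    return {frozenset((p, c)) for c, ps in dag.items() for p in ps}

def v_structures(dag):
    """Unshielded colliders A -> C <- B with A and B non-adjacent."""
    skel = skeleton(dag)
    vs = set()
    for child, parents in dag.items():
        for a in parents:
            for b in parents:
                if a < b and frozenset((a, b)) not in skel:
                    vs.add((a, child, b))
    return vs

def markov_equivalent(g1, g2):
    # Theorem 2: equivalent iff same skeleton and same v-structures.
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)
```

The chains X → Y → Z and Z → Y → X are equivalent (same skeleton, no v-structure), while the collider X → Y ← Z shares their skeleton but not their v-structures, so it lies in a different equivalence class.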
2.2. Conditional independence properties
The following three theorems, borrowed from Pena et al. (2007), are proven in Pearl (1988):
Theorem 3. Let X, Y, Z and W denote four mutually disjoint subsets of U. Any probability distribution p satisfies the following four properties: Symmetry, X ⊥⊥ Y | Z ⇒ Y ⊥⊥ X | Z; Decomposition, X ⊥⊥ (Y ∪ W) | Z ⇒ X ⊥⊥ Y | Z; Weak Union, X ⊥⊥ (Y ∪ W) | Z ⇒ X ⊥⊥ Y | (Z ∪ W); and Contraction, X ⊥⊥ Y | (Z ∪ W) ∧ X ⊥⊥ W | Z ⇒ X ⊥⊥ (Y ∪ W) | Z.
Theorem 4. If p is strictly positive, then p satisfies the previous four properties plus the Intersection property: X ⊥⊥ Y | (Z ∪ W) ∧ X ⊥⊥ W | (Z ∪ Y) ⇒ X ⊥⊥ (Y ∪ W) | Z. Additionally, each X ∈ U has a unique Markov boundary MB_X.
Theorem 5. If p is faithful to a DAG G, then p satisfies the previous five properties plus the Composition property, X ⊥⊥ Y | Z ∧ X ⊥⊥ W | Z ⇒ X ⊥⊥ (Y ∪ W) | Z, and the local Markov property, X ⊥⊥ (ND_X \ Pa_X) | Pa_X for each X ∈ U, where ND_X denotes the non-descendants of X in G.
2.3. Structure learning strategies
The number of DAGs G is super-exponential in the number of random variables in the domain, and the problem of learning the most probable a posteriori BN from data is worst-case NP-hard (Chickering et al., 2004). One needs to resort to heuristic methods in order to be able to solve very large problems effectively.
Both CB and SS heuristic approaches have advantages and disadvantages. CB approaches are relatively quick, deterministic, and have a well-defined stopping criterion; however, they rely on an arbitrary significance level to test for independence, and they can be unstable in the sense that an error early on in the search can have a cascading effect that causes many errors to be present in the final graph. SS approaches have the advantage of being able to flexibly incorporate users' background knowledge in the form of prior probabilities over the structures, and are also capable of dealing with incomplete records in the database (e.g., EM techniques). Although SS methods are favored in practice when dealing with small dimensional data sets, they are slow to converge and the computational complexity often prevents us from finding optimal BN structures (Perrier et al., 2008; Kojima et al., 2010). With currently available exact algorithms (Koivisto & Sood, 2004; Silander & Myllymaki, 2006; Cussens & Bartlett, 2013; Studeny & Haws, 2014) and a decomposable score like BDeu, the computational complexity remains exponential, and therefore such algorithms are intractable for BNs with more than around 30 vertices on current workstations. For larger sets of variables the computational burden becomes prohibitive and restrictions on the structure have to be imposed, such as a limit on the size of the parent sets. With this in mind, the ability to restrict the search locally around the target variable is a key advantage of CB methods over SS methods. They are able to construct a local graph around the target node without having to construct the whole BN first, hence their scalability (Pena et al., 2007; Rodrigues de Morais & Aussem, 2010b,a; Tsamardinos et al., 2006; Pena, 2008).
With a view to balancing the computation cost with the desired accuracy of the estimates, several hybrid methods have been proposed recently. Tsamardinos et al. proposed in (Tsamardinos et al., 2006) the Max-Min Hill-Climbing (MMHC) algorithm and conducted one of the most extensive empirical comparisons performed in recent years, showing that MMHC was the fastest and the most accurate method in terms of structural error based on the structural Hamming distance. More specifically, MMHC outperformed, both in terms of time efficiency and quality of reconstruction, the PC algorithm (Spirtes et al., 2000), the Sparse Candidate (Friedman et al., 1999b), the Three Phase Dependency Analysis (Cheng et al., 2002), the Optimal Reinsertion (Moore & Wong, 2003), the Greedy Equivalence Search (Chickering, 2002), and the Greedy Hill-Climbing Search, on a variety of networks, sample sizes, and parameter values. Although MMHC is rather heuristic by nature (it returns a local optimum of the score function), it is currently considered as the most powerful state-of-the-art algorithm for BN structure learning capable of dealing with thousands of nodes in reasonable time.
In order to enhance its performance on small dimensional data sets, Perrier et al. proposed in (Perrier et al., 2008) a hybrid algorithm that can learn an optimal BN (i.e., it converges to the true model in the sample limit) when an undirected graph is given as a structural constraint. They defined this undirected graph as a super-structure (i.e., every DAG considered in the SS phase is compelled to be a subgraph of the super-structure). This algorithm can learn optimal BNs containing up to 50 vertices when the average degree of the super-structure is around two, that is, when a sparse structural constraint is assumed. To extend its feasibility to BNs with a few hundred vertices and an average degree up to four, Kojima et al. proposed in (Kojima et al., 2010) to divide the super-structure into several clusters and perform an optimal search on each of them in order to scale up to larger networks. Despite interesting improvements in terms of score and structural Hamming distance on several benchmark BNs, they report running times about 10³ times longer than MMHC on average, which is still prohibitive.
Therefore, there is a great deal of interest in hybrid methods capable of improving the structural accuracy of both CB and SS methods on graphs containing up to thousands of vertices. However, they make the strong assumption that the skeleton (also called super-structure) contains at least the edges of the true network and as few extra edges as possible. While controlling the false discovery rate (i.e., false extra edges) in BN learning has attracted some attention recently (Armen & Tsamardinos, 2011; Pena, 2008; Tsamardinos & Brown, 2008), to our knowledge, there is no work on actively controlling the rate of false-negative errors (i.e., false missing edges).
2.4. Constraint-based structure learning
Before introducing the H2PC algorithm, we discuss the general idea behind CB methods. The induction of local or global BN structures is handled by CB methods through the identification of local neighborhoods (i.e., PC_X), hence their scalability to very high dimensional data sets. CB methods systematically check the data for conditional independence relationships in order to infer a target's neighborhood. Typically, the algorithms run either a G² or a χ² independence test when the data set is discrete, and a Fisher's Z test when it is continuous, in order to decide on dependence or independence, that is, upon the rejection or acceptance of the null hypothesis of conditional independence. Since we are limiting ourselves to discrete data, both the global and the local distributions are assumed to be multinomial, and the latter are represented as conditional probability tables. Conditional independence tests and network scores for discrete data are functions of these conditional probability tables through the observed frequencies {n_ijk; i = 1, . . . , R; j = 1, . . . , C; k = 1, . . . , L} for the random variables X and Y and all the configurations of the levels of the conditioning variables Z. We use n_i+k as shorthand for the marginal Σ_j n_ijk, and similarly for n_+jk, n_++k and n_+++ = n. We use a classic conditional independence test based on the mutual information, an information-theoretic distance measure defined as

MI(X, Y | Z) = Σ_{i=1}^{R} Σ_{j=1}^{C} Σ_{k=1}^{L} (n_ijk / n) log( n_ijk n_++k / (n_i+k n_+jk) )
It is proportional to the log-likelihood ratio test statistic G² (they differ by a factor of 2n, where n is the sample size). The asymptotic null distribution is χ² with (R − 1)(C − 1)L degrees of freedom. For a detailed analysis of their properties we refer the reader to (Agresti, 2002). The main limitation of this test is the rate of convergence to its limiting distribution, which is particularly problematic when dealing with small samples and sparse contingency tables. The decision of accepting or rejecting the null hypothesis depends implicitly upon the degrees of freedom, which increase exponentially with the number of variables in the conditioning set. Several heuristic solutions have emerged in the literature (Spirtes et al., 2000; Rodrigues de Morais & Aussem, 2010a; Tsamardinos et al., 2006; Tsamardinos & Borboudakis, 2010) to overcome some shortcomings of the asymptotic tests. In this study we use the following two heuristics, which are also used in MMHC. First, we do not perform the test MI(X, Y | Z), and assume independence, if there are not enough samples to achieve large enough power: we require that the average sample count per cell is above a user-defined parameter, set to 5 as in (Tsamardinos et al., 2006). This heuristic is called the power rule. Second, we treat as a structural zero any case where n_+jk = 0 or n_i+k = 0. For example, if n_+jk = 0, we consider y as a structurally forbidden value for Y when Z = z and we reduce C by 1 (as if we had one column less in the contingency table where Z = z). This is known as the degrees of freedom adjustment heuristic.
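The statistic and the two heuristics can be sketched in plain Python as follows. The function names and the bookkeeping are ours for illustration; this is not the bnlearn implementation used in the paper.

```python
from collections import Counter
from math import log

def g2_statistic(xs, ys, zs):
    """G² = 2n · MI(X, Y | Z) from three parallel lists of discrete
    samples, with the degrees-of-freedom adjustment heuristic: only
    levels actually observed within each stratum Z = z are counted."""
    n = len(xs)
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    mi = 0.0
    for (x, y, z), nxyz in n_xyz.items():
        mi += (nxyz / n) * log(nxyz * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)]))
    df = 0
    for z in n_z:
        r = len({x for (x, zz) in n_xz if zz == z})   # observed X levels
        c = len({y for (y, zz) in n_yz if zz == z})   # observed Y levels
        df += (r - 1) * (c - 1)                        # adjusted per stratum
    return 2.0 * n * mi, df

def power_rule_ok(n, r, c, l, hpar=5):
    """Power rule: skip the test (assume independence) unless the
    average sample count per cell reaches the threshold (5 here)."""
    return n / (r * c * l) >= hpar
```

With perfectly balanced independent counts the statistic is exactly 0; in general, comparing it against a χ² quantile with the adjusted degrees of freedom gives the accept/reject decision.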
3. The H2PC algorithm
In this section, we present our hybrid algorithm for Bayesian network structure learning, called Hybrid HPC (H2PC). It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to filter and orient the edges. It is based on a CB subroutine called HPC to learn the parents and children of a single variable. So, we shall discuss HPC first and then move to H2PC.
3.1. Parents and Children Discovery
HPC (Algorithm 1) can be viewed as an ensemble method for combining many weak PC learners in an attempt to produce a stronger PC learner. HPC is based on three subroutines: Data-Efficient Parents and Children Superset (DE-PCS), Data-Efficient Spouses Superset (DE-SPS), and Incremental Association Parents and Children with false discovery rate control (FDR-IAPC), a weak PC learner based on FDR-IAMB (Pena, 2008) that requires little computation. HPC receives a target node T, a data set D and a set of variables U as input and returns an estimation of PC_T. It is hybrid in that it combines the benefits of incremental and divide-and-conquer methods. The procedure starts by extracting a superset PCS_T of PC_T (line 1) and a superset SPS_T of SP_T (line 2) with a severe restriction on the maximum conditioning size (|Z| ≤ 2) in order to significantly increase the reliability of the tests. A first candidate PC set is then obtained by running the weak PC learner FDR-IAPC on PCS_T ∪ SPS_T (line 3). The key idea is the decentralized search at lines 4-8, which includes in the candidate PC set all variables in the superset PCS_T \ PC_T that have T in their vicinity. So, HPC may be thought of as a way to compensate for the large number of false negatives at the output of the weak PC learner, by performing extra computations. Note that, in theory, X is in the output of FDR-IAPC(Y) if and only if Y is in the output of FDR-IAPC(X). However, in practice, this may not always be true, particularly when working in high-dimensional domains (Pena, 2008). By loosening the criteria by which two nodes are said to be adjacent, the effective restrictions on the size of the neighborhood are now far less severe. The decentralized search has a significant impact on the accuracy of HPC, as we shall see in the experiments. We proved in (Rodrigues de Morais & Aussem, 2010a) that the original HPC(T) is consistent, i.e., its output converges in probability to PC_T, if the hypothesis tests are consistent. The proof also applies to the modified version presented here.
We now discuss the subroutines in more detail. FDR-IAPC (Algorithm 2) is a fast incremental method that receives a data set D and a target node T as its input and promptly returns a rough estimation of PC_T, hence the term "weak" PC learner. In this study, we use FDR-IAPC because it aims at controlling the expected proportion of false discoveries (i.e., false positive nodes in PC_T) among all the discoveries made. FDR-IAPC is a straightforward extension of the algorithm IAMBFDR developed by Jose Pena in (Pena, 2008), which is itself a modification of the incremental association Markov boundary algorithm (IAMB) (Tsamardinos et al., 2003), to control the expected proportion of false discoveries (i.e., false positive nodes) in the estimated Markov boundary. FDR-IAPC simply removes, at lines 3-6, the spouses SP_T from the estimated Markov boundary MB_T output by IAMBFDR at line 1, and returns PC_T assuming the faithfulness condition.
The subroutines DE-PCS (Algorithm 3) and DE-SPS (Algorithm 4) search a superset of PC_T and SP_T respectively, with a severe restriction on the maximum conditioning size (|Z| ≤ 1 in DE-PCS and |Z| ≤ 2 in DE-SPS) in order to significantly increase the reliability of the tests. The variable filtering has two advantages: i) it allows HPC to scale to hundreds of thousands of variables by restricting the search to a subset of relevant variables, and ii) it eliminates many true or approximate deterministic relationships that produce many false negative errors in the output of the algorithm, as explained in (Rodrigues de Morais & Aussem, 2010b,a). DE-SPS works in two steps. First, a growing phase (lines 4-8) adds the variables that are d-separated from the target but still remain associated with the target when conditioned on another variable from PCS_T. The shrinking phase (lines 9-16) discards irrelevant variables that are ancestors or descendants of a target's spouse. Pruning such irrelevant variables speeds up HPC.
3.2. Hybrid HPC (H2PC)
In this section, we discuss the SS phase. The following discussion draws strongly on (Tsamardinos et al., 2006), as the SS phase in Hybrid HPC and MMHC are exactly the same. The idea of constraining the search to improve time-efficiency first appeared in the Sparse Candidate algorithm (Friedman et al., 1999b). It results in efficiency improvements over the (unconstrained) greedy search. All recent hybrid algorithms
Algorithm 1 HPC
Require: T: target; D: data set; U: set of variables
Ensure: PC_T: parents and children of T
1: [PCS_T, dSep] ← DE-PCS(T, D, U)
2: SPS_T ← DE-SPS(T, D, U, PCS_T, dSep)
3: PC_T ← FDR-IAPC(T, D, (T ∪ PCS_T ∪ SPS_T))
4: for all X ∈ PCS_T \ PC_T do
5:   if T ∈ FDR-IAPC(X, D, (T ∪ PCS_T ∪ SPS_T)) then
6:     PC_T ← PC_T ∪ X
7:   end if
8: end for
Algorithm 2 FDR-IAPC
Require: T: target; D: data set; U: set of variables
Ensure: PC_T: parents and children of T
   * Learn the Markov boundary of T
1: MB_T ← IAMBFDR(T, D, U)
   * Remove spouses of T from MB_T
2: PC_T ← MB_T
3: for all X ∈ MB_T do
4:   if ∃Z ⊆ (MB_T \ X) such that T ⊥⊥ X | Z then
5:     PC_T ← PC_T \ X
6:   end if
7: end for
Algorithm 3 DE-PCS
Require: T: target; D: data set; U: set of variables
Ensure: PCS_T: parents and children superset of T; dSep: d-separating sets
   Phase I: remove X if T ⊥⊥ X
1: PCS_T ← U \ T
2: for all X ∈ PCS_T do
3:   if (T ⊥⊥ X) then
4:     PCS_T ← PCS_T \ X
5:     dSep(X) ← ∅
6:   end if
7: end for
   Phase II: remove X if T ⊥⊥ X | Y
8: for all X ∈ PCS_T do
9:   for all Y ∈ PCS_T \ X do
10:    if (T ⊥⊥ X | Y) then
11:      PCS_T ← PCS_T \ X
12:      dSep(X) ← Y
13:      break loop FOR
14:    end if
15:  end for
16: end for
Algorithm 4 DE-SPS
Require: T: target; D: data set; U: set of variables; PCS_T: parents and children superset of T; dSep: d-separating sets
Ensure: SPS_T: superset of the spouses of T
1: SPS_T ← ∅
2: for all X ∈ PCS_T do
3:   SPS_T^X ← ∅
4:   for all Y ∈ U \ {T ∪ PCS_T} do
5:     if ¬(T ⊥⊥ Y | dSep(Y) ∪ X) then
6:       SPS_T^X ← SPS_T^X ∪ Y
7:     end if
8:   end for
9:   for all Y ∈ SPS_T^X do
10:    for all Z ∈ SPS_T^X \ Y do
11:      if (T ⊥⊥ Y | X ∪ Z) then
12:        SPS_T^X ← SPS_T^X \ Y
13:        break loop FOR
14:      end if
15:    end for
16:  end for
17:  SPS_T ← SPS_T ∪ SPS_T^X
18: end for
Algorithm 5 Hybrid HPC
Require: D: data set; U: set of variables
Ensure: a DAG G on the variables U
1: for all pairs of nodes X, Y ∈ U do
2:   add X to PC_Y and Y to PC_X if X ∈ HPC(Y) and Y ∈ HPC(X)
3: end for
4: starting from an empty graph, perform greedy hill-climbing with operators add-edge, delete-edge, reverse-edge; only try operator add-edge X → Y if Y ∈ PC_X
build on this idea, but employ a sound algorithm for identifying the candidate parent sets. The Hybrid HPC first identifies the parents and children set of each variable, then performs a greedy hill-climbing search in the space of BNs. The search begins with an empty graph. The edge addition, deletion, or direction reversal that leads to the largest increase in score (the BDeu score with uniform prior was used) is taken, and the search continues in a similar fashion recursively. The important difference from standard greedy search is that the search is constrained to only consider adding an edge if it was discovered by HPC in the first phase. We extend the greedy search with a TABU list (Friedman et al., 1999b). The list keeps the last 100 structures explored. Instead of applying the best local change, the best local change that results in a structure not on the list is performed, in an attempt to escape local maxima. When 15 changes occur without an increase in the maximum score ever encountered during search, the algorithm terminates. The overall best scoring structure is then returned. Clearly, the more false positives the heuristic allows to enter the candidate PC sets, the more computational burden is imposed in the SS phase.
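The constrained search loop described above can be sketched as follows. The score function is left abstract (the paper uses BDeu; any score works for the sketch), and the data structures and function names are our own simplifications, not the bnlearn implementation.

```python
def edges_of(dag):
    """Directed edge set of a DAG encoded as {child: set_of_parents}."""
    return {(p, c) for c, ps in dag.items() for p in ps}

def is_ancestor(dag, a, b):
    """True if there is a directed path a -> ... -> b."""
    stack, seen = [b], set()
    while stack:
        n = stack.pop()
        for p in dag.get(n, ()):
            if p == a:
                return True
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return False

def apply_move(dag, move):
    """Apply an add/delete/reverse move; return None if it creates a cycle."""
    op, x, y = move
    new = {n: set(ps) for n, ps in dag.items()}
    if op == 'add':
        if is_ancestor(new, y, x):
            return None
        new[y].add(x)
    elif op == 'delete':
        new[y].discard(x)
    else:  # reverse x -> y into y -> x
        new[y].discard(x)
        if is_ancestor(new, x, y):
            return None
        new[x].add(y)
    return new

def moves(dag, nodes, pc):
    """Candidate moves; add-edge X -> Y is only tried if Y ∈ PC_X."""
    e = edges_of(dag)
    for x in nodes:
        for y in nodes:
            if x != y and (x, y) not in e and y in pc.get(x, set()):
                yield ('add', x, y)
    for (x, y) in e:
        yield ('delete', x, y)
        yield ('reverse', x, y)

def hill_climb(nodes, pc, score, patience=15, tabu_len=100):
    """Greedy search with a TABU list of the last `tabu_len` structures;
    stops after `patience` moves without improving the best score."""
    dag = {n: set() for n in nodes}
    tabu = [frozenset(edges_of(dag))]
    best = {n: set() for n in nodes}
    best_score = score(dag)
    stall = 0
    while stall < patience:
        cand = None
        for m in moves(dag, nodes, pc):
            new = apply_move(dag, m)
            if new is None or frozenset(edges_of(new)) in tabu:
                continue
            s = score(new)
            if cand is None or s > cand[0]:
                cand = (s, new)
        if cand is None:
            break
        cur, dag = cand
        tabu = (tabu + [frozenset(edges_of(dag))])[-tabu_len:]
        if cur > best_score:
            best_score = cur
            best = {n: set(ps) for n, ps in dag.items()}
            stall = 0
        else:
            stall += 1
    return best
```

With a toy score that rewards agreement with a known edge set, the search recovers that structure, provided the candidate PC sets from the CB phase allow the true edges to be added.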
4. Experimental validation on synthetic data
Before we proceed to the experiments on real-world multi-label data with H2PC, we first conduct an experimental comparison of H2PC against MMHC on synthetic data sets sampled from eight well-known benchmarks with various data sizes, in order to gauge the practical relevance of H2PC. These BNs have been previously used as benchmarks for BN learning algorithms (see Table 1 for details). Results are reported in terms of various performance indicators to investigate how well the network generalizes to new data and how well the learned dependence structure matches the true structure of the benchmark network. We implemented H2PC in R (R Core Team, 2013) and integrated the code into the bnlearn package from (Scutari, 2010). MMHC was implemented by Marco Scutari in bnlearn. The source code of H2PC as well as all data sets used for the empirical tests are publicly available². The threshold considered for the type I error of the test is 0.05. Our experiments were carried out on a PC with an Intel(R) Core(TM) i5-3470M CPU @ 3.20 GHz and 4 GB RAM running under 64-bit Linux.
We do not claim that those data sets resemble real-world problems; however, they make it possible to compare the outputs of the algorithms with the known structure. All BN benchmarks (structure and probability tables) were downloaded from the bnlearn repository3. Ten sample sizes have been considered: 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000 and 50000. All experiments are repeated 10 times for each sample size and each BN. We investigate the behavior of both algorithms using the same parametric tests as a reference.

2 https://github.com/madbix/bnlearn-clone-3.4

Table 1: Description of the BN benchmarks used in the experiments.

network      # var.  # edges  max. degree  domain   PC set size
                              in/out       range    min/med/max
child          20       25     2/7          2-6      1/2/8
insurance      27       52     3/7          2-5      1/3/9
mildew         35       46     3/3          3-100    1/2/5
alarm          37       46     4/5          2-4      1/2/6
hailfinder     56       66     4/16         2-11     1/1.5/17
munin1        186      273     3/15         2-21     1/3/15
pigs          441      592     2/39         3-3      1/2/41
link          724     1125     3/14         2-4      0/2/17
4.1. Performance indicators
We first investigate the quality of the skeleton returned by H2PC during the CB phase. To this end, we measure the false positive edge ratio, the precision (i.e., the number of true positive edges in the output divided by the number of edges in the output), the recall (i.e., the number of true positive edges divided by the true number of edges), and a combination of precision and recall defined as \sqrt{(1 - \text{precision})^2 + (1 - \text{recall})^2}, which measures the Euclidean distance from perfect precision and recall, as proposed in (Pena et al., 2007). Second, to assess the quality of the final DAG output at the end of the SS phase, we report the five performance indicators (Scutari & Brogini, 2012) described below:
• the posterior density of the network for the data it was learned from, as a measure of goodness of fit. It is known as the Bayesian Dirichlet equivalent score (BDeu) from (Heckerman et al., 1995; Buntine, 1991) and has a single parameter, the equivalent sample size, which can be thought of as the size of an imaginary sample supporting the prior distribution. The equivalent sample size was set to 10 as suggested in (Koller & Friedman, 2009);

• the BIC score (Schwarz, 1978) of the network for the data it was learned from, again as a measure of goodness of fit;

• the posterior density of the network for a new data set, as a measure of how well the network generalizes to new data;

• the BIC score of the network for a new data set, again as a measure of how well the network generalizes to new data;

• the Structural Hamming Distance (SHD) between the learned and the true structure of the network, as a measure of the quality of the learned dependence structure. The SHD between two PDAGs is defined as the number of the following operators required to make the PDAGs match: add or delete an undirected edge, and add, remove, or reverse the orientation of an edge.

3 http://www.bnlearn.com/bnrepository
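The skeleton-quality indicators (precision, recall, and their Euclidean distance from the ideal point) and a simplified SHD count can be sketched in Python. This is our own illustration, not the bnlearn implementation; in particular, the `shd` shown simply charges one operator per differing adjacency or orientation, which matches the operator count described above.

```python
from math import hypot

def skeleton_quality(learned, true):
    """Precision, recall, and Euclidean distance from perfect precision/recall.
    `learned` and `true` are sets of undirected edges frozenset({u, v})."""
    tp = len(learned & true)                       # true positive edges
    precision = tp / len(learned) if learned else 0.0
    recall = tp / len(true) if true else 0.0
    return precision, recall, hypot(1 - precision, 1 - recall)

def shd(pdag_a, pdag_b):
    """Simplified Structural Hamming Distance between two PDAGs, each given as
    a dict mapping frozenset({u, v}) to (u, v) for a directed edge u -> v, or
    to None for an undirected edge. Every differing entry (missing edge or
    different orientation) costs exactly one operator."""
    pairs = set(pdag_a) | set(pdag_b)
    return sum(1 for p in pairs
               if pdag_a.get(p, "absent") != pdag_b.get(p, "absent"))
```

For example, a learned skeleton with one correct and one spurious edge against a single-edge true skeleton yields precision 0.5, recall 1.0, and distance 0.5.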
For each data set sampled from the true probability distribution of the benchmark, we first learn a network structure with H2PC and MMHC, and then we compute the relevant performance indicators for each pair of network structures. The data set used to assess how well the network generalizes to new data is generated again from the true probability structure of the benchmark networks and contains 50000 observations.

Notice that using the BDeu score as a metric of reconstruction quality has the following two problems. First, the score corresponds to the a posteriori probability of a network only under certain conditions (e.g., a Dirichlet distribution of the hyperparameters); it is unknown to what degree these assumptions hold in distributions encountered in practice. Second, the score is highly sensitive to the equivalent sample size (set to 10 in our experiments) and depends on the network priors used. Since, typically, the same arbitrary value of this parameter is used both during learning and for scoring the learned network, the metric favors algorithms that use the BDeu score for learning. In fact, the BDeu score does not rely on the structure of the original, gold standard network at all; instead it employs several assumptions to score the networks. For those reasons, in addition to the score we also report the BIC score and the SHD metric.
4.2. Results
In Figure 1, we report the quality of the skeleton obtained with HPC over that obtained with MMPC (before the SS phase) as a function of the sample size. Results for each benchmark are not shown here in detail due to space restrictions. For the sake of conciseness, the performance values are averaged over the 8 benchmarks depicted in Table 1. The increase factor for a given performance indicator is expressed as the ratio of the performance value obtained with HPC over that obtained with MMPC (the gold standard). Note that for some indicators, an increase is actually not an improvement but is worse (e.g., false positive rate, Euclidean distance). For clarity, we mention explicitly on the subplots whether an increase factor > 1 should be interpreted as an improvement or not. Regarding the quality of the superstructure, the advantages of HPC against MMPC are noticeable. As observed, HPC consistently increases the recall and reduces the rate of false negative edges. As expected, this benefit comes at a little expense in terms of false positive edges. HPC also improves the Euclidean distance from perfect precision and recall on all benchmarks, while increasing the number of independence tests and thus the running time in the CB phase (see number of statistical tests). It is worth noting that HPC is capable of reducing by 50% the Euclidean distance with 50000 samples (lower left plot). These results are very much in line with other experiments presented in (Rodrigues de Morais & Aussem, 2010a; Villanueva & Maciel, 2012).

Figure 1: Quality of the skeleton obtained with HPC over that obtained with MMPC (after the CB phase). The first 2 subplots present mean values aggregated over the 8 benchmarks (false negative rate and false positive rate; lower is better). The last 4 subplots present increase factors of HPC/MMPC as a function of the sample size (recall and precision: higher is better; Euclidean distance: lower is better; number of statistical tests), with the median, quartile, and most extreme values (green boxplots), along with the mean value (black line).
In Figure 2, we report the quality of the final DAG obtained with H2PC over that obtained with MMHC (after the SS phase) as a function of the sample size. Regarding BDeu and BIC on both training and test data, the improvements are noteworthy. The results in terms of goodness of fit to training data and new data using H2PC clearly dominate those obtained using MMHC, whatever the sample size considered, hence its ability to generalize better. Regarding the quality of the network structure itself (i.e., how close the DAG is to the true dependence structure of the data), it is pretty much a dead heat between the 2 algorithms on small sample sizes (i.e., 50 and 100); however, we found H2PC to perform significantly better on larger sample sizes. The SHD increase factor decays rapidly (lower is better) as the sample size increases (lower left plot). For 50000 samples, the SHD is on average only 50% that of MMHC. Regarding the computational burden involved, we may observe from Table 2 that H2PC incurs a computational overhead compared to MMHC. The running time increase factor grows somewhat linearly with the sample size. With 50000 samples, H2PC is approximately 10 times slower on average than MMHC. This is mainly due to the computational expense incurred in obtaining larger PC sets with HPC, compared to MMPC.
5. Application to multi-label learning
In this section, we address the problem of multi-label learning with H2PC. MLC is a challenging problem in many real-world application domains, where each instance can be assigned simultaneously to multiple binary labels (Dembczyński et al., 2012; Read et al., 2009; Madjarov et al., 2012; Kocev et al., 2007; Tsoumakas et al., 2010b). Formally, learning from multi-label examples amounts to finding a mapping from a space of features to a space of labels. We shall assume throughout that X (a random vector in R^d) is the feature set, Y (a random vector in {0, 1}^n) is the label set, U = X ∪ Y, and p is a probability distribution defined over U. Given a multi-label training set D, the goal of multi-label learning is to find a function which is able to map any unseen example to its proper set of labels. From the Bayesian point of view, this problem amounts to modeling the conditional joint distribution p(Y | X).
5.1. Related work
This MLC problem may be tackled in various ways (Luaces et al., 2012; Alvares-Cherman et al., 2011; Read et al., 2009; Blockeel et al., 1998; Kocev et al., 2007). Each of these approaches is supposed to capture, to some extent, the relationships between labels. The two most straightforward meta-learning methods (Madjarov et al., 2012) are Binary Relevance (BR) (Luaces et al., 2012) and Label Powerset (LP) (Tsoumakas & Vlahavas, 2007; Tsoumakas et al., 2010b). Both methods can be regarded as opposites in the sense that BR considers each label independently, while LP considers the whole label set at once (one multi-class problem). An important question remains: what exactly shall we capture from the statistical relationships between labels to solve the multi-label classification problem? The problem has attracted a great deal of interest (Dembczyński et al., 2012; Zhang & Zhang, 2010). It is well beyond the scope and purpose of this paper to delve deeper into these approaches; we point the reader to (Dembczyński et al., 2012; Tsoumakas et al., 2010b) for a review. The second fundamental problem that we wish to address involves finding an optimal feature subset selection of a label set, w.r.t. an Information Theory criterion (Koller & Sahami, 1996). As in the single-label case, multi-label feature selection has been studied recently and has encountered some success (Gu et al., 2011; Spolaor et al., 2013).
5.2. Label powerset decomposition
We shall first introduce the concept of label powerset that will play a pivotal role in the factorization of the conditional distribution p(Y | X).
Definition 1. YLP ⊆ Y is called a label powerset iff YLP ⊥⊥ Y \ YLP | X. Additionally, YLP is said to be minimal if it is non-empty and has no label powerset as a proper subset.

Figure 2: Quality of the final DAG obtained with H2PC over that obtained with MMHC (after the SS phase). The 6 subplots present increase factors of H2PC/MMHC as a function of the sample size: log(BDeu posterior) on train data, BIC on train data, log(BDeu posterior) on test data, BIC on test data, and Structural Hamming Distance (lower is better for all five), plus the number of scores; with the median, quartile, and most extreme values (green boxplots), along with the mean value (black line).

Table 2: Total running time ratio R (H2PC/MMHC), mean ± standard deviation. A ratio R < 1 is in favor of H2PC; a ratio R > 1 is in favor of MMHC.

Network      50        100       200       500       1000      2000      5000       10000      20000      50000
child        0.94±0.1  0.87±0.1  1.14±0.1  1.99±0.2  2.26±0.1  2.12±0.2   2.36±0.4   2.58±0.3   1.75±0.6   1.78±0.5
insurance    0.96±0.1  1.09±0.1  1.56±0.1  2.93±0.2  3.06±0.3  3.48±0.4   3.69±0.3   4.10±0.4   3.76±0.6   3.75±0.5
mildew       0.77±0.1  0.80±0.1  0.79±0.1  0.94±0.1  1.01±0.1  1.23±0.1   1.74±0.2   2.14±0.2   3.26±0.6   6.20±1.0
alarm        0.88±0.1  1.11±0.1  1.75±0.1  2.43±0.1  2.55±0.1  2.71±0.1   2.65±0.2   2.80±0.2   2.49±0.3   2.18±0.6
hailfinder   0.85±0.1  0.85±0.1  1.40±0.1  1.69±0.1  1.83±0.1  2.06±0.1   2.13±0.1   2.12±0.2   1.95±0.2   1.96±0.6
munin1       0.77±0.0  0.85±0.0  0.93±0.0  1.35±0.0  2.11±0.0  4.30±0.2  12.92±0.7  23.32±2.6  24.95±5.1  24.76±6.7
pigs         0.80±0.0  0.80±0.0  4.55±0.1  4.71±0.1  5.00±0.2  5.62±0.2   7.63±0.3  11.10±0.6  14.02±1.7  11.74±3.2
link         1.16±0.0  1.93±0.0  2.76±0.0  5.55±0.1  7.04±0.2  8.19±0.2  10.00±0.3  13.87±0.4  15.32±2.5  24.74±4.2
all          0.89±0.1  1.04±0.1  1.86±1.2  2.70±1.5  3.11±1.9  3.71±2.2   5.39±4.0   7.75±7.3   8.44±8.4   9.64±9.7
Let LP denote the set of all powersets defined over U, and minLP the set of all minimal label powersets. It is easily shown that {Y, ∅} ⊆ LP. The key idea behind label powersets is the decomposition of the conditional distribution of the labels into a product of factors

p(Y | X) = \prod_{j=1}^{L} p(Y_{LP_j} | X) = \prod_{j=1}^{L} p(Y_{LP_j} | M_{LP_j})

where {Y_{LP_1}, ..., Y_{LP_L}} is a partition of label powersets and M_{LP_j} is a Markov blanket of Y_{LP_j} in X. From the above definition, we have Y_{LP_i} ⊥⊥ Y_{LP_j} | X, ∀ i ≠ j.
In the framework of MLC, one can consider a multitude of loss functions. In this study, we focus on a non label-wise decomposable loss function called the subset 0/1 loss, which generalizes the well-known 0/1 loss from the conventional to the multi-label setting. The risk-minimizing prediction for subset 0/1 loss is simply given by the mode of the distribution (Dembczyński et al., 2012). Therefore, we seek a factorization into a product of minimal factors in order to facilitate the estimation of the mode of the conditional distribution (also called the most probable explanation (MPE)):

\max_y p(y | x) = \prod_{j=1}^{L} \max_{y_{LP_j}} p(y_{LP_j} | x) = \prod_{j=1}^{L} \max_{y_{LP_j}} p(y_{LP_j} | m_{LP_j})
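Because the minimal powersets are conditionally independent given x, the maximization above decomposes: the joint mode is just the concatenation of the per-factor modes. A small Python sketch, using our own toy representation of the per-powerset conditionals p(y_{LP_j} | x) as dictionaries:

```python
def mpe(factors):
    """Most probable explanation of a factorized conditional distribution.
    `factors` is a list of (label_names, table) pairs, where `table` maps a
    tuple of label values to its conditional probability given x. Since the
    powersets are conditionally independent given x, the joint mode is the
    concatenation of the per-factor modes."""
    assignment, prob = {}, 1.0
    for labels, table in factors:
        best = max(table, key=table.get)   # per-factor mode
        prob *= table[best]
        assignment.update(dict(zip(labels, best)))
    return assignment, prob
```

For instance, with factors p(y1, y2 | x) and p(y3 | x), the MPE is obtained by maximizing each table separately and multiplying the two maxima.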
The next section aims to obtain theoretical results for the characterization of the minimal label powersets Y_{LP_j} and their Markov boundaries M_{LP_j} from a DAG, in order to be able to estimate the MPE more effectively.
5.3. Label powerset characterization
In this section, we show that minimal label powersets can be depicted graphically when p is faithful to a DAG.

Theorem 6. Suppose p is faithful to a DAG G. Then, Yi and Yj belong to the same minimal label powerset if and only if there exists an undirected path in G between nodes Yi and Yj in Y such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e., a collider of the form Yp → X ← Yq).
The proof is given in the Appendix. We shall now address the following question: what is the Markov boundary in X of a minimal label powerset? Answering this question for general distributions is not trivial. We establish a useful relation between a label powerset and its Markov boundary in X when p is faithful to a DAG.

Theorem 7. Suppose p is faithful to a DAG G. Let YLP = {Y1, Y2, ..., Yn} be a label powerset. Then, its Markov boundary M in U is also its Markov boundary in X, and is given in G by M = \bigcup_{j=1}^{n} (PC_{Y_j} ∪ SP_{Y_j}) \ Y.
The proof is given in the Appendix.
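Assuming the DAG is available as a dict of parent sets, the two graphical characterizations above can be sketched as follows. This is our own Python illustration (all names are ours, not bnlearn's): `minimal_label_powersets` implements the connectivity criterion of Theorem 6 as connected components of an auxiliary graph over Y, and `markov_boundary` reads off M = ⋃(PC ∪ SP) \ Y as in Theorem 7.

```python
def minimal_label_powersets(parents, labels):
    """Theorem 6: Yi and Yj fall in the same minimal powerset iff they are
    connected through label nodes or through a feature collider Yp -> X <- Yq.
    `parents` maps every node of G to its parent set; `labels` is the set Y."""
    adj = {y: set() for y in labels}       # auxiliary undirected graph over Y
    for node, pa in parents.items():
        if node in labels:
            for p in pa:                   # direct edge between two labels
                if p in labels:
                    adj[node].add(p); adj[p].add(node)
        else:                              # feature with >= 2 label parents
            lab_pa = [p for p in pa if p in labels]
            for i, a in enumerate(lab_pa):
                for b in lab_pa[i + 1:]:
                    adj[a].add(b); adj[b].add(a)
    seen, powersets = set(), []            # connected components = powersets
    for y in labels:
        if y in seen:
            continue
        comp, stack = set(), [y]
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(adj[n])
        seen |= comp
        powersets.append(frozenset(comp))
    return powersets

def markov_boundary(parents, powerset, labels):
    """Theorem 7: M = union over Yj in the powerset of (PC_Yj ∪ SP_Yj) \\ Y."""
    children = {n: set() for n in parents}
    for node, pa in parents.items():
        for p in pa:
            children[p].add(node)
    m = set()
    for y in powerset:
        m |= parents[y] | children[y]      # PC: parents and children of Yj
        for c in children[y]:              # SP: spouses, i.e. the other
            m |= parents[c]                #     parents of Yj's children
    return m - set(labels)
```

The auxiliary-graph construction is valid because any qualifying path decomposes into elementary steps of the two kinds (i) and (ii) named in Theorem 6.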
5.4. Experimental setup
The MLC problem is decomposed into a series of multi-class classification problems, where each multi-class variable encodes one label powerset, with as many classes as the number of possible combinations of labels, or those present in the training data. At this point, it should be noted that the LP method is a special case of this framework, since the whole label set is a particular label powerset (not necessarily minimal, though). The above procedure can be summarized as follows:
1. Learn the BN local graph G around the label set;
2. Read off G the minimal label powersets and their respective Markov boundaries;
3. Train an independent multi-class classifier on each minimal LP, with the input space restricted to its Markov boundary in G;
4. Aggregate the predictions of each classifier to output the most probable explanation, i.e., arg max_y p(y | x).
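Steps 3 and 4 above can be sketched end-to-end. In the paper the base learner is a Random Forest in R, so the trivial majority-vote classifier below is only a self-contained stand-in, and all names (`train_mlp`, `predict_mlp`, `MajorityClassifier`) are ours:

```python
from collections import Counter, defaultdict

class MajorityClassifier:
    """Stand-in multi-class learner: predicts the class most often seen with a
    given (restricted) feature vector, falling back to the global mode."""
    def fit(self, X, y):
        self.by_x = defaultdict(Counter)
        for xi, yi in zip(X, y):
            self.by_x[tuple(xi)][yi] += 1
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, xi):
        counts = self.by_x.get(tuple(xi))
        return counts.most_common(1)[0][0] if counts else self.default

def train_mlp(X, Y, powersets, boundaries, label_names):
    """Step 3: one multi-class classifier per minimal label powerset, with the
    input space restricted to the powerset's Markov boundary (column indices)."""
    models = []
    for ps, mb in zip(powersets, boundaries):
        idx = [label_names.index(l) for l in ps]
        Xr = [[x[j] for j in mb] for x in X]           # restricted input space
        yr = [tuple(y[i] for i in idx) for y in Y]     # powerset -> one class
        models.append((idx, mb, MajorityClassifier().fit(Xr, yr)))
    return models

def predict_mlp(models, x, n_labels):
    """Step 4: aggregate the per-powerset modes into the joint MPE prediction."""
    y = [0] * n_labels
    for idx, mb, clf in models:
        for i, v in zip(idx, clf.predict([x[j] for j in mb])):
            y[i] = v
    return y
```

Any multi-class learner with `fit`/`predict` could replace the stand-in; the essential points are the class encoding of each powerset and the restriction of each classifier's inputs to its Markov boundary.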
To assess separately the quality of the minimal label powerset decomposition and the feature subset selection with Markov boundaries, we investigate 4 scenarios:

• BR without feature selection (denoted BR): a classifier is trained on each single label with all features as input. This is the simplest approach as it does not exploit any label dependency. It serves as a baseline learner for comparison purposes.

• BR with feature selection (denoted BR+MB): a classifier is trained on each single label with the input space restricted to its Markov boundary in G. Compared to the previous strategy, we evaluate here the effectiveness of the feature selection task.

• Minimal label powerset method without feature selection (denoted MLP): the minimal label powersets are obtained from the DAG. All features are used as inputs.

• Minimal label powerset method with feature selection (denoted MLP+MB): the minimal label powersets are obtained from the DAG. The input space is restricted to the Markov boundary of the labels in that powerset.
We use the same base learner in each meta-learning method: the well-known Random Forest classifier (Breiman, 2001). RF achieves good performance in standard classification as well as in multi-label problems (Kocev et al., 2007; Madjarov et al., 2012), and is able to handle both continuous and discrete data easily, which is much appreciated. The standard RF implementation in R (Liaw & Wiener, 2002)4 was used. For practical purposes, we restricted the forest size of RF to 100 trees, and left the other parameters at their default values.
5.4.1. Data and performance indicators
A total of 10 multi-label data sets are collected for the experiments in this section; their characteristics are summarized in Table 3. These data sets come from different problem domains including text, biology, and music. They can be found in the Mulan5 repository, except for image, which comes from Zhou6 (Maron & Ratan, 1998). Let D be the multi-label data set; we use |D|, dim(D), L(D), F(D) to represent the number of examples, number of features, number of possible labels, and feature type, respectively. DL(D) = |{Y | ∃x : (x, Y) ∈ D}| counts the number of distinct label combinations appearing in the data set. Continuous values are binarized during the BN structure learning phase.
Table 3: Data sets characteristics

data set   domain    |D|    dim(D)  F(D)   L(D)  DL(D)
emotions   music      593     72    cont.    6     27
yeast      biology   2417    103    cont.   14    198
image      images    2000    135    cont.    5     20
scene      images    2407    294    cont.    6     15
slashdot   text      3782   1079    disc.   22    156
genbase    biology    662   1186    disc.   27     32
medical    text       978   1449    disc.   45     94
enron      text      1702   1001    disc.   53    753
bibtex     text      7395   1836    disc.  159   2856
corel5k    images    5000    499    disc.  374   3175
The performance of a multi-label classifier can be assessed by several evaluation measures (Tsoumakas et al., 2010a). We focus on maximizing a non-decomposable score function: the global accuracy (also termed subset accuracy, the complement of the 0/1 loss), which measures the correct classification rate of the whole label set (exact match of all labels required).
4 http://cran.r-project.org/web/packages/randomForest
5 http://mulan.sourceforge.net/datasets.html
6 http://lamda.nju.edu.cn/data_MIMLimage.ashx
Note that the global accuracy implicitly takes into account the label correlations. It is therefore a very strict evaluation measure, as it requires an exact match of the predicted and the true set of labels. It was recently proved in (Dembczyński et al., 2012) that BR is optimal for decomposable loss functions (e.g., the Hamming loss), while non-decomposable loss functions (e.g., the subset loss) inherently require knowledge of the label conditional distribution. 10-fold cross-validation was performed for the evaluation of the MLC methods.
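The global (subset) accuracy used throughout the evaluation reduces to an exact-match rate over whole label vectors; a minimal Python sketch:

```python
def global_accuracy(Y_true, Y_pred):
    """Subset 0/1 accuracy: a prediction scores only on an exact match of the
    whole label vector (one row per example)."""
    assert len(Y_true) == len(Y_pred)
    hits = sum(1 for t, p in zip(Y_true, Y_pred) if list(t) == list(p))
    return hits / len(Y_true)
```

For example, predicting [1, 0] and [1, 1] against true label sets [1, 0] and [0, 1] gives a global accuracy of 0.5, even though 3 of the 4 individual labels are correct.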
5.4.2. Results
Table 4 reports the outputs of H2PC and MMHC; Table 5 shows the global accuracy of each method on the 10 data sets. Table 6 reports the running time and the average node degree of the labels in the DAGs obtained with both methods. Figures 3 up to 7 display graphically the local DAG structures around the labels obtained with H2PC and MMHC, for illustration purposes.
Several conclusions may be drawn from these experiments. First, we may observe by inspection of the average degree of the label nodes in Table 6 that several DAGs are densely connected, like scene or bibtex, while others are rather sparse, like genbase, medical, and corel5k. The DAGs displayed in Figures 3 up to 7 lend themselves to interpretation. They can be used for encoding as well as portraying the conditional independencies, and the d-separation criterion can be used to read them off the graph. Many label powersets that are reported in Table 4 can be identified by graphical inspection of the DAGs. For instance, the two label powersets in yeast (Figure 3, bottom plot) are clearly noticeable in both DAGs. Clearly, BNs have a number of advantages over alternative methods. They lay bare useful information about the label dependencies, which is crucial if one is interested in gaining an understanding of the underlying domain. It is, however, well beyond the scope of this paper to delve deeper into the DAG interpretation. Overall, it appears that the structures recovered by H2PC are significantly denser compared to MMHC. On bibtex and enron, the increase of the average label degree is the most spectacular. On bibtex (resp. enron), the average label degree has risen from 2.6 to 6.4 (resp. from 1.2 to 3.3). This result is in nice agreement with the experiments in the previous section, as H2PC was shown to consistently reduce the rate of false negative edges with respect to MMHC (at the cost of a slightly higher false discovery rate).
Second, Table 4 is very instructive as it reveals that on emotions, image, and scene, the MLP approach boils down to the LP scheme. This is easily seen as there is only one minimal label powerset extracted on average.
In contrast, on genbase the MLP approach boils down to the simple BR scheme. This is confirmed by inspecting Table 5: the performances of MMHC and H2PC (without feature selection) are equal. An interesting observation upon looking at the distribution of the label powerset sizes shows that, for the remaining data sets, MLP mostly decomposes the label set in two parts: one on which it performs BR and the other on which it performs LP. Take for instance enron: it can be seen from Table 4 that there are approximately 23 label singletons and a powerset of 30 labels with H2PC, for a total of 53 labels. The gain in performance with MLP over BR, our baseline learner, can be ascribed to the quality of the label powerset decomposition, as BR ignores label dependencies. As expected, the results using MLP clearly dominate those obtained using BR on all data sets except genbase. The DAGs display the minimal label powersets and their relevant features.
Third, H2PC compares favorably to MMHC. On scene for instance, the accuracy of MLP+MB has risen from 20% (with MMHC) to 56% (with H2PC). On yeast, it has risen from 7% to 23%. Without feature selection, the difference in global accuracy is less pronounced, but still in favor of H2PC. A Wilcoxon signed rank paired test reveals statistically significant improvements of H2PC over MMHC in the MLP approach without feature selection (p < 0.02). This trend is more pronounced when the feature selection is used (p < 0.001), using MLP+MB. Note, however, that the LP decompositions for H2PC and MMHC on yeast and image are strictly identical, hence the same accuracy values in Table 5 in the MLP column.
Fourth, regarding the utility of the feature selection, it is difficult to reach any conclusion. Whether BR or MLP is used, the use of the selected features as inputs to the classification model is not shown to be greatly beneficial in terms of global accuracy on average. The performance of BR and our MLP with all the features outperforms that with the selected features on 6 data sets, but the feature selection leads to actual improvements on 3 data sets. The difference in accuracy with and without the feature selection was not shown to be statistically significant (p > 0.20 with a Wilcoxon signed rank paired test). Surprisingly, the feature selection did exceptionally well on genbase. On this data set, the increase in accuracy is the most impressive: it rose from 7% to 98%, which is atypical. The dramatic increase in accuracy on genbase is due solely to the restricted feature set as input to the classification model. This is also observed on medical, to a lesser extent though. Interestingly, on large and densely connected networks (e.g., bibtex, slashdot and corel5k), the feature selection performed very well in terms of global accuracy and significantly reduced the input space, which is noticeable. On emotions, yeast, image, scene and genbase, the method reduced the feature set down to nearly 1/100 of its original size. The feature selection should be evaluated in view of its effectiveness at balancing the increasing error and the decreasing computational burden by drastically reducing the feature space. We should also keep in mind that the feature relevance cannot be defined independently of the learner and the model-performance metric (e.g., the loss function used). Admittedly, our feature selection based on the Markov boundaries is not necessarily optimal for the base MLC learner used here, namely the Random Forest model.
As far as the overall running time performance is concerned, we see from Table 6 that, for both methods, the running time grows somewhat exponentially with the size of the Markov boundary and the number of features, hence the considerable rise in total running time on bibtex. H2PC takes almost 200 times longer on bibtex (1826 variables) and enron (1001 variables), which is quite considerable but still affordable (44 hours of single-CPU time on bibtex and 13 hours on enron). We also observe that the size of the parent set with H2PC is 2.5 (on bibtex) and 3.6 (on enron) times larger than that of MMHC (which may hurt interpretability). In fact, the running time is known to increase exponentially with the parent set size of the true underlying network. This is mainly due to the computational overhead of the greedy search-and-score procedure with larger parent sets, which is the most promising part to optimize in terms of computational gains, as we discuss in the Conclusion.
6. Discussion & practical applications
Our prime conclusion is that H2PC is a promising approach to constructing BN global or local structures around specific nodes of interest, with potentially thousands of variables. Concentrating on higher recall values while keeping the false positive rate as low as possible pays off in terms of goodness of fit and structure accuracy. Historically, the main practical difficulty in the application of BN structure discovery approaches has been a lack of suitable computing resources and relevant accessible software. The number of variables which can be included in exact BN analysis is still limited. As a guide, this might be less than about 40 variables for exact structural search techniques (Perrier et al., 2008; Kojima et al., 2010). In contrast, constraint-based heuristics like the one presented in this study are capable of processing many thousands of features within hours on a
Table 4: Distribution of the number and the size of the minimal label powersets output by H2PC (top) and MMHC (bottom). On each data set: mean number of powersets, minimum/median/maximum number of labels per powerset, and minimum/median/maximum number of distinct classes per powerset. The total number of labels L(D) and of distinct label combinations DL(D) is recalled in parentheses for convenience.

H2PC
data set    # LPs         # labels / LP         # classes / LP
                          min/med/max (L(D))    min/med/max (DL(D))
emotions     1.0 ± 0.0    6 / 6 / 6    (6)      26 / 27 / 27   (27)
yeast        2.0 ± 0.0    2 / 7 / 12   (14)     3 / 64 / 135   (198)
image        1.0 ± 0.0    5 / 5 / 5    (5)      19 / 20 / 20   (20)
scene        1.0 ± 0.0    6 / 6 / 6    (6)      14 / 15 / 15   (15)
slashdot     9.5 ± 0.8    1 / 1 / 14   (22)     1 / 2 / 111    (156)
genbase     26.9 ± 0.3    1 / 1 / 2    (27)     1 / 2 / 2      (32)
medical     38.6 ± 1.8    1 / 1 / 6    (45)     1 / 2 / 10     (94)
enron       24.5 ± 1.9    1 / 1 / 30   (53)     1 / 2 / 334    (753)
bibtex      39.6 ± 4.3    1 / 1 / 112  (159)    2 / 2 / 1588   (2856)
corel5k     97.7 ± 3.9    1 / 1 / 263  (374)    1 / 2 / 2749   (3175)

MMHC
data set    # LPs         # labels / LP         # classes / LP
                          min/med/max (L(D))    min/med/max (DL(D))
emotions     1.2 ± 0.4    1 / 6 / 6    (6)      2 / 26 / 27    (27)
yeast        2.0 ± 0.0    1 / 7 / 13   (14)     2 / 89 / 186   (198)
image        1.0 ± 0.0    5 / 5 / 5    (5)      19 / 20 / 20   (20)
scene        1.0 ± 0.0    6 / 6 / 6    (6)      14 / 15 / 15   (15)
slashdot    15.1 ± 0.6    1 / 1 / 9    (22)     1 / 2 / 46     (156)
genbase     27.0 ± 0.0    1 / 1 / 1    (27)     1 / 2 / 2      (32)
medical     41.7 ± 1.4    1 / 1 / 4    (45)     1 / 2 / 7      (94)
enron       36.8 ± 2.7    1 / 1 / 12   (53)     1 / 2 / 119    (753)
bibtex      77.2 ± 3.1    1 / 1 / 28   (159)    2 / 2 / 233    (2856)
corel5k    165.3 ± 5.5    1 / 1 / 177  (374)    1 / 2 / 2294   (3175)
Table 5: Global classification accuracies using 4 learning methods (BR, MLP, BR+MB, MLP+MB) based on the DAG obtained with H2PC and MMHC.

data set    BR      MLP             BR+MB           MLP+MB
                    MMHC   H2PC     MMHC   H2PC     MMHC   H2PC
emotions   0.327   0.371  0.391    0.147  0.266    0.189  0.285
yeast      0.169   0.273  0.271    0.062  0.155    0.073  0.233
image      0.342   0.504  0.504    0.227  0.252    0.308  0.360
scene      0.580   0.733  0.733    0.123  0.361    0.199  0.565
slashdot   0.355   0.419  0.474    0.274  0.373    0.278  0.439
genbase    0.069   0.069  0.069    0.974  0.976    0.974  0.976
medical    0.454   0.499  0.502    0.630  0.651    0.636  0.669
enron      0.134   0.165  0.180    0.024  0.079    0.079  0.157
bibtex     0.108   0.115  0.167    0.120  0.133    0.121  0.177
corel5k    0.002   0.030  0.043    0.000  0.002    0.010  0.034
Table 6: DAG learning time (in seconds) and average degree of the label nodes.

data set     MMHC                    H2PC
             time   label degree     time       label degree
emotions      1.5    2.6 ± 1.1         16.2      4.7 ± 1.8
yeast         9.0    3.0 ± 1.6         90.9      5.0 ± 2.9
image         3.1    4.0 ± 0.8         28.4      4.6 ± 1.0
scene         9.9    3.1 ± 1.0        662.9      6.6 ± 1.8
slashdot     44.8    2.7 ± 1.4        822.3      4.0 ± 5.2
genbase      12.5    1.0 ± 0.3         12.4      1.1 ± 0.3
medical      43.8    1.7 ± 1.1        334.8      2.9 ± 2.4
enron       199.8    1.2 ± 1.5      47863.0      4.3 ± 5.2
bibtex      960.8    2.6 ± 1.3     159469.6      6.4 ± 10.3
corel5k     169.6    1.5 ± 1.4       2513.0      2.9 ± 2.8
personal computer, while maintaining a very high structural accuracy. H2PC and MMHC could potentially handle up to 10,000 labels in a few days of single-CPU time, and far less by parallelizing the skeleton identification algorithm as discussed in (Tsamardinos et al., 2006; Villanueva & Maciel, 2014).
The advantages in terms of structure accuracy and the ability to scale to thousands of variables open up many avenues of future applications of H2PC in various domains, as we shall discuss next. For example, BNs have proven to be especially useful abstractions in computational biology (Nagarajan et al., 2013; Scutari & Nagarajan, 2013; Prestat et al., 2013; Aussem et al., 2012, 2010; Pena, 2008; Pena et al., 2005). Identifying the gene network is crucial for understanding the behavior of the cell, which, in turn, can lead to better diagnosis and treatment of diseases. This is also of great importance for characterizing the function of genes and the proteins they encode in determining traits, psychology, or the development of an organism. Genome sequencing uses high-throughput techniques like DNA microarrays, proteomics, metabolomics and mutation analysis to describe the function and interactions of thousands of genes (Zhang & Zhou, 2006). Learning BN models of gene networks from these huge data sets is still a difficult task (Badea, 2004; Bernard & Hartemink, 2005; Friedman et al., 1999a; Ott et al., 2004; Peer et al., 2001). In these studies, the authors had to decide in advance which genes were included in the learning process (in all cases fewer than 1000) and which genes were excluded from it (Pena, 2008). H2PC overcomes this problem by focusing the search around a targeted gene: the key step is the identification of the vicinity of a node X (Pena et al., 2005).
Our second objective in this study was to demonstrate the potential utility of hybrid BN structure discovery for multi-label learning. In multi-label data, where many inter-dependencies between the labels may be present, explicitly modeling all relationships between the labels is intuitively far more reasonable (as demonstrated in our experiments). BNs explicitly account for such inter-dependencies, and the DAG allows us to identify an optimal set of predictors for every label powerset. The experiments presented here support the conclusion that local structural learning in the form of local neighborhood induction and Markov blanket discovery is a theoretically well-motivated approach that can serve as a powerful learning framework for label dependency analysis geared toward multi-label learning. Multi-label scenarios are found in many application domains, such as multimedia annotation (Snoek et al., 2006; Trohidis et al., 2008), tag recommendation, text categorization (McCallum, 1999; Zhang & Zhou, 2006), protein function classification (Roth & Fischer, 2007), and antiretroviral drug categorization (Borchani et al., 2013).
7. Conclusion & avenues for future research
We first discussed a hybrid algorithm for global or local (around target nodes) BN structure learning called Hybrid HPC (H2PC). Our extensive experiments showed that H2PC outperforms the state-of-the-art MMHC by a significant margin in terms of edge recall without sacrificing the number of extra edges, which is crucial for the soundness of the super-structure used during the second stage of hybrid methods, like the ones proposed in (Perrier et al., 2008; Kojima et al., 2010). The code of H2PC is open-source and publicly available online at https://github.com/madbix/bnlearn-clone-3.4. Second, we discussed an application of H2PC to the multi-label learning problem, which is a challenging problem in many real-world application domains. We established theoretical results, under the faithfulness condition, in order to characterize graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution, and their respective Markov boundaries. As far as we know, this is the first investigation of Markov boundary principles for the optimal variable/feature selection problem in multi-label learning. These formal results offer a simple guideline to characterize graphically: i) the minimal label powerset decomposition, i.e., into minimal subsets YLP ⊆ Y such that YLP ⊥⊥ Y \ YLP | X; and ii) the minimal subset of features, w.r.t. an Information Theory criterion, needed to predict each label powerset, thereby reducing the input space and the computational burden of multi-label classification. The theoretical analysis laid the foundation for a practical multi-label classification procedure. Another set of experiments was carried out on a broad range of multi-label data sets from different problem domains to demonstrate its effectiveness. H2PC was shown to outperform MMHC by a significant margin in terms of global accuracy.
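The decomposition described above can be turned into code along the following lines. This is only a sketch of our reading of the procedure: `powersets`, `mb_features` and `make_clf` are hypothetical inputs standing for the minimal label powersets, their Markov-boundary feature subsets, and any base multi-class learner with a fit/predict interface.

```python
def multilabel_fit_predict(X_train, Y_train, X_test, powersets, mb_features, make_clf):
    """Label-powerset decomposition classifier (sketch).

    powersets:   list of minimal label powersets (disjoint groups of label names)
    mb_features: for each powerset, the feature names in its Markov boundary
    make_clf:    factory for a multi-class classifier exposing fit/predict
    Each powerset is encoded as a single multi-class variable (the joint label
    configuration), predicted from its own reduced feature set, then decoded.
    """
    preds = {}
    for group, feats in zip(powersets, mb_features):
        # encode the joint configuration of this label group as one class
        y = [tuple(row[l] for l in group) for row in Y_train]
        clf = make_clf()
        clf.fit([[row[f] for f in feats] for row in X_train], y)
        yhat = clf.predict([[row[f] for f in feats] for row in X_test])
        for j, l in enumerate(group):  # decode back to individual labels
            preds[l] = [t[j] for t in yhat]
    return preds
```

Restricting each multi-class problem to the Markov-boundary features of its powerset is what reduces the input space, per point ii) above.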
We suggest several avenues for future research. As far as BN structure learning is concerned, future work will aim at: 1) ascertaining which independence test (e.g., tests targeting specific distributions, employing parametric assumptions, etc.) is most suited to the data at hand (Tsamardinos & Borboudakis, 2010; Scutari, 2011); 2) controlling the false discovery rate of the edges in the graph output by H2PC (Pena, 2008), especially when dealing with more nodes than samples, e.g., learning gene networks from gene expression data. In this study, H2PC was run independently on each node without keeping track of the dependencies found previously. This led to some loss of efficiency due to redundant calculations. The optimization of the H2PC code is currently being undertaken to lower the computational cost while maintaining its performance. These optimizations will include the use of a cache to store the (in)dependencies and the use of a global structure. Other interesting research avenues to explore are extensions and modifications to the greedy search-and-score procedure, which is the most promising part to optimize in terms of computational gains. Regarding the multi-label learning problem, experiments with several thousands of labels are currently being conducted and will be reported in due course. We also intend to work on relaxing the faithfulness assumption and derive a practical multi-label classification algorithm based on H2PC that is correct under milder assumptions on the joint distribution (e.g., Composition, Intersection). This needs further substantiation through more analysis.
Acknowledgments
The authors thank Marco Scutari for sharing his bnlearn package in R. The research was also supported by grants from the European ENIAC Joint Undertaking (INTEGRATE project) and from the French Rhone-Alpes Complex Systems Institute (IXXI).
Appendix
Lemma 8. ∀ YLP_i, YLP_j ∈ LP, YLP_i ∩ YLP_j ∈ LP.

Proof. To keep the notation uncluttered and for the sake of simplicity, consider a partition {Y1, Y2, Y3, Y4} of Y such that YLP_i = Y1 ∪ Y2 and YLP_j = Y2 ∪ Y3. From the label powerset assumption for YLP_i and YLP_j, we have (Y1 ∪ Y2) ⊥⊥ (Y3 ∪ Y4) | X and (Y2 ∪ Y3) ⊥⊥ (Y1 ∪ Y4) | X. From the Decomposition property, we have that Y2 ⊥⊥ (Y3 ∪ Y4) | X. Using the Weak union property, we obtain that Y2 ⊥⊥ Y1 | (Y3 ∪ Y4 ∪ X). From these two facts, we can use the Contraction property to show that Y2 ⊥⊥ (Y1 ∪ Y3 ∪ Y4) | X. Therefore, Y2 = YLP_i ∩ YLP_j is a label powerset by definition.
Lemma 9. Let Yi and Yj denote two distinct labels in Y, and denote by YLP_i and YLP_j their respective minimal label powersets. Then we have

∃ Z ⊆ Y \ {Yi, Yj}, {Yi} ⊥̸⊥ {Yj} | (X ∪ Z) ⇒ YLP_i = YLP_j.

Proof. Let us suppose YLP_i ≠ YLP_j. By the label powerset definition for YLP_i, we have YLP_i ⊥⊥ Y \ YLP_i | X. As YLP_i ∩ YLP_j = ∅ owing to Lemma 8, we have that Yj ∉ YLP_i. Any Z ⊆ Y \ {Yi, Yj} can be decomposed as Zi ∪ Zj such that Zi = Z ∩ (YLP_i \ {Yi}) and Zj = Z \ Zi. Using the Decomposition property, we have that ({Yi} ∪ Zi) ⊥⊥ ({Yj} ∪ Zj) | X. Using the Weak union property, we have that {Yi} ⊥⊥ {Yj} | (X ∪ Z). As this is true ∀ Z ⊆ Y \ {Yi, Yj}, there exists no Z ⊆ Y \ {Yi, Yj} such that {Yi} ⊥̸⊥ {Yj} | (X ∪ Z), which completes the proof by contraposition.
Lemma 10. Consider YLP a minimal label powerset. Then, if p satisfies the Composition property, there exists no nonempty proper subset Z ⊂ YLP such that Z ⊥⊥ YLP \ Z | X.

Proof. By contradiction, suppose such a nonempty Z exists, with Z ⊥⊥ YLP \ Z | X. From the label powerset assumption for YLP, we have that YLP ⊥⊥ Y \ YLP | X. From these two facts, we can use the Composition property to show that Z ⊥⊥ Y \ Z | X, which contradicts the minimality of the label powerset YLP. This concludes the proof.
Theorem 7. Suppose p is faithful to a DAG G. Then, Yi and Yj belong to the same minimal label powerset if and only if there exists an undirected path in G between nodes Yi and Yj such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e., a collider of the form Yp → X ← Yq).

Proof. Suppose such a path exists. By conditioning on all the intermediate colliders Yk in Y of the form Yp → Yk ← Yq along this path, we ensure that ∃ Z ⊆ Y \ {Yi, Yj} such that Yi and Yj are d-connected given X ∪ Z. Due to the faithfulness, this is equivalent to {Yi} ⊥̸⊥ {Yj} | (X ∪ Z). From Lemma 9, we conclude that Yi and Yj belong to the same minimal label powerset.

To show the converse, note that owing to Lemma 10, there exists no partition {Zi, Zj} of Z such that ({Yi} ∪ Zi) ⊥⊥ ({Yj} ∪ Zj) | X. Due to the faithfulness, there exists at least one edge between ({Yi} ∪ Zi) and ({Yj} ∪ Zj) in the DAG; hence, by recursion, there exists a path between Yi and Yj such that all intermediate nodes Z are either (i) Z ∈ Y, or (ii) Z ∈ X and Z has two parents in Y (i.e., a collider of the form Yp → X ← Yq).
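Theorem 7 suggests a simple graphical procedure for recovering the minimal label powersets from a learned DAG: link two labels whenever they are adjacent in G or share a common child in X, and take the connected components (any admissible path decomposes into such one- or two-step segments between consecutive labels). The sketch below illustrates this reading of the theorem; the dict-of-children representation and function names are our own, not the authors' code.

```python
from itertools import combinations

def minimal_label_powersets(children, labels):
    """Connected components of the auxiliary undirected graph over the labels.

    children: dict mapping each node to the set of its children in the DAG G.
    labels:   set of label nodes Y; every other node is treated as a feature.

    Per Theorem 7, two labels Yi, Yj are linked iff they are adjacent in G or
    have a common child in X (a collider of the form Yi -> X <- Yj).
    """
    def linked(yi, yj):
        if yj in children.get(yi, set()) or yi in children.get(yj, set()):
            return True  # adjacent in G
        common = children.get(yi, set()) & children.get(yj, set())
        return any(x not in labels for x in common)  # common collider in X

    # Union-find over the labels
    parent = {y: y for y in labels}
    def find(y):
        while parent[y] != y:
            parent[y] = parent[parent[y]]
            y = parent[y]
        return y
    for yi, yj in combinations(labels, 2):
        if linked(yi, yj):
            parent[find(yi)] = find(yj)

    comps = {}
    for y in labels:
        comps.setdefault(find(y), set()).add(y)
    return list(comps.values())
```

A common child that is itself a label needs no special case: it appears as an intermediate node of the component via its own adjacency edges.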
Theorem 8. Suppose p is faithful to a DAG G. Let YLP = {Y1, Y2, . . . , Yn} be a label powerset. Then, its Markov boundary M in U is also its Markov boundary in X, and is given in G by M = ⋃_{j=1}^{n} {PC_{Yj} ∪ SP_{Yj}} \ Y.

Proof. First, we prove that M is a Markov boundary of YLP in U. Define Mi as the Markov boundary of Yi in U, and M′i = Mi \ Y. From Theorem 5, Mi is given in G by Mi = PC_{Yi} ∪ SP_{Yi}. We may now prove that M′1 ∪ · · · ∪ M′n is a Markov boundary of {Y1, Y2, . . . , Yn} in U. We show first that the statement holds for n = 2 and then conclude that it holds for all n by induction. Let W denote U \ (Y1 ∪ Y2 ∪ M′1 ∪ M′2), and define Y1 = {Y1} and Y2 = {Y2}. From the Markov blanket assumption for Y1 we have Y1 ⊥⊥ U \ (Y1 ∪ M1) | M1. Using the Weak union property, we obtain that Y1 ⊥⊥ W | M′1 ∪ M′2 ∪ Y2. Similarly we can derive Y2 ⊥⊥ W | M′2 ∪ M′1 ∪ Y1. Combining these two statements yields Y1 ∪ Y2 ⊥⊥ W | M′1 ∪ M′2 due to the Intersection property. Letting M = M′1 ∪ M′2, the last expression can be formulated as Y1 ∪ Y2 ⊥⊥ U \ (Y1 ∪ Y2 ∪ M) | M, which is the definition of a Markov blanket of Y1 ∪ Y2 in U. We shall now prove that M is minimal. Let us suppose that it is not the case, i.e., ∃ Z ⊂ M such that Y1 ∪ Y2 ⊥⊥ Z ∪ (U \ (Y1 ∪ Y2 ∪ M)) | M \ Z. Define Z1 = Z ∩ M1 and Z2 = Z ∩ M2; we may apply the Weak union property to get Y1 ⊥⊥ Z1 | (M \ Z) ∪ Y2 ∪ Z2 ∪ (U \ (Y ∪ M)), which can be rewritten more compactly as Y1 ⊥⊥ Z1 | (M1 \ Z1) ∪ (U \ (Y1 ∪ M1)). From the Markov blanket assumption, we have Y1 ⊥⊥ U \ (Y1 ∪ M1) | M1. We may now apply the Intersection property on these two statements to obtain Y1 ⊥⊥ Z1 ∪ (U \ (Y1 ∪ M1)) | M1 \ Z1. Similarly, we can derive Y2 ⊥⊥ Z2 ∪ (U \ (Y2 ∪ M2)) | M2 \ Z2. From the Markov boundary assumption of M1 and M2, we necessarily have Z1 = ∅ and Z2 = ∅, which in turn yields Z = ∅. To conclude for any n > 2, it suffices to set Y1 = ⋃_{j=1}^{n−1} {Yj} and Y2 = {Yn} to conclude by induction.

Second, we prove that M is a Markov blanket of YLP in X. Define Z = M ∩ Y. From the label powerset definition, we have YLP ⊥⊥ Y \ YLP | X. Using the Weak union property, we obtain YLP ⊥⊥ Z | U \ (YLP ∪ Z), which can be reformulated as YLP ⊥⊥ Z | (U \ (YLP ∪ M)) ∪ (M \ Z). Now, the Markov blanket assumption for YLP in U yields YLP ⊥⊥ U \ (YLP ∪ M) | M, which can be rewritten as YLP ⊥⊥ U \ (YLP ∪ M) | (M \ Z) ∪ Z. From the Intersection property, we get YLP ⊥⊥ Z ∪ (U \ (YLP ∪ M)) | M \ Z. From the Markov boundary assumption of M in U, we know that there exists no proper subset of M which satisfies this statement, and therefore Z = M ∩ Y = ∅. From the Markov blanket assumption of M in U, we have YLP ⊥⊥ U \ (YLP ∪ M) | M. Using the Decomposition property, we obtain YLP ⊥⊥ X \ M | M, which, together with the fact M ∩ Y = ∅, is the definition of a Markov blanket of YLP in X.

Finally, we prove that M is a Markov boundary of YLP in X. Let us suppose that it is not the case, i.e., ∃ Z ⊂ M such that YLP ⊥⊥ Z ∪ (X \ M) | M \ Z. From the label powerset assumption for YLP, we have YLP ⊥⊥ Y \ YLP | X, which can be rewritten as YLP ⊥⊥ Y \ YLP | (X \ M) ∪ (M \ Z) ∪ Z. Due to the Contraction property, combining these two statements yields YLP ⊥⊥ Z ∪ (U \ (M ∪ YLP)) | M \ Z. From the Markov boundary assumption of M in U, we necessarily have Z = ∅, which suffices to prove that M is a Markov boundary of YLP in X.
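Theorem 8 can be read directly off the DAG: the Markov boundary of a label powerset is the union, over its labels, of parents, children and spouses (co-parents of common children), minus the labels themselves. A minimal sketch under that reading (the dict-of-children representation and names are ours, not the authors' code):

```python
def markov_boundary(children, powerset):
    """M = union over Yj in the powerset of PC_{Yj} ∪ SP_{Yj}, minus the labels.

    children: dict mapping each node to the set of its children in the DAG G.
    powerset: the label powerset YLP whose Markov boundary is sought.
    """
    def parents(v):
        return {u for u, ch in children.items() if v in ch}

    m = set()
    for y in powerset:
        pc = parents(y) | children.get(y, set())  # parents and children of y
        sp = set()
        for c in children.get(y, set()):          # spouses: the other parents
            sp |= parents(c) - {y}                # of y's children
        m |= pc | sp
    return m - set(powerset)
```

Note that labels of the powerset that happen to be parents, children or spouses of one another are removed by the final set difference, matching the "\ Y" term in the theorem.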
Figure 3: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Emotions and Yeast.
Figure 4: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Image and Scene.
Figure 5: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Slashdot and Genbase.
Figure 6: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Medical and Enron.
movement
fishfighting
betta
neural
psychology
2004
behavior
display
male
calculations
molecular
energy
functional
density
could
science
carbon
employed
second
self
series
approximate
calculate
spin
acid
yield
species
coupled
gaussian
experiment
contributions
comparison
predicted
chem
set
x
treatment
relative
studied
interactions
site
level
active
structures
consistent
been
obtained
transfer
carried
frequencies
ion
geometry
compared
approximation
theoretical
found
activation
gas
sets
optimized
potentials catalytic
perturbation
complexes
study
accurate
computed
transition
effects
correlation6
computational
electron
molecule
agreement
charge
structure
experimental
exchange
calculation
electrostatic
methods
combined
barrier
reactions
atoms
method
hydrogen
mol
basis
potential
american
electronic
calculated solvent
reaction
molecules
c
bond
mechanical
dft
hybrid
qm
energies
quantum
journal
chemical
chemistry
initio
networks
physical
children
child
numbers
classification
text
categories
clustering
constraints
cognitive
collaborative
collaboration
shared
accuracy
community communities
entities
link
enzyme
competition
labeled
complex
complexity
compounds
proceedings
interpretation
target
similarity
semantic
words
language
word
computer
school
computing
la
concept
conceptual
context
aware
contexts
cortex
input
critical
data
genome
automation
consumption
power
average
technique
reduction
multi
policy
date
memory
reducing
dynamic
performance
embedded
design
creating
development
diffusion
ontology
hierarchical
assessment
identity
2006
basic
developmental
dynamics
physics
phys
exhibits
education
educational
learning
und
e
electrochemical
electrode
limits
polymer
current
direct
signal
vs
analytical
modified
range
oxidation
sensor
immobilized
immunoassay
antibody
linked
empirical
software
arise
engineering
proteins
identification
distinct
located
specific
fragments
expression
equation
evaluation
evaluating
a
evolution
requirements
evolutionary
changes
change
evolving
maintenance
web
share
tagging
formal
games
game
velocity
homogeneous
media
graph
interfaces
addition
separation
imaging
brain
medical
directions
visualization
image
multiple
images
matter
procedure
4
min
developed
assays
sensitivity
ml
antibodies
detection
assay
response
information
entropy
retrieval
mathematics
computers
teaching
constant
kinetic
antigen
knowledge
management
the
experiences
personal
languages
teachers
membrane
technology
was
access
devices
health
logic
description
reasoning
mapping
n
u
der
mathematical
references
culture
perspective
practices
environments
designing
methodology
metrics
model
models
modeling
international
modelling
proc
generation
06 network
equilibrium
extraction
documents
object
oriented
and
ontologies
to
pattern
patterns
phase
surface
process
programming
programs
hypermedia
course
standards
adaptation
adaptive
were
argued
novel
experiments
combination
principle
random
book
rdf
graphs
representation
representations
illustrated
re
specification
research
review
reviewed
synthesis
cause
enzymes
scientific
search
manner
semantics
sequence
partial
sequences
simulation
simulations
social
latent
probabilistic
source
statistics
xxiii
survey
system
systems
library
users
communication
adapted
theory
biological
dna
cells
financial
optimization
driven
formation
surfaces
temporal
growth
liquid
gap
glass
visual
views
exploration
graphical
w
0
workshop
background
first
p
02
associated
objects
vision
enabling
meaning
machine
automatically
discovery
training
situations
tools
experience
practice
tasks
activities
learned
learn
students
relevant
98
institute
statistical
letters
degrees
curves
chosen
conducted
prototype
fully
smaller
uniform
automated
similar
benefits
possibility
goal
discrete
expected
region
equal
connected
expansion
depend
considered
leads
content
strongly
short
origin
linear
introduced
corresponding
particular
shown
chains
induced
wave
bulk
forces
motion
stationary
force
local
coupling
observed
square
parameters
generalized
interacting
relaxation
also
symmetry
dependence
states
exact
particle
thermodynamic
equations
point
mean
are
scaling
field
numerical
finite
techniques
on
vector
we
proposed
problem
efficient
based
bioinformatics
produced
determination
this
developers
group
mediated
differential
an
flow
substrate
by
comparative
analyze
revealed
analyzing
analyses
is
of
that
combiningstability
improved
elements
domainsinfluence
components
importance
characteristics
highly
main
processes
four
further
significant
defined
among
including
applications
under
several
three
properties
s
its
other
different
both
between
for
daily
age
conclusion
clinical
creation
anti
directly
determine
capture
mixture
features
greater
detect
free
50
using
against
j
concentrations
bound
concentration
affinity
binding
recognition
directed
cell
previously
expressed
role
thermal
non
action
law
annual
rate
values
overall
identical
contrast
20
normal
amino
assessed
low
respectively
res
v
view
close
g
1noise
abstract
2005
lettour
matching
approaches
coefficients
coefficient
variations
national
genetic
genes
than
results
interact
fraction
charged
fluid
sample
as
subject
risk
individuals
major
beta
run
reference
those
l
factors
interaction
activity
extent
biol
profile
40
levels
total
amounts
composition
very
rich
contained
enhanced
modification
had
decreased
recognized
increased
receptor
paper
built
execution
platform
infrastructure
reveals
turn
kinds
namely
characterization
relatively
light
complete
across
lack
little
involved
various
space
out
terms
large
about
present
or
who
his
1998
least
european
instance
transactions
procedures
allows
show
control
behavioursiamese
splendens
processing
rev
behaviors
differ
24
pairs
trans
performed
base
hard
shift
weight
small
structural
atomic
mechanics
water
minimumground
state
improvement
mixed
iii
dependent
whereas
double
gradient
be
notes
progress
order
two
organized
distribution
magnetic
25
side
decreasepathway
acids
ions
cluster
functions
contribution
their
predict
prediction
predictions
fit
y 2
10
alpha
theoretically
effect
attractive
sites
lattice
residues
high
at
mechanism
with
has
have
implemented
identified
far
past
however
recently
frequency
configuration
modeled
intelligence
framework
no
quality
formed
case
qualitative
character
findings
intermediate
occurs
transitions
temperature
r
strong
parameter
good
correlated
correlations
5
7
8
18
11
12
3
restricted
computation
cost
excellent
experimentally
full
strength
boltzmann
applicable
reliable
best
reverse
kinetics
realistic
one
extendedconventional
estimate
applied
measurement
estimated
society
solution
combines
chain
indicates
intrinsic
involving
path
step
reserved
h
2003
2000
1997
o
d
systematic
mm
environment
give
body
treated
fock
classical
studies
obtain
tested
ab
nodes
emergence
links
degree
scale
topology
young
early
year
skills
years
play
care
developing
number
probabilityopportunities
document
articles
clusters
composed
parallel
controlled
selected
actual
line
positive
resulting
during
product
standard
availablefinally
variety
presented
single
within
function
simple
when
into
new
time
strategies
arguetheories
user
virtual
impact
technical
functionality
decisions
goals
building
project
designed
principles
implementation
co
supporting
online
success
members
errors
participants
relationships
relations
ph
determined
reduce
symposium
2002
joint
conditions
acm www
definition
higher
measure
measures
not
natural
defining
acquisition
written
query
xml
decision
sense
supported
reports
ieee
internet
mobile
de
des
descriptions
concepts
rapidly
situation
there
left
area
period
layer
responses
maps
primary
orientation
output
arbitrary
exponents
from
prior
database
sources
collected
databases
whole
achieve
strategy
off
i
sizes
error
distance
agent
rather
test
comprehensive
storage
static
task
precision
better
efficiency
created
challenges
create
successful
develop
volume
mobility
domain
enable
assess
reliability
choice
examined
ability
dynamicalsimulated
fluctuations
regime
2001
1999
while
settings
die
dem
auch
sich
eine
sind
f
ist
mit
zur
als
den
auf
im
zu
von
limit
some
future
art
trends
currently
technologies
30
wide
long
broad monitoring
contact
measurements
resonance
onto
advantages
sensitive
detected
evidence
identify
industrial
reuse
program
managing
projects
code
conf
testing
changing
supports
needs
discuss
section
integrated
concernssupport
each
derived
begin
evaluated
evaluate
criteria
researchers
every
biology
selection
events
world
pages
topic
http
usage
intelligent
tool
sharing navigation
resources
services
they
generated
popular
look
video
market
others
heterogeneous
multimedia
medium
interface
specificity
resolution
areas
regions
since
interactive
seenonlinear
17
used
protocoldevice
quantitative
cross
solid
100
achieved
rapid
described
format
being
types
enables
existing
organizational
provides
people
provide
production
maximum
indexing
life
historical
ideas
solving
apparent
offers
measuring
amount
maintaining
integrating
making
construction
representing
providing
know
service
organizations
business
al
general
over
regardingmight
technological
innovation
what
digital
demonstrated
indicated
investigated
did
showed
conclusions
describing
already
parts
roles
map
such
sci
m
cultural
distributions
these
unified
int
next
generating
automatic
presentation
resource
distributed
connections
global
aspect
applying
family
establish
extend
consequences
construct
entire
facilitate
meaningful
allowing
secondary
comparable
completely
center
easy
perform
generate
constructed
interesting
generally
extensive
itself
required
appear
leading
furthermore
established
additional
upon
points
groups
limited
alternative
rates
likely
requires
increasing
examine
following
mechanismsuses
provided
initial
improve
together
key
individual
become
investigate
respect
lead
value
increase
must
allow
need
therefore
made
type
understand
according
effective
application
make
without
thus
up
able
way
them
most
related
alldue
use
but
more
it
which
many
will
you
clear
right
work
shows
like
find
phenomena
diagramphases
continuous
etc
included
day
60
suggesting
subsequent
can
analyzed
affected
15
five
seven
either
differences
less
having
special
answer
issues
foundation
academic
needed
papers
focuses
topics
discusses
efforts
emerging
questions
overviewunderstanding
how
developments
literature
occur
engine
relevance
queries
now
explicit
define
length
peptide
monte
behavioral
public
kind
political
organization
sciences
class
advances
libraries
communications
discussion
bases
markets
spatial
relation
growing
crystal
materials
attention
exploring
explore
clearly
13
where
lower
t
9
towards
then
(a) Bibtex
(b) Corel5k
Figure 7: The local BN structures learned by MMHC (left plot) and H2PC (right plot) on a single cross-validation split, on Bibtex and Corel5k.
References
Agresti, A. (2002). Categorical Data Analysis. Wiley, 2nd edition.Aliferis, C. F., Statnikov, A. R., Tsamardinos, I., Mani, S., & Kout-
soukos, X. D. (2010). Local causal and markov blanket inductionfor causal discovery and feature selection for classification part i:Algorithms and empirical evaluation. Journal of Machine Learn-
ing Research, JMLR, 11, 171–234.Alvares-Cherman, E., Metz, J., & Monard, M. (2011). Incorporating
label dependency into the binary relevance framework for multi-label classification. Expert Systems With Applications, ESWA, 39,1647–1655.
Armen, A. P., & Tsamardinos, I. (2011). A unified approach to esti-mation and control of the false discovery rate in bayesian networkskeleton identification. In European Symposium on Artificial Neu-
ral Networks, ESANN.Aussem, A., Rodrigues de Morais, S., & Corbex, M. (2012). Analysis
of nasopharyngeal carcinoma risk factors with bayesian networks.Artificial Intelligence in Medicine, 54.
Aussem, A., Tchernof, A., Rodrigues de Morais, S., & Rome, S.(2010). Analysis of lifestyle and metabolic predictors of visceralobesity with bayesian networks. BMC Bioinformatics, 11, 487.
Badea, A. (2004). Determining the direction of causal influence inlarge probabilistic networks: A constraint-based approach. In Pro-
ceedings of the Sixteenth European Conference on Artificial Intel-
ligence (pp. 263–267).Bernard, A., & Hartemink, A. (2005). Informative structure priors:
Joint learning of dynamic regulatory networks from multiple typesof data. In Proceedings of the Pacific Symposium on Biocomputing
(pp. 459–470).Blockeel, H., Raedt, L. D., & Ramon, J. (1998). Top-down induction
of clustering trees. In J. W. Shavlik (Ed.), International Conference
on Machine Learning, ICML (pp. 55–63). Morgan Kaufmann.Borchani, H., Bielza, C., Toro, C., & Larranaga, P. (2013). Predicting
human immunodeficiency virus inhibitors using multi-dimensionalbayesian network classifiers. Artificial Intelligence in Medicine,57, 219–229.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.Brown, L. E., & Tsamardinos, I. (2008). A strategy for making predic-
tions under manipulation. Journal of Machine Learning Research,
JMLR, 3, 35–52.Buntine, W. (1991). Theory refinement on Bayesian networks. In
B. D’Ambrosio, P. Smets, & P. Bonissone (Eds.), Uncertainty in
Artificial Intelligence, UAI (pp. 52–60). San Mateo, CA, USA:Morgan Kaufmann Publishers.
Cawley, G. (2008). Causal and non-causal feature selection for ridgeregression. JMLR: Workshop and Conference Proceedings, 3,107–128.
Cheng, J., Greiner, R., Kelly, J., Bell, D. A., & Liu, W. (2002). Learn-ing Bayesian networks from data: An information-theory basedapproach. Artificial Intelligence, 137, 43–90.
Chickering, D., Heckerman, D., & Meek, C. (2004). Large-samplelearning of Bayesian networks is NP-hard. Journal of Machine
Learning Research, JMLR, 5, 1287–1330.Chickering, D. M. (2002). Optimal structure identification with
greedy search. Journal of Machine Learning Research, JMLR, 3,507–554.
Cussens, J., & Bartlett, M. (2013). Advances in Bayesian NetworkLearning using Integer Programming. In Uncertainty in Artificial
Intelligence (pp. 182–191).DembczyÅski, K., Waegeman, W., Cheng, W., & Hullermeier, E.
(2012). On label dependence and loss minimization in multi-labelclassification. Machine Learning, 88, 5–45.
Ellis, B., & Wong, W. H. (2008). Learning causal bayesian network
structures from experimental data. Journal of the American Statis-
tical Association, 103, 778–789.Friedman, N., Nachman, I., , & Peer, D. (1999a). Learning bayesian
network structure from massive datasets: The sparse candidate al-gorithm. In Proceedings of the Fifteenth Conference on Uncer-
tainty in Artificial Intelligence (pp. 206–215).Friedman, N., Nachman, I., & Pe’er, D. (1999b). Learning bayesian
network structure from massive datasets: the ”sparse candidate”algorithm. In K. B. Laskey, & H. Prade (Eds.), Uncertainty in
Artificial Intelligence, UAI (pp. 21–30). Morgan Kaufmann Pub-lishers.
Gasse, M., Aussem, A., & Elghazel, H. (2012). Comparison of hybridalgorithms for bayesian network structure learning. In European
Conference on Machine Learning and Principles and Practice of
Knowledge Discovery in Databases, ECML-PKDD (pp. 58–73).Springer volume 7523 of Lecture Notes in Computer Science.
Gu, Q., Li, Z., & Han, J. (2011). Correlated multi-label feature selec-tion. In C. Macdonald, I. Ounis, & I. Ruthven (Eds.), CIKM (pp.1087–1096). ACM.
Guo, Y., & Gu, S. (2011). Multi-label classification using conditionaldependency networks. In International Joint Conference on Artifi-
cial Intelligence, IJCAI (pp. 1300–1305). AAAI Press.Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning
bayesian networks: The combination of knowledge and statisticaldata. Machine Learning, 20, 197–243.
Kocev, D., Vens, C., Struyf, J., & Dzeroski, S. (2007). Ensemblesof multi-objective decision trees. In J. N. Kok (Ed.), European
Conference on Machine Learning, ECML (pp. 624–631). Springervolume 4701 of Lecture Notes in Artificial Intelligence.
Koivisto, M., & Sood, K. (2004). Exact bayesian structure discov-ery in bayesian networks. Journal of Machine Learning Research,
JMLR, 5, 549–573.Kojima, K., Perrier, E., Imoto, S., & Miyano, S. (2010). Optimal
search on clustered structural constraint for learning bayesian net-work structure. Journal of Machine Learning Research, JMLR, 11,285–310.
Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models:
Principles and Techniques. MIT Press.Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In
International Conference on Machine Learning, ICML (pp. 284–292).
Liaw, A., & Wiener, M. (2002). Classification and regression by ran-domforest. R News, 2, 18–22.
Luaces, O., Dıez, J., Barranquero, J., del Coz, J. J., & Bahamonde,A. (2012). Binary relevance efficacy for multilabel classification.Progress in AI, 1, 303–313.
Madjarov, G., Kocev, D., Gjorgjevikj, D., & Dzeroski, S. (2012).An extensive experimental comparison of methods for multi-labellearning. Pattern Recognition, 45, 3084–3104.
Maron, O., & Ratan, A. L. (1998). Multiple-Instance Learning forNatural Scene Classification. In International Conference on Ma-
chine Learning, ICML (pp. 341–349). Citeseer volume 7.McCallum, A. (1999). Multi-label text classification with a mixture
model trained by em. In AAAI Workshop on Text Learning.Moore, A., & Wong, W. (2003). Optimal reinsertion: A new
search operator for accelerated and more accurate Bayesian net-work structure learning. In T. Fawcett, & N. Mishra (Eds.), Inter-
national Conference on Machine Learning, ICML.Nagarajan, R., Scutari, M., & Lbre, S. (2013). Bayesian Networks in
R: with Applications in Systems Biology (Use R!). Springer.Neapolitan, R. E. (2004). Learning Bayesian Networks. Upper Saddle
River, NJ: Pearson Prentice Hall.Ott, S., Imoto, S., & Miyano, S. (2004). Finding optimal models for
small gene networks. In Proceedings of the Pacific Symposium on
Biocomputing (pp. 557–567).
24
Pena, J., Nilsson, R., Bjorkegren, J., & Tegner, J. (2007). Towardsscalable and data efficient learning of Markov boundaries. Inter-
national Journal of Approximate Reasoning, 45, 211–232.Pena, J. M., Bjorkegren, J., & Tegner, J. (2005). Growing Bayesian
network models of gene networks from seed genes. Bioinformat-
ics, 40, 224–229.Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Net-
works of Plausible Inference.. San Francisco, CA, USA: MorganKaufmann.
Pena, J. M. (2008). Learning gaussian graphical models of gene net-works with false discovery rate control. In European Conference
on Evolutionary Computation, Machine Learning and Data Min-
ing in Bioinformatics 6 (pp. 165–176).Pena, J. M. (2012). Finding consensus bayesian network structures.
Journal of Artificial Intelligence Research, 42, 661–687.Perrier, E., Imoto, S., & Miyano, S. (2008). Finding optimal bayesian
network given a super-structure. Journal of Machine Learning Re-
search, JMLR, 9, 2251–2286.Peer, D., Regev, A., Elidan, G., & Friedman, N. (2001). Inferring
subnetworks from perturbed expression profiles. Bioinformatics,17, 215–224.
Prestat, E., Rodrigues de Morais, S., Vendrell, J., Thollet, A., Gautier,C., Cohen, P., & Aussem, A. (2013). Learning the local Bayesiannetwork structure around the ZNF217 oncogene in breast tumours.Computers in Biology and Medicine, 4, 334–341.
R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org.

Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD (pp. 254–269). Springer, volume 5782 of Lecture Notes in Computer Science.

Rodrigues de Morais, S., & Aussem, A. (2010a). An efficient learning algorithm for local Bayesian network structure discovery. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD (pp. 164–169).

Rodrigues de Morais, S., & Aussem, A. (2010b). A novel Markov boundary based feature subset selection algorithm. Neurocomputing, 73, 578–584.

Roth, V., & Fischer, B. (2007). Improved functional prediction of proteins by learning kernel combinations in multilabel settings. BMC Bioinformatics, 8, S12+.

Schwarz, G. E. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.

Scutari, M. (2010). Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35, 1–22.

Scutari, M. (2011). Measures of Variability for Graphical Models. Ph.D. thesis, School in Statistical Sciences, University of Padova.

Scutari, M., & Brogini, A. (2012). Bayesian network structure learning with permutation tests. Communications in Statistics - Theory and Methods, 41, 3233–3243.
Scutari, M., & Nagarajan, R. (2013). Identifying significant edges in graphical models of molecular networks. Artificial Intelligence in Medicine, 57, 207–217.

Silander, T., & Myllymäki, P. (2006). A Simple Approach for Finding the Globally Optimal Bayesian Network Structure. In Uncertainty in Artificial Intelligence, UAI (pp. 445–452).

Snoek, C., Worring, M., Gemert, J. V., Geusebroek, J., & Smeulders, A. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In ACM International Conference on Multimedia (pp. 421–430). ACM Press.

Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search (2nd ed.). The MIT Press.
Spolaôr, N., Cherman, E. A., Monard, M. C., & Lee, H. D. (2013). A comparison of multi-label feature selection methods using the problem transformation approach. Electronic Notes in Theoretical Computer Science, 292, 135–151.

Studený, M., & Haws, D. (2014). Learning Bayesian network structure: Towards the essential graph by integer linear programming tools. International Journal of Approximate Reasoning, 55, 1043–1071.
Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multi-label classification of music into emotions. In ISMIR (pp. 325–330).

Tsamardinos, I., Aliferis, C., & Statnikov, A. (2003). Algorithms for large scale Markov blanket discovery. In Florida Artificial Intelligence Research Society Conference, FLAIRS'03 (pp. 376–381).

Tsamardinos, I., & Borboudakis, G. (2010). Permutation testing improves Bayesian network learning. In European Conference on Machine Learning and Knowledge Discovery in Databases, ECML-PKDD (pp. 322–337).
Tsamardinos, I., Brown, L., & Aliferis, C. (2006). The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65, 31–78.

Tsamardinos, I., & Brown, L. E. (2008). Bounding the false discovery rate in local Bayesian network learning. In AAAI Conference on Artificial Intelligence (pp. 1100–1105).

Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010a). Mining multi-label data. In Data Mining and Knowledge Discovery Handbook (pp. 667–685). Springer.

Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010b). Random k-labelsets for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, TKDE, 23, 1–12.

Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multilabel classification. In Proceedings of the 18th European Conference on Machine Learning (Vol. 4701, pp. 406–417).

Villanueva, E., & Maciel, C. (2012). Optimized algorithm for learning Bayesian network superstructures. In International Conference on Pattern Recognition Applications and Methods, ICPRAM.

Villanueva, E., & Maciel, C. D. (2014). Efficient methods for learning Bayesian network super-structures. Neurocomputing, (pp. 3–12).

Zhang, M. L., & Zhang, K. (2010). Multi-label learning by exploiting label dependency. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD (pp. 999–1008). ACM Press.
Zhang, M.-L., & Zhou, Z.-H. (2006). Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18, 1338–1351.