Meta-path-based Collective Classification in Heterogeneous ...xkong/cikm12_slides.pdfto diﬀerent meta paths. Empirical studies on real-world net - works demonstrate that e ﬀectiveness

Xiangnan Kong, Philip S. Yu, Ying Ding, David J. Wild

Meta-path-based Collective Classification in Heterogeneous Information Networks

University of Illinois at Chicago Indiana University of Bloomington

2

Collective Classification q  Conventional classification approaches assume that instances are independent identically distributed (i.i.d.)

x2 y2

x1 y1

label instance

independent

x2 y2

x1 y1

x3 y3

related

q  In network data, instances are interconnected through links, thus are correlated with each other.

3

Meta Path-Based Collective Classification inHeterogeneous Information Networks

Xiangnan Kong! Philip S. Yu!§ Ying Ding† David J. Wild†

!University of Illinois at Chicago, Chicago, IL, USA†Indiana University Bloomington, Bloomington, IN, USA

§Computer Science Department, King Abdulaziz University, Jeddah, Saudi [email protected], [email protected], {dingying, djwild}@indiana.edu

ABSTRACTCollective classification approaches exploit the dependenciesof a group of linked objects whose class labels are correlatedand need to be predicted simultaneously. In this paper,we focus on studying the collective classification problemin heterogeneous networks, which involves multiple types ofdata objects interconnected by multiple types of links. In-tuitively, two objects are correlated if they are linked bymany paths in the network. By considering di!erent link-age paths in the network, one can capture the subtlety ofdi!erent types of dependencies among objects. We intro-duce the concept of meta-path based dependencies amongobjects, where a meta path is a path consisting a certainsequence of linke types. We show that the quality of col-lective classification results strongly depends upon the metapaths used. To accommodate the large network size, a novelsolution, called Hcc (meta-path based Heterogenous Collec-tive Classification), is developed to e!ectively assign labelsto a group of instances that are interconnected through dif-ferent meta-paths. The proposed Hcc model can capturedi!erent types of dependencies among objects with respectto di!erent meta paths. Empirical studies on real-world net-works demonstrate that e!ectiveness of the proposed metapath-based collective classification approach.

Categories and Subject DescriptorsH.2.8 [Database Management]: Database Applications-Data Mining

KeywordsHeterogeneous information networks, Meta path

1. INTRODUCTIONCollective classification methods [6] aim at exploiting the

label autocorrelation among a group of inter-connected in-stances and predict their class labels collectively, instead

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.CIKM’12, October 29–November 2, 2012, Maui, HI, USA.Copyright 2012 ACM 978-1-4503-1156-4/12/10 ...$10.00.

Paper 1

Paper 2

Paper 3

Paper 4

Paper 5

Paper 6

Paper 7

Paper 8

Conference 1 Conference 2

Author 1 Author 2 Author 3 Author 4 Author 5 Author 6

Proceeding 1 Proceeding 2 Proceeding 3 Proceeding 4

Institute 1 Institute 2 Institute 3

citation citation

citationcitation

publishIn publishIn publishIn publishIn

affiliation affiliation affiliation

authoredBy

authoredBy

authoredBy

collectedIn collectedIn

Figure 1: A Heterogeneous Information Network

of independently. The dependencies among the related in-stances should be considered explicitly during classificationprocess. Most approaches in collective classification focuson exploiting the dependencies among interconnected ob-jects in homogeneous networks. However, many real-worldapplications are facing large scale heterogeneous informationnetworks [5] with multiple types of objects inter-connectedthrough multiple types links. These networks are multi-mode and multi-relational networks, which involves largeamount of information. For example, Figure 1 involves abibliographic network with five types of nodes (papers, au-thor, a"liations, conference and proceedings) and five typesof links.

In this paper, we focus on studying the problem of collec-tive classification on one type of nodes within a heterogenousinformation network, e.g., classifying the paper nodes col-lectively in Figure 1. Formally, the collective classificationproblem in heterogeneous information networks correspondsto predicting the labels of a group of related instances simul-taneously. If we consider collective classification and hetero-geneous information networks as a whole, the major researchchallenges can be summarized as follows:

Multi-Mode and Multi-Relational Data: One fun-damental problem in classifying heterogeneous informationnetworks is that the network structure involves multipletypes of nodes and multiple types of links. For example, inFigure 1, one paper node can be linked directly with di!er-ent types of objects, such as authors, conference proceedingsand other papers, through di!erent types of links, such ascitation, authoredBy, etc. Trivial application of conventionalmethods by ignoring the link types and node types can notfully exploit the structural information within a heteroge-neous information network.

Heterogeneous Information Networks

4











Paper 1

Paper 2

Paper 3

Paper 4

Paper 5

Paper 6

Paper 7

Paper 8





citation citation

citationcitation



authoredBy

authoredBy

authoredBy







Title+ Abstract

attributeACM

Index Term

class labelPaper(12.5K)

Citation(6.1K)

publishIn(12.5K)

Proceeding(196)

Conference(18)

authoredBy(37K)

Author(17K)

Institute(1.8K)

(a) ACM Conference Datasets

Figure 7: Schema of datasets

tive classification, we designed a similar inference procedurefor our Hcc method as shown in Figure 5. (1) For inferencesteps, the labels of all the unlabeled instances are unknown.We first bootstrap an initial set of label estimation for eachinstance using content attributes of each node. In our cur-rent implementation, we simply set the relational featuresof unlabeled instances with zero vectors. Other strategiesfor bootstrap can also be used in this framework. (2) Itera-tive Inference: we iteratively update the relational featuresbased on the latest predictions and then these new featuresare used to update the prediction of local models on each in-stance. The iterative process terminates when convergencecriteria are met. In our current implementation, we updatethe variable Yi in the (r+1)-th iteration ( say Y

(r+1)i ) using

the predicted values in the r-th iteration (Y (r)j ) only.

5. EXPERIMENTS5.1 Data CollectionIn order to validate the collective classification performances,we tested our algorithm on three real-world heterogeneousinformation networks (Summarized in Table 3).

ACM Conference Dataset: Our first dataset studied inthis paper was extracted from ACM digital library1 in June2011. ACM digital library provides detailed bibliographicinformation on ACM conference proceedings, including pa-per abstracts, citation, author information etc. We extracttwo ACM sub-networks containing conference proceedingsbefore the year 2011.

•The first subset, i.e., ACM Conference-A, involves 14 con-ferences in computer science: SIGKDD, SIGMOD, SIGIR,SIGCOMM, CIKM, SODA, STOC, SOSP, SPAA, Mobi-COMM, VLDB, WWW, ICML and COLT. The networkschema is summarized in Figure 7(a), which involves fivetypes of nodes and five types of relations/links. This networkincludes 196 conference proceedings (e.g., KDD’10, KDD’09,etc.), 12.5K papers, 17K authors and 1.8K authors’ a!lia-tions. On each paper node, we extract bag-of-words repre-sentation of the paper title and abstract to use as contentattributes. The stop-words and rare words that appear in

1http://dl.acm.org/

less than 100 papers are removed from the vocabulary. Eachpaper node in the network is assigned with a class label,indicating the ACM index term of the paper including 11categories. The task in this dataset is to classify the pa-per nodes based on both local attributes and the networkinformation.

•The second subset, i.e., ACM Conference-B, involves an-other 12 conferences in computer science: ACM Multime-dia, OSDI, GECCO, POPL, PODS, PODC, ICCAD, ICSE,ICS, ISCA, ISSAC and PLDI. The network includes 196corresponding conference proceedings, 10.8K papers, 16.8Kauthors and 1.8K authors’ a!liations. After removing stop-words in the paper title and abstracts, we get 0.4K termsthat appears in at least 1% of the papers. The same setupswith ACM Conference-A dataset are also used here to buildthe second heterogeneous network.

DBLP Dataset: The third dataset, i.e., DBLP four ar-eas2 [6], is a bi-type information network extracted fromDBLP3, which involves 20 computer science conferences andauthors. The relationships involve conference-author linksand co-author links. On the author nodes, a bag-of-wordsrepresentation of all the paper titles published by the authoris used as attributes of the node. Each author node in thenetwork is assigned with a class label, indicating researcharea of the author. The task in this dataset is to classify theauthor nodes based on both local attributes and the networkinformation. For detailed description of the DBLP dataset,please refer to [6].

5.2 Compared MethodsIn order to validate the e"ectiveness of our collective classi-fication approach, we test with following methods:

• Heterogeneous Collective Classification (Hcc): We firsttest our proposed method, Hcc, for collective classificationin heterogeneous information networks. The proposed ap-proach can exploit dependencies based on multiple metapaths for collective classification.• Homogeneous Collective Classification (Ica): This methodis our implementation of the ICA (Iterative ClassificationAlgorithm) [9] by only using homogeneous network infor-mation for collective classification. In the homogeneous in-formation networks, only paper-paper links in ACM datasetsand author-author links in DBLP dataset are used.• Combined Path Relations (Cp): We compare with a base-line method for multi-relational collective classification [3]:We first convert the heterogeneous information networksinto multiple relational networks with one type of nodes andmultiple types of links. Each link type corresponds to a metapath in the Hcc method. Then, the Cp method combinesmultiple link types into a homogeneous network by ignoringthe link types. We then train one Ica model to performcollective classification on the combined network.• Collective Ensemble Classification (Cf): We compare withanother baseline method for multi-relational collective clas-sification. This method is our implementation of the collec-tive ensemble classification [3], which trains one collective

2http://www.cs.illinois.edu/homes/mingji1/DBLP_four_area.zip3http://www.informatik.uni-trier.de/~ley/db/

5

The Problem

� Multiple types of dependencies: different types of dependencies have different semantic meanings

� Collective classification

The key is to exploit the heterogeneous correlations among different instances

6











Paper 1

Paper 2

Paper 3

Paper 4

Paper 5

Paper 6

Paper 7

Paper 8





citation citation

citationcitation



authoredBy

authoredBy

authoredBy







“Citation”

7











Paper 1

Paper 2

Paper 3

Paper 4

Paper 5

Paper 6

Paper 7

Paper 8





citation citation

citationcitation



authoredBy

authoredBy

authoredBy







“sharing authors”

8

Paper 1

Paper 2

Paper 3

Paper 4

Author 1 Author 2Institute 1

affiliationaffiliation -1

Author 2 Institute 1

authoredBy

Author 1

-1authoredBy

authoredBy

affiliation affiliation -1-1authoredBy

Figure 3: Path instances corresponding to the metapath PAFAP .

citation links in paper classification tasks, co-author linksin expert classification tasks. These methods can modelPr(Yi|xi,YP(i)). Here YP(i) denotes the vector containingall variable Yj (!j " P(i)), and P(i) denotes the index setof related instances to the i-th instance through meta pathP . Hence, by considering the single type of dependencies,we will have

Pr(Y|X ) #!

i!U

Pr(Yi|xi,YP(i))

Meta Path-based Dependency

In heterogeneous information networks, there are complexdependencies not only among instances directly linked throughlinks, but also among instances indirectly linked through dif-ferent meta paths. In order to solve the collective classifi-cation problem more e!ectively, in this paper, we explicitlyconsider di!erent types of meta-path based dependencies inheterogeneous information networks. Meta path-based de-pendences refer to the dependencies among instances thatare inter-connected through a meta path.

To the best of our knowledge, meta path-based dependencieshave not been studied in collective classification researchbefore. Given a set of meta paths S = {P1, · · · ,Pm}, themeta path-based dependency models are shown in Figure 2,i.e., Pr(Yi|xi,YP1(i),YP2(i), · · · ,YPm(i)). Pj(i) denotes theindex set of related instances to the i-th instance throughmeta path Pj .

For each meta path, one instance can be connected withmultiple related instances in the network. For example,in Figure 3, Paper 1 is correlated with Paper 2, 3 and4 through meta path Pi = PAFAP , i.e., Pi(Paper1) ={Paper 2, 3, 4}. Hence, by considering meta path-based de-pendencies, we will have

Pr(Y|X ) =!

i!U

Pr"

Yi|xi,YP1(i),YP2(i), · · · ,YPm(i)

#

4. METAPATH-BASEDCOLLECTIVECLAS-SIFICATION

For classifying target nodes in a heterogeneous informationnetwork, the most naıve approach is to approximate Pr(Y|X ) #$

i!U Pr(Yi|xi) with the assumptions that all instances areindependent from each other. However, this approach canbe detrimental to their performance for many reasons. Thisis particularly troublesome when nodes in heterogeneousnetworks have very complex dependencies with each otherthrough di!erent meta paths.

In this section, we propose a simple and e!ective algorithm

Conference

Author Proceeding

Institute

citationpublishInauthoredBy

-1citation

Paper

collectedInaffiliationcitation

-1citation

Author Proceeding

collectedInaffiliation

-1

-1

publishInauthoredBy-1

-1

authoredBy

Paper

Paper Paper

Paper

Paper

Paper Paper

Paper

publishIn

-1

Paper Paper

Figure 4: An example of dependence tree for metapath-based dependencies. Each paper node corre-sponds to a unique type of path-based dependenciesin the network.

for meta path-based collective classification in heterogeneousinformation networks. We aim to develop a model to esti-mate the probabilities Pr

"

Yi|xi,YP1(i), · · · ,YPm(i)

#

. Wefirst introduce how the extract the set of meta paths froma heterogeneous information network, then propose our col-lective classification algorithm, called Hcc (HeterogeneousCollective Classification).

We first consider how to extract all meta paths in a heteroge-neous information network of bounded length !max. When!max is small, we can easily generate all possible meta pathsas follows: We can organize all the type-correct relationsinto a prefix tree, called dependence tree. In Figure 4, weshow an example of dependence tree in ACM conferencenetworks. The target nodes for classification are the papernodes, and each paper node in the dependence tree corre-sponds to an unique meta path, indicating one type of de-pendencies among paper instances. However, in general thenumber of meta paths grows exponentially with the maxi-mum path length !max. As it has been showed in [15], longmeta paths may not be quite useful in capturing the link-age structure of heterogeneous information networks. In thispaper, we only exploit the instance dependences with shortmeta paths (!max = 4).

In many really world network data, exhaustively extractingall meta paths may result in large amount of redundant metapaths, e.g., PV PV P . Including redundant meta paths in acollective classification model can result in overfitting risks,because of additional noisy features. Many of the redun-dant meta paths are constructed by combining two or moremeta paths, e.g., meta path PV PV P can be constructedby two PV P paths. In order to reduce the model’s overfit-ting risk, we extract all meta paths that cannot be decom-posed into shorter meta paths (with at least one non-trivialmeta paths). Here non-trivial meta paths refer to the pathswith lengths greater than 1. For example, in ACM confer-ence network, meta paths like P$PAP can be decomposedinto P$P and PAP , thus will be excluded from our metapath set. In Figure 5, we the meta path set extract pro-cess as the “Initialization” step of our proposed method. Bybreadth-first search on the dependence tree, our model firstselect shortest meta paths from the network. Then longermeta paths are incrementally selected into path set S until

Example III

P – A – I – A - P “Same School”

9

Table 1: Semantics of Meta Paths among Paper Nodes

Notation Meta Path Semantics of the Dependency

1 P!P Papercite"""! Paper Citation

2 P#P!P Papercite!1

"""""! Papercite"""! Paper Co-citation

3 P!P#P Papercite"""! Paper

cite!1

"""""! Paper Bibliographic coupling

4 PVP PaperpublishIn"""""""! Proceeding

publishIn!1

"""""""""! Paper Papers in the same proceeding

5 PVCVP PaperpublishIn"""""""! Proceeding

collectIn""""""! Conference

collectIn!1

""""""""! ProceedingpublishIn!1

"""""""""! Paper Papers in the same conference

6 PAP Paperwrite!1

""""""! Authorwrite""""! Paper Papers sharing authors

7 PAFAP Paperwrite!1

""""""! Authoraffiliation""""""""! Institute

affiliation!1

""""""""""! Authorwrite""""! Paper Papers from the same institute

Our work is related to both collective classification tech-niques on relational data and heterogeneous information net-works. We briefly discuss both of them.

Collective classification of relational data has been inves-tigated by many researchers. The task is to predict theclasses for a group of related instances simultaneously, ratherthan predicting a class label for each instance independently.In relational datasets, the class label of one instance canbe related to the class labels (sometimes attributes) of theother related instances. Conventional collective classifica-tion approaches focus on exploiting the correlations amongthe class labels of related instances to improve the classifi-cation performances. Roughly speaking, existing collectiveclassification approaches can be categorized into two typesbased upon the di!erent approximate inference strategies:(1) Local methods: The first type of approaches employ alocal classifier to iteratively classify each unlabeled instanceusing both attributes of the instances and relational featuresderived from the related instances. This type of approachesinvolves an iterative process to update the labels and therelational features of the related instances, e.g. iterativeconvergence based approaches [12, 9] and Gibbs samplingapproaches [11]. Many local classifiers have been used forlocal methods, e.g. logistic regression [9], Naive Bayes [12],relational dependency network [13], etc. (2) Global meth-ods: The second type of approaches optimizes global ob-jective functions on the entire relational dataset, which alsouses both attributes and relational features for inference [18].For a detailed review of collective classification please referto [14].

Heterogeneous information networks are special kinds of in-formation networks which involve multiple types of nodesor multiple types of links. In a heterogeneous informationnetwork, di!erent types of nodes and edges have di!erentsemantic meanings. The complex and semantically enrichednetwork possesses great potential for knowledge discovery.In the data mining domain, heterogeneous information net-works are ubiquitous in many applications, and have at-tracted much attention in the last few years [17, 16, 5]. Sunet al. [17, 15] studied the clustering problem and top-k simi-larity problem in heterogeneous information networks. Minget al. studied a specialized classification problem on hetero-geneous networks, where di!erent types of nodes share a

same set of label concepts [5]. However, these approachesare not directly applicable in collective classification prob-lems, since focus on convention classification tasks withoutexploiting the meta path-based dependencies among objects.

3. PROBLEM DEFINITIONIn this section, we first introduce several related conceptsand notations. Then, we will formally define the collec-tive classification problem in heterogeneous information net-works..

Definition 1. Heterogeneous Information Network:A heterogeneous information network [17, 15] is a specialkind of information network, which is represented as a di-rected graph G = (V, E). V is the set of nodes, including ttypes of objects T1 = {v11, · · · , v1n1

} , · · · , Tt = {vt1, · · · , vtnt}.E ! V " V is the set of links between the nodes in V, whichinvolves multiple types of links.

Example 1. ACM conference network: A heteroge-neous information network graph is provided in Figure 1.This network involves five types of objects, i.e., papers (P),authors (A), institutes (F), proceedings (V) and conferences(C), and five types of links, i.e., citation, authoredBy, a!li-ation, publishedIn and collectedIn.

Di!erent from conventional networks, heterogeneous infor-mation networks involve di!erent types of objects (e.g., pa-pers and conference) that are connected with each otherthrough multiple types of links. Each type of links repre-sents an unique binary relation R from node type i to nodetype j, where R(vip, vjq) holds i! object vip and vjq are re-lated by relation R. R!1 denotes the inverted relation of R,which holds naturally for R!1(vjq , vip). Let dom(R) = Ti

denote the domain of relation R, rang(R) = Tj denotes itsrange. R(a) = {b : R(a, b)}. For example, in Figure 1, thelink type “authorBy” can be written as a relation R betweenpaper nodes and author nodes. R(vip, vjq) holds i! authorvjq is one of the authors for paper vip. For convenience, we

can write this link type as “paperauthoredBy########$ author” or

“TiR#$ Tj”.

In heterogenous information networks, objects are also inter-

Different Types of Dependencies

10

Paper 1

Paper 2

Paper 3

Paper 4




authoredBy

Author 1

-1authoredBy

authoredBy




Pr(Y|X ) #!

i!U

Pr(Yi|xi,YP(i))





Pr(Y|X ) =!

i!U

Pr"


#





Conference

Author Proceeding

Institute


-1citation

Paper


-1citation

Author Proceeding


-1

-1


-1

authoredBy

Paper

Paper Paper

Paper

Paper

Paper Paper

Paper

publishIn

-1

Paper Paper



"


#




Different types of Dependencies “Citation”

11

Paper 1

Paper 2

Paper 3

Paper 4




authoredBy

Author 1

-1authoredBy

authoredBy




Pr(Y|X ) #!

i!U

Pr(Yi|xi,YP(i))





Pr(Y|X ) =!

i!U

Pr"


#





Conference

Author Proceeding

Institute


-1citation

Paper


-1citation

Author Proceeding


-1

-1


-1

authoredBy

Paper

Paper Paper

Paper

Paper

Paper Paper

Paper

publishIn

-1

Paper Paper



"


#




Example II “sharing authors”

12

Paper 1

Paper 2

Paper 3

Paper 4




authoredBy

Author 1

-1authoredBy

authoredBy




Pr(Y|X ) #!

i!U

Pr(Yi|xi,YP(i))





Pr(Y|X ) =!

i!U

Pr"


#





Conference

Author Proceeding

Institute


-1citation

Paper


-1citation

Author Proceeding


-1

-1


-1

authoredBy

Paper

Paper Paper

Paper

Paper

Paper Paper

Paper

publishIn

-1

Paper Paper



"


#




Example III “Same School”

13

Table 2: Important Notations.

Symbol Definition

V =!t

i=1 Ti the set of nodes, involving t types of nodesE = {ei ! V " V} the set of edges or links

X = {x1, · · · ,xn1} the given attribute values for each node in target type T1

Y = {Y1, · · · , Yn1} the set of variables for labels of the nodes in T1, and Yi ! C

L and U the sets for training nodes and testing nodes, and L # U = T1

yi the given label for node v1i ! L, and Yi = yiS = {P1, · · · ,Pm} the set of meta paths

Pj(i) = {k|Pj(v1i, v1k)} the index set of all related instances to xi through meta path Pj

connected through indirect links, i.e., paths. For example,in Figure 1, paper 1 and paper 4 are linked through a se-

quence of edges: “paper1authoredBy$$$$$$$$% author1

authoredBy!1

$$$$$$$$$$%paper4”. In order to categorize these paths, we extend thedefinition of link types to “path types”, which are named asmeta path, similar to [15, 8].

Definition 2. Meta Path: A meta path P represents asequence of relations R1, · · · , R! with constrains that &i !{1, · · · , ! $ 1}, rang(Ri) = dom(Ri+1). The meta path P

can also be written as P : T1R1$$% T2

R2$$% · · ·R!$$% T!+1, i.e.,

P corresponds to a composite relation R1 " R2 " · · · " R!

between node type T1 and T!+1. dom(P ) = dom(R1) andrang(P ) = rang(R!). The length of P is !, i.e., the numberof relations in P .

Di!erent meta paths usually represent di!erent semantic re-lationships among linked objects. In Table 1, we show someexamples of meta paths with their corresponding semantics.Most conventional relationships studied in network data cannaturally be captured by di!erent meta paths. For exam-ple, the paper co-citation relation [2] can naturally be repre-

sented by meta path “papercite!1

$$$$% papercite$$% paper”, and

the co-citation frequencies can be written as the number ofpath instances for the meta path. Here a path instance ofP , denoted as p ! P , is an unique sequence of nodes andlinks in the network that follows the meta path constrains.For convenience, we use the node type sequence to repre-sent a meta path, i.e., P = T1T2 · · · Tl+1. For example, we

use PAP to represent the meta path “paperauthoredBy$$$$$$$$%

authorauthoredBy!1

$$$$$$$$$$% paper”. Note that for meta paths in-volving citation links, we explicitly add arrows to representthe link directions, e.g., the paper co-citation path can bewritten as P 'P %P .

Collective Classification in Heterogeneous Informa-tion Networks

In this paper, we focus on studying the collective classifica-tion problem on one type of objects, instead of on all types ofnodes in heterogeneous information networks. This problemsetting exists in a wide variety of applications. The reasonsare as follows: in heterogenous information networks, the la-bel space of di!erent types of nodes are quite di!erent, wherewe can not assume all types of node share the same set oflabel concepts. For example, in medical networks, the labelconcepts for patient classification tasks are only defined onpatient nodes, instead of doctor nodes or medicine nodes. In

xi Y i

xj Y j

xk Y k

Meta Path1

Meta Pathm

Figure 2: Meta path-based dependencies in collec-tive classification for heterogeneous information net-works. Yi with double circles denotes the currentlabel variable to be predicted. Each rectangle rep-resents a group of instances following the same metapath. xi denotes the attribute values of the instance.

a specific classification task, we usually only care about theclassification results on one type of node. Without loss ofgenerality, we assume the node type T1 is the target objectswe need to classify. Suppose we have n nodes in T1. Oneach node v1i ! T1, we have a vector of attributes xi ! R

d

in the d-dimensional input space, and X = {x1, · · · ,xn1}.

Let C = {c1, c2, · · · , cq} be the q possible class labels. Oneach node v1i ! T1, we also have a label variable Yi ! Cindicating the class label assigned to node v1i, Y = {Yi}

n1

i=1.

Assume further that we are given a set of known valuesYL for nodes in a training set L ( T1, and L denotes theindex set for training data. YL = {yi|i ! L}, where yi !C is the observed labels assigned to node x1i. Then thetask of collective classification in heterogeneous informationnetworks is to infer the values of Yi ! YU for the remainingnodes in the testing set (U = T1 $ L).

As reviewed in Section 2, the inference problem in classifi-cation tasks is to estimate Pr(Y|X ) given a labeled trainingset. Conventional classification approaches usually requirei.i.d. assumptions, the inference for each instance is per-formed independently:

Pr(Y|X ) )"

i!U

Pr(Yi|xi)

Homogeneous Link-based Dependency

In collective classification problems, the labels of related in-stances are not independent, but are closely related witheach other. Conventional approaches focus on exploitinglabel dependencies corresponding to one types of homoge-neous links to improve the classification performances, e.g.,

Paper 1

Paper 2

Paper 3

Paper 4




authoredBy

Author 1

-1authoredBy

authoredBy




Pr(Y|X ) #!

i!U

Pr(Yi|xi,YP(i))





Pr(Y|X ) =!

i!U

Pr"


#





Conference

Author Proceeding

Institute


-1citation

Paper


-1citation

Author Proceeding


-1

-1


-1

authoredBy

Paper

Paper Paper

Paper

Paper

Paper Paper

Paper

publishIn

-1

Paper Paper



"


#




Paper 1

Paper 2

Paper 3

Paper 4




authoredBy

Author 1

-1authoredBy

authoredBy




Pr(Y|X ) #!

i!U

Pr(Yi|xi,YP(i))





Pr(Y|X ) =!

i!U

Pr"


#





Conference

Author Proceeding

Institute


-1citation

Paper


-1citation

Author Proceeding


-1

-1


-1

authoredBy

Paper

Paper Paper

Paper

Paper

Paper Paper

Paper

publishIn

-1

Paper Paper



"


#




Meta-path-based Dependencies

14

Experiments: Compared Methods

Independent classification SVM

Meta-path-based collective classification HCC the proposed approach [this paper] HCC-ceiling a ceiling analysis [this paper]

Collective classification ICA iterative classification algorithm [Lu&Getoor, ICML’03]

CP Combined Path Relations CF Ensemble of path relations [Eldardiry&Neville, AAAI’11]

15

0.62

0.66

0.70

0.74

SVM ICA CP CF HCC HCC−ceiling HCC−all

Accuracy

(a) ACM Conference-A

0.54

0.58

0.62

0.66

0.68

SVM ICA CP CF HCC HCC−ceiling HCC−all

Accuracy

(b) ACM Conference-B

0.80

0.85

0.90

0.95

SVM ICA CP CF HCC HCC−ceiling

Accuracy

(c) DBLP

Figure 8: Collective classification results.

classification model on each link types. We use the samesetting of the Cp method to extract multi-relational net-works. Then we use Ica as the base models for collectiveclassification. In the iterative inference process, each modelvote for the class label of each instance, and prediction ag-gregation was performed in each iteration. Thus this processis also called collective fusion, where each base model cana!ect each other in the collective inference step.

Table 3: Summary of experimental datasets.

Data Sets

Characteristics ACM-A ACM-B DBLP

# Feature 1,903 376 1,618# Instance 12,499 10,828 4,236# Node Type 5 5 2# Link Type 5 5 2# Class 11 11 4

• Ceiling of Hcc (Hcc-ceiling): One claim of this paper isthat Hcc can e!ectively infer the labels of linked unlabeledinstances using iterative inference process. To evaluate thisclaim, we include a model which use the ground-truth labelsof the related instances during the inference. This methodillustrate a ceiling performance of Hcc can possibly achieveby knowing the true label of related instances.• Hcc with all meta-paths (Hcc-all): Another claim ofthis paper is that selected meta path in Hcc can e!ectivelycapture the dependencies in heterogeneous information net-works and avoiding overfitting. To evaluate this claim, weinclude a model which uses all possible meta paths witha maximum path length of 5. This method illustrates theperformance of Hcc if we exhaustively involves all possiblepath-based dependencies without selection.

We use LibSVM [1] with linear kernel as the base classifierfor all the compared methods. The maximum number ofiteration all methods are set as 10. All experiments areconducted on machines with Intel XeonTMQuad-Core CPUsof 2.26 GHz and 24 GB RAM.

5.3 Performances of Collective ClassificationIn our first experiment, we evaluate the e!ectiveness of theproposed Hcc method on collective classification. 10 times3-fold cross validations are performed on each heterogeneousinformation network to evaluate the collective classificationperformances. We report the detailed results in Figure 8. Itshows the performances of the six methods on three datasetswith box plots, including the smallest/largest results, lowerquartile, median and upper quartile.

The first observation we have in Figure 8 is as follows: al-most all the collective classification methods that explic-itly exploit the label dependencies from various aspects, canachieve better classification accuracies than the baseline Svm,which classify each instance independently. These resultscan support the importance of collective classification by ex-ploiting the di!erent types of dependencies in network data.For example, Ica outperformances Svm by exploiting auto-correlation among instances while considering only one typeof links, i.e., citation links in ACM datasets and co-authorlinks in the DBLP dataset. Cf and Cp methods can alsoimprove the classification performances by exploiting mul-tiple types of dependencies. Similar results have also beenreported in collective classification literatures.

Then we find that our meta path-based collective classifi-cation method (Hcc) consistently and significantly outper-form other baseline methods. Hcc can utilize the metapath-based dependencies to exploit the heterogenous net-

Experiment Results, ACM dataset

16

Conclusions

� Meta-path-based Collective Classification in Heterogeneous Information Networks �  Propose an algorithm to exploit the dependencies

among heterogeneous dependencies among related instances v Meta-path-based Dependency

�  Classification performances can be improved by considering the heterogeneous dependencies among instances. Thank you!

Documents

Meta-path-based Collective Classification in Heterogeneous ...xkong/cikm12_slides.pdfto diﬀerent meta paths. Empirical studies on real-world net - works demonstrate that e ﬀectiveness