A Survey on Mining Heterogeneous Information Network ∗

Jing Yan†
Department of Computer Science
University of Hong Kong
Pokfulam Road, Hong Kong
[email protected]

Xiaodong Li†
Department of Computer Science
University of Hong Kong
Pokfulam Road, Hong Kong
[email protected]

ABSTRACT
The world we live in is connected. In other words, most real-world applications, such as interconnected social media and social networks, scientific, engineering, and medical information systems, online e-commerce systems, and most database systems, can be structured as networks. Multiple object types and multiple kinds of interactions make such a network a semi-structured Heterogeneous Information Network. Recognizing the need to analyze such networks, many recent papers study the Heterogeneous Information Network and leverage network analysis approaches to reveal the rich information in the network. This work mainly focuses on clustering, ranking, and some measurement models. In this paper, we provide a survey of heterogeneous information network analysis. We introduce the basic concepts of heterogeneous information network analysis, examine its developments on different data mining tasks, discuss some advanced topics, and point out some future research directions.

1. INTRODUCTION
The world is interconnected through multiple relations among multiple objects. Networks (or graphs) can model real-world entities and their relationships as objects and links. There has been rich research on the analysis of homogeneous information networks (in which all objects and links are of a single type), which only partially reflect the real world. To explore the semantic information among nodes, much work has been done on homogeneous information networks, including link mining and analysis [8], social network analysis [34], hypertext and web mining [7], network science [17], and graph mining [9].

Most of this work is based on the assumption that all objects in the network are of a single type. However, most real-life application systems contain more than one type of object. For example, the Facebook Open Graph¹ contains objects of different types, such as messages, groups, locations, and posts. Complex knowledge bases have also developed rapidly. For instance, Yago² is a knowledge base that captures information derived from Wikipedia, WordNet and GeoNames. Yago is a repository of information on more than 10 million objects (such as persons, organizations, and cities), and it records more than 120 million facts about these entities. Previous research based on the homogeneous information network is therefore less able to describe or model such systems properly, and much research has turned to the Heterogeneous Information Network, which models these systems more accurately. A Heterogeneous Information Network (HIN) is a network whose objects are of different types and whose links represent different kinds of relationships between objects. Figure 1 shows two simple instances of a Heterogeneous Information Network. Compared to the widely used homogeneous information network, the heterogeneous information network can effectively fuse more information and contains rich semantics in its nodes and links, and thus it forms a new direction in data mining.

Figure 1: An instance of a Heterogeneous Information Network. Panel (a) contains the objects M1-M4, A1-A3, D1-D2, and P1-P2; panel (b) contains the nodes M, D, P, and A.

∗ This work is submitted for the course COMP8101 Advanced Topics in Data Engineering. Both authors contributed equally to this paper.

In this paper, we discuss popular topics including ranking and clustering, meta-path-based HIN research, and other new models for the Heterogeneous Information Network. We also introduce some interesting applications based on the meta-path, such as co-author prediction, personalized recommendation, and query recommendation. Finally, we conclude and offer our view of some research frontiers.

¹ https://developers.facebook.com/docs/sharing/opengraph
² http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago


Figure 2: Instance of Network Schema

2. BASIC CONCEPTS
This paper surveys mining work on the Heterogeneous Information Network. In this section, we give a clear and complete definition of a Heterogeneous Information Network.

Definition 1. Information Network. An information network is a directed graph G = (V, E) with an object type mapping function φ : V → A and a link type mapping function ψ : E → R, where each object v ∈ V belongs to an object type φ(v) ∈ A, and each link e ∈ E belongs to a link type ψ(e) ∈ R.

Definition 2. Heterogeneous Information Network (HIN). An information network is a Heterogeneous Information Network if the number of object types |A| > 1 or the number of link types |R| > 1. Otherwise, the information network is a Homogeneous Information Network.

Definition 3. HIN Schema. Given an HIN G = (V, E) with mappings φ : V → A and ψ : E → R, its schema T_G is a directed graph defined over object types A and link types R, i.e., T_G = (A, R).

Figure 1 illustrates an HIN, which is also a bibliography network. A paper object can link (or be linked) to its authors, a venue, and its related topics. Note that multiple edges of distinct types may exist between two objects. Figure 2 illustrates an instance of the network schema for a publication bibliography: there are four bi-directed relations among four different node types.
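To make Definitions 1 and 3 concrete, the following is a minimal sketch of an information network with its two type mappings, from which a schema can be derived. The class name, method names, and the toy bibliography objects are illustrative assumptions, not part of any cited system.

```python
from collections import namedtuple

# A typed link: source object, target object, and its link type psi(e).
Link = namedtuple("Link", ["src", "dst", "link_type"])


class InformationNetwork:
    """Directed graph G = (V, E) with type mappings phi and psi."""

    def __init__(self):
        self.object_type = {}  # phi: object -> object type in A
        self.links = []        # E, each link carrying its type psi(e)

    def add_object(self, obj, obj_type):
        self.object_type[obj] = obj_type

    def add_link(self, src, dst, link_type):
        # Both endpoints must already be typed objects.
        if src not in self.object_type or dst not in self.object_type:
            raise KeyError("add both endpoints before linking them")
        self.links.append(Link(src, dst, link_type))

    def object_types(self):
        return set(self.object_type.values())

    def link_types(self):
        return {link.link_type for link in self.links}

    def schema(self):
        """T_G as a set of (source type, link type, target type) edges."""
        return {
            (self.object_type[l.src], l.link_type, self.object_type[l.dst])
            for l in self.links
        }


# Hypothetical toy bibliography network.
g = InformationNetwork()
g.add_object("a1", "Author")
g.add_object("p1", "Paper")
g.add_object("v1", "Venue")
g.add_link("a1", "p1", "writes")
g.add_link("p1", "v1", "published_in")
```

Here `g.object_types()` and `g.link_types()` give the sets A and R, and `g.schema()` collapses every typed link onto the types of its endpoints.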

3. RANKING AND CLUSTERING
Ranking and clustering are fundamental tasks in data mining [5]. Clustering partitions a set of data objects into clusters such that objects in a cluster are similar to one another, yet dissimilar to objects in other clusters. Traditional clustering, such as k-means [13], is based on the features of objects. There is also work on clustering networked data (e.g., community detection [23]).

Many works have adapted traditional methods for the Homogeneous Information Network to the Heterogeneous Information Network. GNetMine [18] was proposed to model the link structure in information networks with arbitrary network schema and an arbitrary number of object/link types. Recently, Luo et al. proposed HetPathMine [21] to cluster with little labeled data on an HIN through a novel meta-path selection model, and Jacob et al. [12] proposed a method to label nodes of different types by computing a latent representation of nodes in a space where two connected nodes tend to have close latent representations. Some works also extend inductive classification, which constructs a decision function over the whole data space.

Figure 3: Framework of RankClus

Ranking is an important data mining task in network analysis, which evaluates object importance or popularity based on some ranking function. Many ranking methods have been proposed for homogeneous networks. For example, PageRank [24] evaluates the importance of objects through a random walk process, and HITS [16] ranks objects using authority and hub scores. These approaches consider only objects of the same type in homogeneous networks. PopRank [?] aims at ranking the popularity of web objects. It considers the different roles of different web pages, and thus turns web pages into a heterogeneous network.

3.1 RankClus Algorithm
A raw Heterogeneous Information Network is complex and large-scale, and the links between its nodes are tangled, so the network is often difficult to understand directly. To address this problem, [29] proposed a novel clustering framework called RankClus, which directly generates clusters integrated with ranking.

After initialization, the algorithm can be divided into four parts, as listed in Table 1.

Table 1: RankClus Algorithm Steps
Step 0   Initialization
Step 1   Ranking
Step 2   Generating new measure space
Step 3   Adjusting clusters
Step 4   Repeat Steps 1-3 until stable

3.2 Bi-type HIN RankClus Case Study
To better understand the algorithm, we introduce a case study in this section. The case is based on a bi-type network, namely a conference-author network. Links can exist between a conference (X) and an author (Y), and between an author (Y) and another author (Y). We also define a link matrix W_XY. The information network can be denoted as G = <{X ∪ Y}, W_XY>.

Step 0: Initialization

The initial clusters for target objects are generated by randomly assigning each target object a cluster label from 1 to K.

Step 1: Ranking

Based on the current clusters, K cluster-induced networks are generated, and the conditional rank distributions for types Y and X are calculated [29]. In this step, we also need to check whether any cluster is empty, which may be caused by improper initialization. If some cluster is empty, the algorithm needs to restart in order to generate K clusters.

For the ranking calculation, [29] introduced two major ranking methods: Simple Ranking and Authority Ranking.

Simple Ranking is proportional to the degree of each object (e.g., in the bi-type network, the number of publications of an author). It considers only the immediate neighborhood in the network.

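In this case, the simple ranking of conferences and authors is based on the number of publications: it is proportional to the number of papers accepted by a conference or published by an author. The ranking scores for X and Y are:

$$ r_X(x) = \frac{\sum_{j} W_{XY}(x, j)}{\sum_{i} \sum_{j} W_{XY}(i, j)}, \qquad r_Y(y) = \frac{\sum_{i} W_{XY}(i, y)}{\sum_{i} \sum_{j} W_{XY}(i, j)} $$

The time complexity of Simple Ranking is O(|E|), where |E| is the number of links.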
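Simple Ranking as described above amounts to normalized degree counting on the link matrix. The sketch below is one way to write it, assuming the link matrix is stored as a list of rows with conferences as rows and authors as columns; the function name and the toy matrix are hypothetical.

```python
def simple_ranking(w_xy):
    """Return (r_x, r_y) for a link matrix w_xy given as a list of rows.

    w_xy[i][j] is the link weight between conference i and author j
    (for example, the number of papers author j published at conference i).
    """
    total = sum(sum(row) for row in w_xy)
    if total == 0:
        raise ValueError("the network has no links")
    # Conference score: share of all link weight attached to that conference.
    r_x = [sum(row) / total for row in w_xy]
    # Author score: share of all link weight attached to that author.
    r_y = [sum(col) / total for col in zip(*w_xy)]
    return r_x, r_y


# Hypothetical network with 2 conferences and 3 authors.
w = [[2, 1, 0],
     [0, 1, 4]]
r_conf, r_auth = simple_ranking(w)
```

Each link weight is read once for the row sums and once for the column sums, which matches the O(|E|) cost when the matrix is stored sparsely.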

Authority Ranking is an extension of HITS [16] to the weighted bi-type network. The ranking follows three rules:

Rule 1: Highly ranked authors publish many papers in highly ranked conferences.

Rule 2: Highly ranked conferences attract many papers from many highly ranked authors.

Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or with many highly ranked authors. Here the parameter α ∈ [0, 1] determines how much weight is put on each factor.

In Authority Ranking, the ranking scores are propagated iteratively using rules 1 and 2, or rules 2 and 3. In this case study, the authority rankings of X and Y turn out to be the primary eigenvectors of certain symmetric matrices. Because this ranking method considers the impact of the whole network, it should rank better than Simple Ranking.

For Authority Ranking, the time complexity is O(t|E|), where t is the number of iterations and |E| is the number of links in the graph. Note that |E| = O(d|V|) ≪ |V|^2 in a sparse network, where |V| is the total number of objects in the network and d is the average number of links per object.
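The three rules can be read as an iterative score propagation. The following sketch is one such reading, not the reference implementation of [29]: the function names, the dense list-of-rows matrices, the uniform starting vector, and the fixed iteration count are all assumptions made for illustration. The author update mixes a conference term and a co-author term with weight alpha.

```python
def normalize(vec):
    """Scale a non-negative vector so that it sums to 1."""
    total = sum(vec)
    if total == 0:
        raise ValueError("cannot normalize an all-zero ranking")
    return [v / total for v in vec]


def authority_ranking(w_xy, w_yy=None, alpha=1.0, iterations=50):
    """Propagate rank on a bi-type conference-author network.

    w_xy[i][j]: link weight between conference i and author j.
    w_yy[j][l]: co-author weight between authors j and l (optional).
    alpha: weight on the conference term of the author update; with
           alpha = 1.0 the co-author term (rule 3) is switched off.
    """
    m, n = len(w_xy), len(w_xy[0])
    if w_yy is None:
        w_yy = [[0.0] * n for _ in range(n)]
    r_x = [1.0 / m] * m
    r_y = [1.0 / n] * n  # start from a uniform author ranking
    for _ in range(iterations):
        # Rule 2: a conference is ranked by the authors it attracts.
        r_x = normalize([sum(w_xy[i][j] * r_y[j] for j in range(n))
                         for i in range(m)])
        # Rules 1 and 3: an author is ranked by conferences and co-authors.
        r_y = normalize([
            alpha * sum(w_xy[i][j] * r_x[i] for i in range(m))
            + (1 - alpha) * sum(w_yy[j][l] * r_y[l] for l in range(n))
            for j in range(n)
        ])
    return r_x, r_y


# Hypothetical network with 2 conferences and 3 authors.
w = [[2, 1, 0],
     [0, 1, 4]]
r_conf, r_auth = authority_ranking(w)
```

Each iteration touches every link a constant number of times, so t iterations cost O(t|E|) on a sparse representation; the dense loops here trade that efficiency for brevity.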

Step 2: Generate New Measure Space

A naive method is to map each target object directly to a K-dimensional vector by considering the sub-network induced by it.

In [29], the authors instead proposed a mixture model method, in which the links of each target object are considered to be generated under a mixture of the ranking distributions of the clusters.

More specifically, a ranking can be considered as a distribution: r(Y) → p(Y). This distribution can be treated as a mixture model over K component distributions, which are the conditional rank distributions of the attribute type on the K clusters. Based on this, [29] proposed a mixture model in which each target object x_i is mapped into a K-dimensional vector (π_{i,1}, ..., π_{i,K}), where π_{i,k} denotes the coefficient of x_i for component k.

The parameters of this model are estimated with the EM algorithm [6], which maximizes the log-likelihood given all the observed links.
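As a rough illustration of how EM can produce such coefficients, the sketch below estimates the mixture weights of a single target object while holding the K component distributions fixed. This is a deliberately simplified stand-in, not the estimator of [29]: the function name, the per-object treatment, and the toy distributions are assumptions.

```python
def mixture_coefficients(link_weights, components, iterations=50):
    """Estimate the mixture weights pi_k of one target object by EM.

    link_weights[j]: weight of the link from the target object to
                     attribute object j.
    components[k][j]: p_k(j), the rank distribution of attribute objects
                      conditioned on cluster k; held fixed in this sketch.
    """
    k_clusters = len(components)
    if sum(link_weights) == 0:
        raise ValueError("the target object has no links")
    pi = [1.0 / k_clusters] * k_clusters
    for _ in range(iterations):
        new_pi = [0.0] * k_clusters
        for j, weight in enumerate(link_weights):
            # E-step: responsibility of each cluster for a link to object j.
            joint = [pi[k] * components[k][j] for k in range(k_clusters)]
            norm = sum(joint)
            if weight == 0 or norm == 0:
                continue
            for k in range(k_clusters):
                new_pi[k] += weight * joint[k] / norm
        # M-step: the new weights are the normalized total responsibilities.
        assigned = sum(new_pi)
        if assigned == 0:
            raise ValueError("no component explains the observed links")
        pi = [value / assigned for value in new_pi]
    return pi


# Hypothetical rank distributions of 4 authors in 2 clusters.
p = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]]
pi_mixed = mixture_coefficients([3, 0, 1, 0], p)
```

With these toy inputs, three quarters of the link weight goes to authors that only the first cluster ranks, so the object is mapped to a vector that leans toward that cluster.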


Step 3: Adjusting Clusters

In this step, the clusters are adjusted. The center of each cluster is computed in the new measure space obtained from Steps 1 and 2. Each target object is then assigned to the cluster with the nearest center, where distance is measured by 1 − cosine similarity.
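The adjustment step can be sketched as follows, assuming each target object is already represented by its K-dimensional coefficient vector and each cluster center is the mean of its current members. The function names and sample vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def adjust_clusters(points, labels, k):
    """Recompute centers from the current labels, then reassign every point."""
    dim = len(points[0])
    centers = []
    for cluster in range(k):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        if not members:
            raise ValueError("empty cluster: restart with a new initialization")
        centers.append([sum(p[d] for p in members) / len(members)
                        for d in range(dim)])
    # Assign each point to the center at the smallest cosine distance.
    return [min(range(k), key=lambda c: cosine_distance(p, centers[c]))
            for p in points]


# Hypothetical 2-dimensional coefficient vectors for four target objects.
points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.6, 0.4]]
new_labels = adjust_clusters(points, [0, 0, 1, 1], k=2)
```

In this toy run the last object starts in the second cluster but lies closer in angle to the first center, so it moves.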

Step 4: Repeat Steps 1, 2, and 3

Steps 1, 2, and 3 are repeated until the clusters change by only a very small ratio ε or the number of iterations exceeds a predefined value itermNum. In [29], ε = 0 and itermNum = 20.

Figure 4 illustrates the steps and results on the bi-typed information network, and Figure 5 illustrates the result of RankClus on a dataset from DBLP.

Figure 4: Steps of RankClus

Figure 5: Results of RankClus based on DBLP

3.3 Summary
RankClus, which provides a novel framework for ranking-based clustering on the Heterogeneous Information Network, has been highly influential. Since then, many publications have discussed clustering or ranking on the Heterogeneous Information Network.

RankClus can be efficient compared with algorithms that need to calculate pairwise similarity [14]. However, it has some weaknesses: (1) it has not demonstrated the ability to cluster networks with an arbitrary number of types; and (2) the clusters generated by RankClus contain only one type of object. To solve these problems, the authors of [?] proposed NetClus, which can generate net-clusters comprising objects of multiple types, given any star network [30]. In addition, many works [20, 4] aim at clustering objects of different types simultaneously. Given the number of clusters needed for each type of object, the clusters for each type are generated by maximizing some objective function.

4. META-PATH BASED RESEARCH
In this section, we introduce some works based on the meta-path. The meta-path is a concept first introduced in [3] and has since been widely used in exploiting the Heterogeneous Information Network.

Definition 4. A meta path, denoted by P, is essentially a path defined on an HIN schema T_G, with the types of the source object and the target object at the two ends of the path.

Figure 6: Instances of Meta-path in a bibliography network

In this section, we discuss similarity search, meta-path discovery, and some interesting topics, including co-authorship prediction, personalized recommendation, and query recommendation.
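As a small illustration of Definition 4, a meta path can be represented as a sequence of object types and checked against the schema, and the concrete paths that follow it (sequences of objects whose types match the meta path) can be counted by propagating counts one type at a time. The function names, the treatment of links as symmetric neighbor lists, and the toy author-paper data are assumptions made for this sketch.

```python
def is_valid_meta_path(meta_path, schema_edges):
    """True if every consecutive pair of types is an edge of the schema."""
    return all((a, b) in schema_edges
               for a, b in zip(meta_path, meta_path[1:]))


def count_path_instances(meta_path, obj_type, neighbors, source, target):
    """Count paths from source to target whose object types follow meta_path."""
    if obj_type[source] != meta_path[0]:
        return 0
    # counts[o] = number of partial paths that currently end at object o.
    counts = {source: 1}
    for next_type in meta_path[1:]:
        next_counts = {}
        for obj, c in counts.items():
            for nb in neighbors.get(obj, ()):
                if obj_type[nb] == next_type:
                    next_counts[nb] = next_counts.get(nb, 0) + c
        counts = next_counts
    return counts.get(target, 0)


# Hypothetical network: a1 wrote p1 and p2, a2 wrote p1.
schema_edges = {("A", "P"), ("P", "A")}
obj_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P"}
neighbors = {"a1": ["p1", "p2"], "a2": ["p1"],
             "p1": ["a1", "a2"], "p2": ["a1"]}
apa = ["A", "P", "A"]
shared = count_path_instances(apa, obj_type, neighbors, "a1", "a2")
```

Under the meta path A-P-A, the count between two authors is the number of papers they share, and the count from an author back to the same author is that author's number of papers.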

4.1 Meta-path Based Similarity Search
In a bibliographic network, a user may be interested in the (top-k) most similar authors for a given author, or the most similar venues for a given venue. Therefore, it is important to provide a similarity search function. There is much earlier work on similarity search, such as SimRank [14] and Personalized PageRank [15]. However, adopting such measures in heterogeneous networks has significant drawbacks: objects and links of different types carry different semantic meanings, and it does not make sense to mix them when measuring similarity without distinguishing their semantics. To solve these problems, [14] proposed a new similarity measure: PathSim.

PathSim captures the subtlety of peer similarity. The intuition behind it is that two similar peer objects should not only be strongly connected, but also share comparable visibility. As the peer relation should be symmetric, PathSim is defined only on symmetric meta paths. It is easy to see that round-trip meta paths of the form P = (P_l P_l^{-1}) are always symmetric.

Definition 5. Path count (PC): the number of path instances p between x and y following P: s(x, y) = |{p : p ∈ P}|.

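Definition 6. PathSim: given a symmetric meta path P, the PathSim between two objects x and y of the same type is

$$ s(x, y) = \frac{2 \times |\{p_{x \rightsquigarrow y} : p_{x \rightsquigarrow y} \in P\}|}{|\{p_{x \rightsquigarrow x} : p_{x \rightsquigarrow x} \in P\}| + |\{p_{y \rightsquigarrow y} : p_{y \rightsquigarrow y} \in P\}|} $$

where p_{x⇝y} is a path instance between x and y, p_{x⇝x} is a path instance between x and x, and p_{y⇝y} is a path instance between y and y.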


From this definition, s(x, y) is determined by two parts: (1) the connectivity of x and y, defined by the number of paths between them following P; and (2) the balance of their visibility, where the visibility of an object is defined as the number of path instances between the object and itself. The paper then introduces the calculation of PathSim between any two objects of the same type, given a certain meta path, by means of the commuting matrix.

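Definition 7. Commuting matrix: given a meta path P = (A_1 A_2 ... A_l), the commuting matrix for P is M = W_{A_1 A_2} W_{A_2 A_3} ... W_{A_{l-1} A_l}, where W_{A_i A_j} is the adjacency matrix between types A_i and A_j. M(i, j) is the number of path instances between object x_i ∈ A_1 and object y_j ∈ A_l following P.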

Figure 7 compares the results of similarity search under different similarity functions.

Figure 7: Results of different Similarity Search
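The calculation of PathSim from a commuting matrix can be sketched in a few lines. The example below assumes the symmetric meta path A-P-A, whose commuting matrix is the author-paper adjacency matrix multiplied by its transpose; the helper names and the toy adjacency matrix are hypothetical, and dense lists stand in for the sparse matrices a real system would use.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def pathsim_matrix(commuting):
    """PathSim of every pair, from the commuting matrix of a symmetric meta path."""
    n = len(commuting)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Connectivity M(i, j) balanced by the two visibilities M(i, i), M(j, j).
            denom = commuting[i][i] + commuting[j][j]
            if denom > 0:
                sim[i][j] = 2.0 * commuting[i][j] / denom
    return sim


# Hypothetical author-paper adjacency matrix: rows are authors, columns papers.
w_ap = [[1, 1, 0],
        [1, 0, 0],
        [0, 1, 1]]
m_apa = matmul(w_ap, transpose(w_ap))  # commuting matrix of A-P-A
sim = pathsim_matrix(m_apa)
```

In this toy network the first author shares one paper with each of the others, but is more similar to the second author (about 0.67) than to the third (0.5), because the third author has more papers overall and so the visibilities are less balanced.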

While the definition of meta-path-based similarity search is flexible enough to accommodate different queries, it requires expensive computations (matrix multiplications). Materializing all possible meta paths is expensive in both time and space. For example, in the DBLP network, the similarity matrix corresponding to the length-4 meta path APCPA, which identifies similar authors publishing in common venues, is a 710K × 710K matrix. [14] provided a solution: partially materialize the commuting matrices for short meta paths, and concatenate them online to obtain longer ones for a given query. However, we reserve our opinion on that method and do not discuss its details here; we refer the reader to the original paper.
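One way to read the partial materialization idea is sketched below: only the commuting matrix of the half path (for APCPA, the author-by-venue matrix of APC) is stored, and the row of the long matrix that a query needs is obtained online through dot products. This is an illustrative assumption about how the concatenation could be organized, not the procedure of the cited paper; the function name and the toy matrix are hypothetical.

```python
def pathsim_query(m_half, query, top_k=3):
    """Rank the peers of `query` under the round-trip meta path (P_l P_l^-1).

    m_half: materialized commuting matrix of the half path P_l, as rows
            (for APCPA, the author-by-venue matrix of APC).
    Returns up to top_k (score, object index) pairs, best first.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Diagonal of the long commuting matrix: one dot product per object.
    visibility = [dot(row, row) for row in m_half]
    scores = []
    for other, row in enumerate(m_half):
        denom = visibility[query] + visibility[other]
        # Only the query's row of the long matrix is ever computed.
        score = 2.0 * dot(m_half[query], row) / denom if denom else 0.0
        scores.append((score, other))
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return scores[:top_k]


# Hypothetical APC matrix: 4 authors (rows) by 2 venues (columns).
m_apc = [[2, 0],
         [1, 0],
         [0, 3],
         [1, 1]]
top = pathsim_query(m_apc, query=0)
```

The full author-by-author matrix is never built; its storage cost is replaced by one pass over the much narrower half-path matrix per query.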

4.2 Link Prediction
Link prediction is a popular topic in social network analysis. Early work mainly studied unsupervised methods [1, 19]; later, several studies proposed supervised methods that combine different features with different coefficients learned from training data [2, 32]. Intuitively, leveraging the rich content of the Heterogeneous Information Network for link prediction should lead to interesting results.

Case: Co-authorship Prediction

In this case, the target relation for prediction is the co-authorship relation, which can be described by the meta-path A-P-A. For the topological features, [28] studies all the meta-paths listed in Table 5.1 other than A-P-A and all the measures listed in the last section. [28] introduced a relationship prediction model that models the probability of co-authorship between two authors as a function of the topological features between them. Given the training pairs of authors, the topological features are first extracted for them, and then a prediction model is built to learn the weights associated with these features.

To predict whether two authors will collaborate in a future interval, denoted as y, [28] used logistic regression as the prediction model. For each training pair of authors <a_{i1}, a_{i2}>, let x_i be the (d+1)-dimensional vector consisting of the constant 1 and the d topological features between them, and let y_i be the label indicating whether they will be co-authors in the future (y_i = 1 if they will be co-authors, and 0 otherwise), which follows a Bernoulli distribution with probability p_i (P(y_i = 1) = p_i). The probability p_i is modeled as follows:

$$ p_i = \frac{e^{x_i \beta}}{e^{x_i \beta} + 1} $$

where β is the vector of d + 1 coefficient weights associated with the constant and each topological feature.
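A minimal sketch of such a logistic model follows, assuming the topological features of each author pair have already been extracted into a plain list of numbers. The function names, the gradient-ascent fitting loop, its learning rate, and the single-feature toy data are illustrative choices, not details taken from [28].

```python
import math


def coauthor_probability(features, beta):
    """Logistic model: p = 1 / (1 + exp(-x . beta)), with x = (1, features...)."""
    x = [1.0] + list(features)
    if len(x) != len(beta):
        raise ValueError("beta needs one weight per feature plus the constant")
    z = sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(samples, labels, learning_rate=0.1, epochs=2000):
    """Fit beta by gradient ascent on the Bernoulli log-likelihood."""
    beta = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(beta)
        for features, y in zip(samples, labels):
            error = y - coauthor_probability(features, beta)
            for idx, xi in enumerate([1.0] + list(features)):
                gradient[idx] += error * xi
        beta = [b + learning_rate * g / len(samples)
                for b, g in zip(beta, gradient)]
    return beta


# Hypothetical training pairs with one feature each (e.g. a path count).
samples = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]
beta = fit_logistic(samples, labels)
```

After fitting, `coauthor_probability([3.0], beta)` is above 0.5 and `coauthor_probability([0.0], beta)` is below it, so pairs with more connecting paths are predicted as more likely future co-authors.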

4.3 Recommendation
Personalized Entity Recommendation

Most previous studies in personalized recommendation consider only a single relationship type, such as friendships in a social network. In many scenarios, however, the entity recommendation problem arises in a heterogeneous information network environment. Recently, some researchers have become aware of the importance of heterogeneous information for recommendation [27]. The comprehensive information and rich semantics of an HIN make it promising for generating better recommendations.

Figure 8: IMDB movie recommendation

Figure 8 shows such an example in movie recommendation [35].

To take full advantage of the relationship heterogeneity in information networks, the authors of [35] first introduced meta-path-based latent features to represent the connectivity between users and items along different types of paths (leveraging a diffusion process along meta-paths, as shown in Figure 9). They then defined recommendation models at both the global and personalized levels and used Bayesian ranking optimization techniques to estimate the proposed models. Their empirical studies show that these approaches outperform several widely employed or state-of-the-art entity recommendation techniques.

Figure 9: User Preference Diffusion

Query Recommendation

This approach extracts the entities in a query and then uses each entity's shortest meta-path in a knowledge base (which can also be seen as a Heterogeneous Information Network) to recommend peer entities [10].

4.4 Meta Structure
When using HIN information for relevance computation, most work utilizes only simple structures, such as paths, to measure the similarity between objects (e.g., PathSim [3]). [11] proposes to use a meta structure, which is a directed acyclic graph of object types (see Figure 10) with edge types connecting them, to measure the proximity between objects. The strength of a meta structure is that it can describe a complex relationship between two HIN objects (e.g., two papers in DBLP share the same authors and topics).

Figure 10: Meta-path, and Meta-Structure

4.5 Summary
In this section, we studied research and some interesting applications based on the meta-path. We also introduced another model on HINs that better measures relevance or similarity.

5. FRONTIERS
In this section, we discuss some frontiers based on our own understanding. We also summarize some of the future directions suggested in the cited papers.

5.1 Further Similarity Search
Dynamic and Reverse Versions of Search

Current similarity search is mostly conducted with the PathSim method. On the one hand, this method reveals the connections between peer nodes following a meta-path; on the other hand, it may reveal importance from only a single perspective. For example, as a student of Reynold, I might have coauthored only a few papers with him by the time I graduate. If we used SimRank to evaluate similarity on a top-k basis, would that relationship be revealed? Table 2 shows how Reynold's PathSim results differ over time. This indicates that we should also maintain a dynamic version and a reverse version of the search, which may enable further exploration.

Table 2: Reynold's top-5 APA Similarity
No.   All-Year                Recent 5-Year
1     Ben Kao                 Ben Kao
2     Sunil Prabhakar         David Wai-Lok Cheung
3     David Wai-Lok Cheung    Luyi Mo
4     Jinchuan Chen           Matthias Renz
5     Silviu Maniu            Xuan S. Yang

Intelligent querying and semantic search in heterogeneous information networks

Real-world data are interconnected, forming gigantic and complex heterogeneous information networks, which poses new challenges for querying and searching such networks intelligently and efficiently. Given the enormous size and complexity of a large network, a user is often interested in only a small portion of the objects and links, those most relevant to the query. However, because objects are connected and inter-dependent, searching effectively in a large network for a given user's query can be a challenge. Similarity search that returns the most similar objects to a queried object, as studied in [14] and its follow-up [26], will serve as a basic function for semantic search in heterogeneous networks. Such similarity search may lead to useful applications, such as product search in e-commerce networks and patent search in patent networks. Search functions should be further enhanced and integrated with many other functions. Querying and semantic search in heterogeneous information networks open another interesting frontier for research on mining heterogeneous information networks.

5.2 Refining Heterogeneous Information Networks

Many works on the Heterogeneous Information Network assume that the HIN under investigation has a well-defined network schema and a large set of relatively clean and unambiguous objects and links. However, in the real world, things are more complicated. Although the work in [22] addresses the discovery of meta-paths, the method is more applicable to networks like DBLP, and it is able to detect only a few paths in Wikipedia or Freebase. [33] provides a framework in which users only need to provide a few example instances, and the system automatically detects other meta-paths in the HIN. Since we also pointed out in Section 4.4 that constructing other models is possible and effective in some cases, an open question is how to refine other models, such as the meta structure, in an HIN.

5.3 Heterogeneous Network Embedding
Since the first paper on network embedding [31], many works have started to explore embedding. Although similarity search in HINs has been studied previously, most existing approaches neither explore the rich semantic information embedded in the network structure nor take user preferences as guidance. In [25], the authors reexamine similarity search in HINs and propose a novel embedding-based framework, which models vertices as low-dimensional vectors to explore network-structure-embedded similarity.

6. CONCLUSION
In this paper, we discussed the Heterogeneous Information Network, which is a more accurate model of the real world and contains rich semantics. We discussed basic mining methods such as ranking and clustering, research based on the meta-path, which is a typical structure on HINs, and some interesting topics. We also examined the meta-path and introduced some new models. Finally, we raised some interesting questions and frontier research directions from our standpoint.

7. REFERENCES[1] L. A. Adamic and E. Adar. Friends and neighbors on

the web. Social networks, 25(3):211–230, 2003.

[2] M. Al Hasan, V. Chaoji, S. Salem, and M. Zaki. Link prediction using supervised learning. In SDM06: workshop on link analysis, counter-terrorism and security, 2006.

[3] X. Bai and L. J. Latecki. Path similarity skeleton graph matching. IEEE transactions on pattern analysis and machine intelligence, 30(7):1282–1292, 2008.

[4] R. Bekkerman, R. El-Yaniv, and A. McCallum. Multi-way distributional clustering via pairwise interactions. In Proceedings of the 22nd international conference on Machine learning, pages 41–48. ACM, 2005.

[5] P. Berkhin. A survey of clustering data mining techniques. In Grouping multidimensional data, pages 25–71. Springer, 2006.

[6] J. A. Bilmes et al. A gentle tutorial of the em algorithm and its application to parameter estimation for gaussian mixture and hidden markov models. International Computer Science Institute, 4(510):126, 1998.

[7] S. Chakrabarti et al. Mining the web: Analysis of hypertext and semi structured data, 2002.

[8] L. Getoor and C. P. Diehl. Link mining: a survey. ACM SIGKDD Explorations Newsletter, 7(2):3–12, 2005.

[9] L. B. Holder and D. J. Cook. Graph-based data mining. Encyclopedia of data warehousing and mining, 2:943–949, 2009.

[10] Z. Huang, B. Cautis, R. Cheng, and Y. Zheng. Kb-enabled query recommendation for long-tail queries. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 2107–2112. ACM, 2016.

[11] Z. Huang, Y. Zheng, R. Cheng, Y. Sun, N. Mamoulis, and X. Li. Meta structure: Computing relevance in large heterogeneous information networks.

[12] Y. Jacob, L. Denoyer, and P. Gallinari. Learning latent representations of nodes for classifying in heterogeneous social networks. In Proceedings of the 7th ACM international conference on Web search and data mining, pages 373–382. ACM, 2014.

[13] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666, 2010.

[14] G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543. ACM, 2002.

[15] G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the 12th international conference on World Wide Web, pages 271–279. ACM, 2003.

[16] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.

[17] T. G. Lewis. Network science: Theory and applications. John Wiley & Sons, 2011.

[18] X. Li, B. Kao, Y. Zheng, and Z. Huang. On transductive classification in heterogeneous information networks. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 811–820. ACM, 2016.

[19] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031, 2007.

[20] B. Long, Z. M. Zhang, X. Wu, and P. S. Yu. Spectral clustering for multi-type relational data. In Proceedings of the 23rd international conference on Machine learning, pages 585–592. ACM, 2006.

[21] C. Luo, R. Guan, Z. Wang, and C. Lin. Hetpathmine: A novel transductive classification algorithm on heterogeneous information networks. In European Conference on Information Retrieval, pages 210–221. Springer, 2014.

[22] C. Meng, R. Cheng, S. Maniu, P. Senellart, and W. Zhang. Discovering meta-paths in large heterogeneous information networks. In Proceedings of the 24th International Conference on World Wide Web, pages 754–764. ACM, 2015.

[23] M. E. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004.

[24] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: bringing order to the web. 1999.

[25] J. Shang, M. Qu, J. Liu, L. M. Kaplan, J. Han, and J. Peng. Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. arXiv preprint arXiv:1610.09769, 2016.

[26] C. Shi, X. Kong, P. S. Yu, S. Xie, and B. Wu. Relevance search in heterogeneous networks. In Proceedings of the 15th International Conference on Extending Database Technology, pages 180–191. ACM, 2012.

[27] C. Shi, Y. Li, J. Zhang, Y. Sun, and P. S. Yu. A survey of heterogeneous information network analysis. arXiv preprint arXiv:1511.04854, 2015.

[28] Y. Sun, R. Barber, M. Gupta, C. C. Aggarwal, and J. Han. Co-author relationship prediction in heterogeneous bibliographic networks. In Advances in Social Networks Analysis and Mining (ASONAM), 2011 International Conference on, pages 121–128. IEEE, 2011.

[29] Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu. Rankclus: integrating clustering with ranking for heterogeneous information network analysis. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, pages 565–576. ACM, 2009.

[30] Y. Sun, Y. Yu, and J. Han. Ranking-based clustering of heterogeneous information networks with star network schema. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 797–806. ACM, 2009.

[31] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. ACM, 2015.

[32] C. Wang, V. Satuluri, and S. Parthasarathy. Local probabilistic models for link prediction. In Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 322–331. IEEE, 2007.

[33] C. Wang, Y. Sun, Y. Song, J. Han, Y. Song, L. Wang, and M. Zhang. Relsim: Relation similarity search in schema-rich heterogeneous information networks.

[34] S. Wasserman and K. Faust. Social network analysis: Methods and applications, volume 8. Cambridge university press, 1994.

[35] X. Yu, X. Ren, Y. Sun, Q. Gu, B. Sturt, U. Khandelwal, B. Norick, and J. Han. Personalized entity recommendation: A heterogeneous information network approach. In Proceedings of the 7th ACM international conference on Web search and data mining, pages 283–292. ACM, 2014.
