Efficient Cohesive Subgraph Search in Big
Attributed Graph Data
by
Lu Chen
Faculty of Science, Engineering and Technology
in fulfilment of the requirements for the degree of
Doctor of Philosophy
at the
SWINBURNE UNIVERSITY OF TECHNOLOGY
December 2018
Acknowledgments
This thesis would not have been possible without the help, support and guidance of
important people in my life.
I would like to show my deepest appreciation to my coordinating supervisor Prof.
Chengfei Liu for his kindness, patience, and guidance. He encouraged and advised me
to pursue a Ph.D., and I will forever be grateful for his great advice. His guidance
greatly inspired and helped me throughout the research and writing of this thesis. I
greatly appreciate having had the chance to work with him. I would like to thank my
associate supervisor Dr. Jianxin Li for his inspiration and assistance. His determina-
tion and willpower affected me greatly, encouraging me to move forward when I got
stuck in research.
I would like to thank Dr. Rui Zhou, Prof. Xiaochun Yang, Assoc. Prof. Bin
Wang, and Assoc. Prof. Zhenying He. They helped me tremendously in learning how
to conduct in-depth research and how to present profound research ideas in clear and
simple language. In particular, Dr. Rui Zhou gave me invaluable comments and help
throughout my Ph.D. study.
I would like to thank and appreciate all members of our research group at Swin-
burne, who are Dr. Saiful Islam, Dr. Tarique Anwar, Dr. Musfique Anwar, Dr. Mehdi
Naseriparsa, Ahmed Alshammari, Afzal Azeem Chowdhary, Aman Abidi, Limeng
Zhang and Xiaofan Li.
I would like to thank my parents and parents-in-law for their unconditional support
and love, and for being with me on important steps of my life. They always stood
by me at any cost whenever I was in tough situations and encouraged me with their
loving spirit.
More importantly, I would like to thank my wife, Ms. Yingxian Zhang. She has
always been encouraging and always had the right words to keep me going when I was
discouraged. I could not have done it without her.
Last but not least, I would like to acknowledge Swinburne University of Technology
for providing funding, financial support, various facilities, and training that enabled
me to finish my Ph.D. research successfully.
Declaration
I, Lu Chen, declare that this thesis titled, “Efficient Cohesive Subgraph Search in Big
Attributed Graph Data” and the work presented in it are my own. I confirm that:
• This work was done wholly or mainly while in candidature for a research degree
at this University.
• Where any part of this thesis has previously been submitted for a degree or
any other qualification at this University or any other institution, this has been
clearly stated.
• Where I have consulted the published work of others, this is always clearly
attributed.
• Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have
made clear exactly what was done by others and what I have contributed myself.
Publications
• Lu Chen, Chengfei Liu, Rui Zhou, Jianxin Li, Xiaochun Yang, Bin Wang.
Maximum Co-located Community Search in Large Scale Social Networks. PVLDB
2018, CORE rank A*. (Chapter 3)
• Lu Chen, Chengfei Liu, Kewen Liao, Jianxin Li, Rui Zhou. Contextual Com-
munity Search over Large Social Networks. ICDE 2019, CORE rank A*. (Chap-
ter 4)
• Lu Chen, Chengfei Liu, Jianxin Li, Xiaochun Yang, Bin Wang, Rui Zhou.
Efficient Batch Processing for Multiple Keyword Queries on Graph Data. CIKM
2016, CORE rank A. (Chapter 5)
• Jianxin Li, Chengfei Liu, Lu Chen, Zhenying He, Amitava Datta, Feng Xia.
iTopic: Influential Topic Discovery from Information Networks via Keyword
Query. WWW 2017, Best Demo Award, CORE rank A*.
• Qiao Tian, Jianxin Li, Lu Chen, Ke Deng, Rong-hua Li, Mark Reynolds,
Chengfei Liu. Evidence-driven dubious decision making in online shopping.
WWWJ 2018, CORE rank A.
Abstract
Previous models for finding cohesive subgraphs have focused on graphs with no
attributes. However, these graphs provide only a partial representation of real graph
data and miss important attributes describing a variety of features of each vertex.
As such, real graph data are better modelled as attributed graphs. Investigations of
cohesive subgraph search in attributed graphs are still at a preliminary stage.
Searching cohesive subgraphs in an attributed graph can discover interesting
communities and find useful information for answering keyword queries. In this
thesis, several cohesive subgraph models considering spatial and textual attributes
are studied, which fit well into various real applications.
Firstly, the problem of k-truss search has been well defined and investigated for
finding highly correlated user groups in social networks. However, no previous study
has considered the constraint of users’ spatial information in k-truss search, which is
denoted as co-located community search in this thesis. Co-located communities can
serve many real applications. To search for the maximum co-located communities efficiently, we
first develop an efficient exact algorithm with several pruning techniques. We further
develop an approximation algorithm with adjustable accuracy guarantees and explore
more effective pruning rules, which can reduce the computational cost significantly.
To further improve real-time efficiency, we also devise a novel quadtree-based index
to support the efficient retrieval of users in a region and optimise the search regions
with regards to the given query region. We verify the performance of our proposed
algorithms and index using five real datasets.
Secondly, we propose a novel parameter-free community model, called the contextual
community, for searching a community in an attributed graph. The proposed model
only requires a query context providing a set of keywords describing the desired com-
munity context, where the returned community is designed to be both structure and
attribute cohesive w.r.t. the query context. We show that both exact and approximate
contextual communities can be searched in polynomial time. The best proposed
exact algorithm bounds the running time by a cubic factor or better using an elegant
parametric maximum flow technique. The proposed 1/3-approximation algorithm
significantly improves the search efficiency. In the experiments, we use six real networks
with ground-truth communities to evaluate the effectiveness of our contextual com-
munity model. Experimental results demonstrate that the proposed model can find
near ground-truth communities. We test both our exact and approximate algorithms
using twelve large real networks to demonstrate the high efficiency of the proposed
algorithms.
Thirdly, answering keyword queries on textual attributed graph data has drawn
a great deal of attention from the database community. However, most graph keyword
search solutions proposed so far primarily focus on a single query setting. We observe
that for a popular keyword query system, the number of keyword queries received
could be substantially large even in a short time interval, and the chance that these
queries share common keywords is quite high. Therefore, answering keyword queries
in batches would significantly enhance the performance of the system. Motivated
by this, this thesis studies efficient batch processing for multiple keyword queries on
graph data. Realising that finding both the optimal query plan for multiple queries
and the optimal query plan for a single keyword query on graph data is computationally
hard, we first propose two heuristic approaches, one maximising keyword overlap
and the other giving preference to keywords with short lists. Then
we devise a cardinality-based cost estimation model that takes both graph data
statistics and search semantics into account. Based on this model, we design an
A*-based algorithm to find the globally optimal execution plan for multiple queries. We evaluate
the proposed model and algorithms on two real datasets and the experimental results
demonstrate their efficacy.
Contents
List of Figures xiii
List of Tables xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Maximum Co-located Communities Search . . . . . . . . . . . 2
1.1.2 Contextual Community Search . . . . . . . . . . . . . . . . . 3
1.1.3 Batch Keyword Query Processing on Graph Data . . . . . . . 6
1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Maximum Co-located Communities Search . . . . . . . . . . . 7
1.2.2 Contextual Community Search . . . . . . . . . . . . . . . . . 8
1.2.3 Batch Keyword Query Processing on Graph Data . . . . . . . 10
1.3 Organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Literature Review 13
2.1 Cohesive Subgraph Models . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 k-truss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Clique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.4 k-core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Community Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Community Discovery in Spatial Attributed Networks . . . . . 21
2.2.2 Community Discovery in Textual Attributed Networks . . . . 22
2.2.3 Community Discovery in Networks without Attributes . . . . 23
2.3 Keyword Search on Graph Data . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Keyword Search Semantics . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Batch Query Processing . . . . . . . . . . . . . . . . . . . . . . 25
3 Maximum Co-located Communities Search 27
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Finding Exact Results . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Baseline Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.2 Efficient (k,d)-MCC Search . . . . . . . . . . . . . . . . . . . 35
3.3.3 Prunings before (k,d)-MCCs Enumeration . . . . . . . . . . . 40
3.4 Finding Spatial Approximate Result . . . . . . . . . . . . . . . . . . 41
3.4.1 How to Approximate (k,d)-MCCs . . . . . . . . . . . . . . . . 41
3.4.2 Spatial Index and Search Bounds . . . . . . . . . . . . . . . . 43
3.4.3 Prunings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.4 Error-bounded Approximation Algorithm . . . . . . . . . . . . 47
3.4.5 Truss Attributed Quadtree Index . . . . . . . . . . . . . . . . 48
3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5.1 Efficiency Evaluation . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.2 Effectiveness Evaluation . . . . . . . . . . . . . . . . . . . . . 58
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4 Contextual Community Search 65
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.2 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.3 Why contextual community search . . . . . . . . . . . . . . . 72
4.2.4 Pre-prunings . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 A Flow Network Based Approach . . . . . . . . . . . . . . . . . . . . 73
4.3.1 Flow Network Preliminaries . . . . . . . . . . . . . . . . . . . 73
4.3.2 Algorithm Framework . . . . . . . . . . . . . . . . . . . . . . 75
4.3.3 Warm-up for Flow Network Construction . . . . . . . . . . . . 76
4.3.4 CC Auxiliary Flow Network . . . . . . . . . . . . . . . . . . . 78
4.3.5 Correctness and Time Complexity . . . . . . . . . . . . . . . . 82
4.4 An Improved Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.1 Optimization Framework . . . . . . . . . . . . . . . . . . . . . 85
4.4.2 Algorithm Correctness . . . . . . . . . . . . . . . . . . . . . . 85
4.4.3 Solving the Subproblem . . . . . . . . . . . . . . . . . . . . . 86
4.4.4 Analysing the Number of Iterations . . . . . . . . . . . . . . . 89
4.5 The Incremental Parametric Maximum Flow . . . . . . . . . . . . . . 90
4.5.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.2 Parametric Flow Framework . . . . . . . . . . . . . . . . . . . 92
4.5.3 CC Parametric Flow Network . . . . . . . . . . . . . . . . . . 93
4.5.4 Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6 Approximation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 95
4.7 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.7.1 Finding Large and Connected CC . . . . . . . . . . . . . . . . 98
4.7.2 State-of-the-art Maximum Flow Algorithms . . . . . . . . . . 98
4.8 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.8.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 99
4.8.2 Effectiveness Evaluation . . . . . . . . . . . . . . . . . . . . . 102
4.8.3 Efficiency Evaluation . . . . . . . . . . . . . . . . . . . . . . . 112
4.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5 Batch Keyword Query Processing on Graph Data 115
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.2 Preliminaries and Problem Definitions . . . . . . . . . . . . . . . . . 117
5.2.1 Keyword Query on Graph Data . . . . . . . . . . . . . . . . . 118
5.2.2 Batched Multiple-Keyword Queries . . . . . . . . . . . . . . . 118
5.3 Heuristic-based Approaches . . . . . . . . . . . . . . . . . . . . . . . 120
5.3.1 A Shortest List Eager Approach . . . . . . . . . . . . . . . . . 120
5.3.2 A Maximal Overlapping Driven Approach . . . . . . . . . . . 121
5.4 Cost Estimation for Query Plans . . . . . . . . . . . . . . . . . . . . 125
5.4.1 Cost of an r-Join . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.4.2 Estimating Cardinality of an r-Join Result . . . . . . . . . . . 126
5.5 Estimation-based Query Plans . . . . . . . . . . . . . . . . . . . . . . 127
5.5.1 Finding Optimal Solution based on Estimated Cost . . . . . . 128
5.5.2 Reducing Search Space . . . . . . . . . . . . . . . . . . . . . . 129
5.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.6.1 Datasets and Tested Queries . . . . . . . . . . . . . . . . . . . 131
5.6.2 Evaluation of the Efficiency . . . . . . . . . . . . . . . . . . . 132
5.6.3 Evaluation of Effectiveness . . . . . . . . . . . . . . . . . . . . 136
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6 Conclusion and Future Work 141
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Bibliography 145
List of Figures
3-1 Spatial attributed graph . . . . . . . . . . . . . . . . . . . . . . . . . 29
3-2 Rectangular regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3-3 TQ-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3-4 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3-5 Effect of k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3-6 Effect of d . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3-7 Effect of search granularity . . . . . . . . . . . . . . . . . . . . . . . . 57
3-8 Exact algorithm pruning effectiveness . . . . . . . . . . . . . . . . . . 58
3-9 Effectiveness of pruning rules . . . . . . . . . . . . . . . . . . . . . . 59
3-10 Region pruning ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3-11 Approximation ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3-12 Density study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3-13 Case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4-1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4-2 Warm-up flow network illustrations . . . . . . . . . . . . . . . . . . . 76
4-3 F1 scores for facebook . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4-4 Effectiveness evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4-5 Attributed networks with no ground-truth . . . . . . . . . . . . . . . 105
4-6 Sensitivity w.r.t. query attribute size . . . . . . . . . . . . . . . . . . 105
4-7 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4-8 Scalability cont. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4-9 Effect of |Q| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4-10 Effect of |Q| cont. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5-1 An example graph G and the answer subgraphs to q1 in the subgraph
G′ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5-2 Query plans for single queries q1, q2, and batch multiple queries {q1, q2} 119
5-3 An example of processes in the algorithm Overlap . . . . . . . . . . 123
5-4 Scalability and speedup studies . . . . . . . . . . . . . . . . . . . . . 133
5-5 Efficiency of multiple queries . . . . . . . . . . . . . . . . . . . . . . . 134
5-6 Accuracy of cardinality estimation . . . . . . . . . . . . . . . . . . . . 137
5-7 Pruning effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
List of Tables
3.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Maximal cliques contained in Figure 3-1(c) . . . . . . . . . . . . . . . 34
3.3 Enumeration trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Truss ids and union of truss-to-vertex descriptions . . . . . . . . . . . 48
3.5 Description files for a branch . . . . . . . . . . . . . . . . . . . . . . 49
3.6 Implemented algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.7 Statistic information in datasets . . . . . . . . . . . . . . . . . . . . . 51
3.8 Parameter settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.9 TQ-tree construction . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Statistic information of datasets . . . . . . . . . . . . . . . . . . . . . 100
4.2 Implemented algorithm for different community models . . . . . . . . 101
5.1 Keyword sets for DBLP and IMDB . . . . . . . . . . . . . . . . . . . 132
Chapter 1
Introduction
A cohesive subgraph, one of the fundamental components of a graph, is a group of
densely connected vertices that can be used to discover meaningful information from
graph data. The best-known applications of finding cohesive subgraphs are community
search or detection, and keyword search. Models for finding cohesive subgraphs
have previously focused on graphs with no attributes. However, these graphs provide
only a partial representation of real graph data and miss important attributes
describing a variety of features of each vertex, which means that real graph data are
better modelled as attributed graphs. Although plenty of cohesive subgraph models
have been proposed and extensively studied, effective cohesive subgraph models
considering various graph attributes remain to be studied in earnest.
In this thesis, we explore three cohesive subgraph models: (1) k-truss; (2) density;
and (3) r-cliques, each combined with one of two types of attributes: (1) textual
attributes; and (2) spatial attributes. This leads to three interesting yet challenging
problems on attributed graph data: (1) discovering communities with cohesiveness
in both structure and geo-location; (2) searching communities with cohesiveness in
both structure and keyword attributes; and (3) efficiently finding answers to keyword
queries on attributed graph data.
Section 1.1 presents the background, motivations, research gaps to be filled, and the
main ideas of our approaches to the problems introduced above. Section 1.2 presents
the principal contributions for each of the problems studied in this thesis. The
organisation of the thesis is presented in Section 1.3.
1.1 Motivation
1.1.1 Maximum Co-located Communities Search
With the increasing popularity of online social networks, one of the most important
tasks in social network data analytics is to find communities of users with close
structural connections to each other. The extensive studies on finding communities can
be categorised into global community detection GCD (e.g., [37, 39, 44, 78, 33, 17]),
local community detection LCD (e.g., [28, 103]), global community search GCS (e.g.,
[70, 68, 80]), and local community search LCS (e.g., [35, 93, 28, 27, 51, 34, 108,
50]). Community detection methods are often used to discover communities in social
networks based on predefined implicit criteria, e.g., modularity [37]. The main
difference between GCD and LCD is that every user is equally important in GCD,
while the importance of a user depends on its relevance to the given query vertex in
LCD. Different from community detection, community search methods concentrate
on finding communities in social networks based on user-specified explicit criteria,
e.g., the parameter k in the k-core based model [89], the k-truss based model [26],
and the k-edge-connected component based model [12]. Analogously, the major
difference between GCS and LCS is that LCS requires the communities to contain
the given query vertex, whereas GCS has no such additional requirement. However,
most of the works above did not consider the
effect of users’ spatial information in their community detection or search methods.
Searching communities with social and spatial cohesiveness is of great impor-
tance in many applications, e.g., event scheduling, product recommendation, targeted
advertisement, local activism and advocacy, as well as more effective content spread-
ing like shop promotions, local news, and job openings. Although spatial features are
highly desirable in applications, in practice, existing studies on spatial social
communities are still limited. In [34], Fang et al. require all the vertices of a returned
k-core community to lie in a minimum covering circle with the smallest radius, and
the resultant community must contain the given query vertex. So it is a type of LCS with a
spatial constraint. In [33, 17], Expert et al. and Chen et al. take into account spatial
information in the process of community detection by weighting links based on the
spatial distance between two linked users. This is a type of GCD with a spatial constraint.
However, the two types of work cannot guarantee the spatial closeness of the commu-
nity members, which will be further discussed in our experiments. In [109], Zhang et
al. require all the vertices of a returned k-core based community to meet similarity
constraints, where the similarity could be distance similarity. However, finding the
exact result for this community model is expensive in large-scale social networks due
to its NP-hardness.
Therefore, in this thesis, we investigate the co-located community search problem
that reveals the maximum communities with high social and spatial cohesiveness,
denoted as (k,d)-MCCs search. The social cohesiveness is defined using the minimum
truss value k [26] and the spatial cohesiveness is parameterised by a user-specified
distance value d. As such, our proposed (k,d)-MCCs search problem allows users
to easily control the quality of the resultant communities, which also fills the research
gap of GCS with a spatial constraint.
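To make the two cohesiveness requirements concrete, the following sketch verifies whether a candidate subgraph satisfies both the truss constraint k and the pairwise distance threshold d. It is illustrative only: the vertex coordinates, the Euclidean distance measure, and the function names are assumptions, and Chapter 3 develops far more efficient algorithms than this brute-force test.

```python
import itertools
import math

def is_k_truss(vertices, edges, k):
    """Social cohesiveness: every edge of the subgraph must be
    contained in at least k-2 triangles (the k-truss condition)."""
    vs = set(vertices)
    es = {frozenset(e) for e in edges if set(e) <= vs}
    adj = {v: set() for v in vs}
    for u, w in map(tuple, es):
        adj[u].add(w)
        adj[w].add(u)
    return all(len(adj[u] & adj[w]) >= k - 2 for u, w in map(tuple, es))

def is_d_close(coords, d):
    """Spatial cohesiveness: every pair of members must be within
    distance d of each other."""
    return all(math.dist(coords[u], coords[v]) <= d
               for u, v in itertools.combinations(coords, 2))
```

A candidate subgraph qualifies as a (k,d) co-located community only when both checks pass; the maximum such subgraphs are the (k,d)-MCCs.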
1.1.2 Contextual Community Search
As real social network data are complicated, i.e., users are generally profiled
with attributes, a social network is better modelled as an attributed graph in which
vertices are attached with descriptive attributes, such as keywords describing various
user properties. For this reason, searching communities in attributed graphs has
become popular with the invention of new community models, queries and search
algorithms. Since an attributed community search method does not need to explore
all vertices [35, 53], reducing the search space by orders of magnitude, it makes online
community search practicable, which is ideal for many applications. However, most
existing attributed community search methods do not support community search
given only the context information.
Often, an application or user would simply like to search for the community that is
most relevant to its provided context information, without knowing what the community
looks like or who the community members actually are. In this thesis, we propose a
novel parameter-free community model, namely the contextual community that only
requires a query to provide a set of keywords describing an application/user context
(e.g. location and preference). As such, users can search for desired communities without
detailed information about them. This is in contrast to existing community search models
where additionally a set of known query vertices as well as community cohesiveness
parameters (e.g. k as the minimum vertex degree) are required. But still, the returned
contextual community is designed to be both structure and context cohesive.
Structure cohesiveness. Given the query context, the most popular cohesive mea-
surements like k-core and k-truss are often unsuitable. On one hand, there exists an
inherent contradiction that a larger k value may imply a smaller community to be
found that is potentially insufficient to match the provided context. On the other
hand, while considering more about the context match, vertices (edges) may very
likely fail to meet the minimum number requirement of neighbours (common neigh-
bours) of k-core (k-truss). Moreover, imposing the k bound on the community can
be deemed to be inflexible and deciding the best k that satisfies both context and
structure requirements is challenging. Therefore, for seeking a proper contextual com-
munity we instead adopt the notion of relative subgraph density that is parameter-free
and relative in the number of considered motifs/units and the number of their induced
vertices. The search goal is then reduced to finding a subgraph having the highest
density. However, as shown in [97] if the considered motifs or community signatures
are simply edges, the found community might be large and not absolutely cohesive.
On the other hand, [97] shows that triangle motifs would be better signatures to
produce a truly cohesive subgraph, but in this case the size of the returned community
might drop dramatically, thereby reducing the desirable vertex coverage. To alleviate
such shortcomings, we instead account for both involved triangle and edge motifs as a
unified density measure to suitably explain the structure cohesiveness of a contextual
community.
Context cohesiveness. As discussed previously, in real applications, it would be
desirable that, by simply accepting a set of keywords about the desired community
context, a community search system is able to find a community that is highly relevant
to the provided context. Intuitively, this translates to the idea that vertices with
attributes close to the context shall be considered as members of the desired community
in an attributed graph. However, overemphasising the correlation between found vertex attributes and
the query context may cause the search to result in a small and loosely connected
subgraph. This is actually against the structural requirement of being a community,
and instead corresponds to another popular research topic, keyword search [9, 47, 57, 66, 81, 58].
To avoid such a pessimistic situation, we can relax context cohesiveness by tolerating
community vertices that are themselves less relevant to the query context but in-
stead their surrounding vertices are more relevant. As shown in Section 4.2.2, this
relaxation is naturally achieved with triangle and edge (the aforementioned subgraph
density motif) contextual scores/weights aggregated from nodal contextual relevance.
Notice that our weighted motif (triangle and edge) measure ensures relaxed but strong
internal context cohesiveness in a community since all the involved units are matched
against the query context.
Contextual community. Based on the desiderata of contextual community search,
we propose a weighted density based contextual community model. First, a contex-
tual weight is assigned to each small motif of a subgraph. It measures the context
prevalence of a motif. Then, the context-weighted density of a subgraph is calculated
as the aggregated weight over all motifs divided by the total number of vertices
in the subgraph. Finally, the subgraph with the highest contextual weighted density
w.r.t. the query context is returned as the best community. Notice that the intuition
behind our contextual community model is: every designated community member
should be involved in many structurally overlapped edge and triangle motifs which
are themselves prevalent in the specified query context. In real life, these units are
analogous to mutual friendships and family circles.
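The computation above can be sketched as follows, where the weight of a motif is taken to be the sum of its members' contextual relevance scores. This weighting is an assumption made for illustration; the precise motif weighting used by the model is defined in Section 4.2.2.

```python
import itertools

def contextual_density(vertices, edges, relevance):
    """Context-weighted density of a subgraph: aggregated edge- and
    triangle-motif weights divided by the number of vertices."""
    vs = set(vertices)
    es = {frozenset(e) for e in edges if set(e) <= vs}
    adj = {v: set() for v in vs}
    for u, w in map(tuple, es):
        adj[u].add(w)
        adj[w].add(u)
    # each motif's weight aggregates the contextual relevance of its members
    edge_w = sum(relevance[u] + relevance[w] for u, w in map(tuple, es))
    tri_w = sum(relevance[a] + relevance[b] + relevance[c]
                for a, b, c in itertools.combinations(sorted(vs), 3)
                if b in adj[a] and c in adj[a] and c in adj[b])
    return (edge_w + tri_w) / len(vs)
```

The contextual community is then the subgraph maximising this quantity: vertices irrelevant to the query context contribute little motif weight while still inflating the denominator, so they are naturally excluded.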
Advantages and applications. There are a number of advantages which contex-
tual community search can offer. First, contextual community search significantly
simplifies the search parameter space. As such, users do not need to commit to any
cohesiveness parameter in their search. Second, contextual community search only
explores parts of the social network that are relevant to the query context. This
significantly reduces the solution search space and makes online community search
possible. Also note that communities found by contextual community search can be
multifarious. For instance, if a query context covers most attributes of a ground truth
community, then the contextual search finds a near-ground-truth result. On the other
hand, in the case when a query context covers attributes across multiple ground truth
communities, the search returns a community that cannot be found by most exist-
ing methods. These situations are also evident from our experimental findings. For
this kind of flexibility, contextual community search is suitable for broader applica-
tions including the existing ones such as event scheduling, product recommendation,
targeted advertisement, activism and advocacy that leverage the query context.
1.1.3 Batch Keyword Query Processing on Graph Data
Keyword search has been extensively studied in the field of databases, e.g., relational
databases [2, 49], XML databases [74, 105], and graph databases [9, 47, 49, 57, 58, 66, 67, 81],
as well as spatial databases [13, 29]. However, all the above existing work focuses on
single query processing in keyword search. They design their algorithms and indices
based on the performance of answering single keyword queries. Such single-query-based
techniques are insufficient for supporting real query processing systems, for
several reasons. Normally, a query processing system should support multiple types
of users. For example, beyond general users, a third-party company as an important
data consumer may perform significant analysis and mining of the underlying data in
order to optimise their business by issuing a group of queries as a batch query. Here,
the third-party company may be an industry sector collecting data of interest
from online databases, or a researcher comparing scientific results from scientific
databases. In all these cases, the batch of queries issued by the third-party company
is used to mine information from the databases and to optimise their business
or targeted benefits. It is also important that such a query processing system is
designed with the goal of returning results in fractions of a second for a large number of
queries to be received in a very short time. Recently, domain-specific search engines
have been widely used as they provide specific and profound information that well
satisfies users’ search intentions. Usually, the underlying data is highly structured,
and in most cases, is represented as attributed graphs. We observe that for a popular
domain-specific search engine, the number of keyword queries received could be sub-
stantially large even in a short time interval, and the chance that these queries share
common keywords is quite high. Therefore, answering keyword queries in batches
would significantly enhance the performance of domain-specific search engines. How-
ever, most graph keyword search solutions proposed so far primarily focus on a single
query setting.
In this thesis, we study the problem of batch processing of keyword queries on
graph data. Our goal is to process a set of keyword queries as a batch while minimising
the total time cost of answering the set of queries.
1.2 Contribution
1.2.1 Maximum Co-located Communities Search
Given a social network G and two parameters k and d, a straightforward approach
is to enumerate all possible subgraphs in G meeting the minimum truss value k, where
the number of subgraphs could be as large as O(2^n). It then filters out candidates
containing a node pair whose distance exceeds the spatial closeness threshold d. The
time complexity of this approach is therefore at least O(2^n), where n is the number of
vertices in G. Obviously, it is infeasible to use this approach to support online
(k,d)-MCCs search, particularly for large-scale social networks. Thus, we propose
efficient algorithms that achieve real-time response with theoretical guarantees.
To address the challenge of efficiency, we first develop an exact (k,d)-MCCs search
algorithm by proposing novel pruning techniques. During the search, we explore
techniques that significantly prune the search space by considering upper-bound-based
early termination, a heuristic search order, and conditions for reusing pruning
computation. Before searching, we also propose pre-pruning techniques for reducing the
magnitude of the input data. To design polynomial algorithms, we develop a novel
approximation scheme with spatial accuracy guarantees. Note that our proposed
approximation scheme can provide adjustable spatial error ratios based on the user's
requirement on spatial accuracy. To further improve the performance of the
approximation algorithm, we propose more pruning techniques and also design a novel
index, the TQ-tree. The main contributions of our work are summarised as follows.
• We propose a novel co-located community model and formally define the (k,d)-
MCCs search problem.
• We develop an efficient exact algorithm for finding (k,d)-MCCs by proposing
effective techniques for pruning before and during the search.
• We also develop a spatial approximation algorithm that offers a variable spatial
error ratio ranging from 2√2 + ε to √2 + ε′. The efficiency of the approximation
algorithm is further improved by proposing more effective pruning techniques
and a novel TQ-tree index.
• We conduct extensive experimental studies on five real datasets to demonstrate
the efficiency and effectiveness of the proposed algorithms.
1.2.2 Contextual Community Search
Whether there exists an efficient algorithm for searching contextual communities was
previously unknown. Although our proposed community model is based on the weighted
densest subgraph, existing exact and approximation algorithms running in polynomial
time only work on separate density functions, i.e., the weighted/unweighted degree
density function or the unweighted triangle density function. For our more complicated
contextual community search, building on the theoretical frameworks of flow networks
and approximation algorithms, we confirm that efficient algorithms do exist, both in
theory and in practice.
More precisely, given a graph G and a set of query attributes Q, our first approach
carefully constructs a flow network N that guides the community search. Together
with binary search probing, this approach runs in total time O(|V(N)|³ log(|V(G)|)),
where V(·) and E(·) denote vertex and edge sets respectively and N is the constructed
flow network. By formulating contextual community search as an optimisation problem,
we then construct a different flow network N with parameters that enable a monotonic
search for the contextual community. Along this second approach, we manage to avoid a
pitfall implementation running in O(|V(G)||V(N)|³) by using an elegant parametric
maximum flow technique. This technique eventually reduces the runtime to O(|V(N)|³)
or better. Note that the aforementioned runtime complexities are worst-case guarantees,
while in practice the runtimes are greatly reduced by taking the query context into
consideration. To achieve further runtime scalability, we also propose a fast
1/3-approximation algorithm. The algorithm runs in time
O(|V(G)| log(|V(G)|) + |E(G)| log(|V(G)|) + |Tri(G)|) with simple degree and triangle
indices. Overall, the main contributions of our work are summarised as follows:
• We propose and study a novel and useful contextual community (CC) search
problem.
• Two network flow based exact algorithms are designed to solve CC search in
polynomial time.
• An approximate solution is proposed and analysed (with an approximation ratio
of 1/3), which significantly improves over the runtimes of the exact algorithms.
• Our empirical studies on real social network datasets demonstrate the superior
effectiveness of CC search methods under different query contexts. Extensive
performance evaluations also reveal the superb practical efficiency of the pro-
posed CC search algorithms.
1.2.3 Batch Keyword Query Processing on Graph Data
Although batch query processing has been studied extensively [91, 84, 55, 49, 24, 63,
11], we observe that none of the existing techniques can be applied to our problem of
batch keyword query processing on graph data. The main reasons are as follows.
(1) Meaningful Result Semantics: r-clique can well define the semantics of keyword
search on graph data, as an r-clique discovers the tightest relations among all the
given keywords in a query [58], but no existing work studies batch query processing
under this meaningful result semantics. (2) Complexity of Optimal Batch Processing:
optimally processing multiple keyword queries in a batch is NP-complete. This is
because each single query corresponds to several query plans, and obviously we cannot
enumerate all possible combinations of single query plans to get the optimal query
plan for multiple queries. (3) Unavailable Historic Query Information: unlike the batch
query processing in [107], we do not assume that the result sizes of subqueries are
known before the queries are actually run, because this kind of historic information
is not always available.
Although we could simply evaluate the batch queries in a pre-defined order and
re-use the intermediate results in subsequent rounds as much as possible, there is
no guarantee that the batch queries would be run optimally. To address this, we first
develop two heuristic approaches that give preference to processing keywords with
small result sizes and that maximise keyword overlaps. Then we devise a
cardinality-estimation cost model that considers graph connectivity and the result
semantics of r-clique. Based on the cost model, we develop an optimal batch query plan
by extending the A* search algorithm. Since A* search in the worst case degenerates to
exhaustive search, which enumerates all possible global plans, we propose pruning
methods that efficiently prune the search space to obtain the model-based optimal
query plan.
We make the following contributions in this work:
• We propose and study a new problem of batch keyword query processing on
native graph data, which is widely used in modern data analytics and
management systems.
• We formalise the proposed problem, which is NP-complete. To address it, we
develop two heuristic solutions by considering the features of batch keyword
query processing.
• To optimally run the batched queries, we devise an estimation-based cost model
to assess the computational cost of possible sub-queries, which is then used to
identify the optimal plan of the batch query evaluation.
• We conduct extensive experiments on the DBLP and IMDB datasets to evaluate the
efficiency of the proposed algorithms and verify the precision of the cost model.
1.3 Organisation
The remainder of this thesis is organised as follows:
• Chapter 2 introduces the related work on cohesive subgraph models, commu-
nity search, spatial cohesiveness models, keyword search and multiple query
processing.
• Chapter 3 presents the co-located community search problem, the corresponding
algorithms and the experimental results.
• Chapter 4 presents the contextual community search problem, the correspond-
ing algorithms and the experimental results.
• Chapter 5 presents the batch keyword query processing problem on attributed
graph data, the corresponding algorithms and the experimental results.
• Chapter 6 concludes our research and provides the possible extension of this
thesis and other unexplored areas as future research direction.
Chapter 2
Literature Review
The cohesive subgraph search problem has been studied extensively. In the context
of cohesive subgraph search in attributed graph data, studies are recent and limited.
Finding cohesive subgraphs in attributed graphs is closely related to research
problems including community finding and keyword search. In this chapter, we
conduct a detailed literature review of works related to the problems studied in
this thesis. We first review cohesive subgraph models in Section 2.1, covering the
popular models and the algorithm development path for each of them. Next, works on
two important applications of cohesive subgraph search, i.e., community discovery and
keyword search, are discussed in detail. Specifically, Section 2.2 presents works on
community discovery, focusing mainly on works that find communities considering
different attributes, while Section 2.3 presents works related to keyword search on
graph data, including works proposing different search semantics as well as works
dealing with batch queries.
2.1 Cohesive Subgraph Models
There are a number of cohesive subgraph models in the literature dedicated to
various scenarios. A cohesive subgraph is defined through predefined cohesiveness
measurements: a subgraph satisfying certain cohesiveness measurements is considered
cohesive. Some popular cohesive subgraph models closely related to this thesis are
introduced as follows.
2.1.1 k-truss
The concept of the k-truss subgraph was proposed by Cohen in [26]. A k-truss is defined
as a non-trivial connected subgraph such that every edge in the subgraph has no fewer
than k-2 common neighbours, where the non-trivial constraint excludes isolated
vertices. A k-truss is maximal if it is not a subgraph of another k-truss.
A k-truss also guarantees that the subgraph remains connected if fewer than
k-2 edges are removed from it. The truss number of an edge is the
largest k such that the edge is in a k-truss.
Truss decomposition computes the truss number for each edge in a graph.
The first truss decomposition algorithm was proposed in [26]. The major idea is that,
for every possible k starting from k = 2, the algorithm iteratively deletes any edge
with no more than k-2 common neighbours in the residual graph until no such edge
remains (at which point all maximal k-trusses have been computed), and then increases
k by 1 and repeats the process until the remaining graph becomes empty. The algorithm
uses a queue to store edges having no more than k-2 common neighbours for the current
k. As k increases, edges with a high number of common neighbours are revisited
repeatedly to check whether they should be moved to the queue, which lowers the
efficiency and makes the time complexity of the algorithm O(∑_{v∈V(G)} deg(v)²).
Observing this shortcoming, Wang et al. [99] first sort the edges in non-descending
order of their number of common neighbours, and then iteratively remove the edge that
has the minimum number of common neighbours and has fewer than k-2 common neighbours
in the remaining graph. After each removal, since each of the edges induced by the
common neighbours of the removed edge loses only one common neighbour, Wang et al.
adopt a bucket-sort-based technique in which each affected edge is moved, in constant
time, to an appropriate position in the sorted order, so that the remaining edges stay
in non-descending order of their number of common neighbours in the remaining graph.
As a result, the time complexity of truss decomposition is improved to O(|E(G)|^{3/2}).
In addition, Wang et al. [99] also study an I/O-efficient truss decomposition
algorithm for the case where the input graph cannot fit in main memory.
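The peeling strategy described above can be made concrete with a short sketch. The Python implementation below is our own illustrative version of support-based edge peeling (the function name and edge-tuple representation are assumptions; it does not reproduce the exact bucket-sort bookkeeping of Wang et al.):

```python
from collections import defaultdict

def truss_decomposition(edges):
    """Support-peeling truss decomposition: returns the truss number of each edge."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # support(e) = number of triangles containing edge e
    support = {}
    for u, v in edges:
        e = (min(u, v), max(u, v))
        support[e] = len(adj[u] & adj[v])
    truss = {}
    k = 2
    while support:
        # peel every edge whose support is at most k - 2
        removable = [e for e, s in support.items() if s <= k - 2]
        while removable:
            u, v = removable.pop()
            e = (min(u, v), max(u, v))
            if e not in support:        # already peeled earlier in this round
                continue
            truss[e] = k
            # each common neighbour w loses the triangle {u, v, w}
            for w in adj[u] & adj[v]:
                for f in ((min(u, w), max(u, w)), (min(v, w), max(v, w))):
                    if f in support:
                        support[f] -= 1
                        if support[f] <= k - 2:
                            removable.append(f)
            del support[e]
            adj[u].discard(v)
            adj[v].discard(u)
        k += 1
    return truss
```

For example, on a triangle with one pendant edge, the three triangle edges obtain truss number 3 while the pendant edge obtains 2.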
Distributed truss decomposition algorithms are studied in [16]. The authors first
design algorithms based on an existing MapReduce triangle listing algorithm. They then
propose an algorithm based on a graph-parallel abstraction, which significantly reduces
I/O overhead and improves performance. In particular, Shao et al. study a distributed
algorithm for maximal k-truss detection, which, unlike truss decomposition, focuses on
finding a maximal k-truss for a specific k. They construct a triangle-complete
subgraph for each computation node in the distributed system, show that each
triangle-complete subgraph can be used to find a local k-truss in parallel, and prove
that the union of the local k-trusses is exactly the global k-truss.
2.1.2 Density
In graph theory, there are several ways to define the density of a subgraph.
Among the various density formulations, the most famous is degree density, which
measures the average degree of the vertices in a subgraph. Recently, triangle density
[97], which measures the average number of triangles the vertices of a subgraph are
involved in, was proposed to find denser subgraphs compared to degree density.
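Concretely, the two density functions can be written in a few lines. The helpers below are our own illustrative sketch (here degree density is defined as |E(S)|/|V(S)|, so the average degree equals twice this value):

```python
from itertools import combinations

def degree_density(vertices, edges):
    """Degree density |E(S)| / |V(S)| of the subgraph induced by `vertices`."""
    vs = set(vertices)
    es = [(u, v) for u, v in edges if u in vs and v in vs]
    return len(es) / len(vs)

def triangle_density(vertices, edges):
    """Triangle density |T(S)| / |V(S)| of the induced subgraph."""
    vs = set(vertices)
    adj = {v: set() for v in vs}
    for u, v in edges:
        if u in vs and v in vs:
            adj[u].add(v)
            adj[v].add(u)
    # count induced triangles by checking every vertex triple
    triangles = sum(1 for a, b, c in combinations(sorted(vs), 3)
                    if b in adj[a] and c in adj[a] and c in adj[b])
    return triangles / len(vs)
```

For a single triangle, the degree density is 1 (average degree 2) and the triangle density is 1/3.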
One of the fundamental problems for graph density is to find a subgraph maximising a
given density function, known as finding a maximum density subgraph. The first
algorithm solving the maximum degree density problem was proposed in [56]. The authors
reduce the maximum degree density problem to a 0-1 non-linear fractional programming
problem. They then adopt the fractional programming solver proposed in [30], in which
a fractional programming problem is solved by finding the optimal results of a set of
problems related to the original fractional programming problem. The algorithm
proposed in [56] is bounded by |V(G)| min s-t cut computations, because the set of
problems related to the fractional programming problem answering the maximum degree
density problem can be reduced to the min s-t cut problem. Later on, Goldberg
discovered that the maximum degree density problem can be reduced to a series of
minimum capacity cut computations by applying flow network techniques. Different from
the algorithm in [56], Goldberg proposes a carefully designed directed flow network
based on the original graph. His algorithm iteratively guesses the optimum density
following a binary search convention. The min s-t cut of the carefully designed
directed flow network guides the direction of the binary search and makes the guess
converge to the optimum density. Once the optimum density is determined, the maximum
degree density subgraph can be derived from the source side S of the last min s-t cut
satisfying S \ {s} ≠ ∅, where s is the source vertex of the flow network. This
algorithm is bounded by log(|V(G)|) min s-t cut computations. The algorithms discussed
above all treat the min s-t cut algorithm as a black box. As the min s-t cut problem
has been studied extensively, the best algorithm, to the best of our knowledge, runs
in O(|V(N)||E(N)|), where N denotes the flow network. Observing that the series of min
s-t cut problems for finding the maximum degree density subgraph are highly
correlated, Gallo et al. [38] name such problems parametric flow network problems and
propose an algorithm that can be used to solve the maximum degree density subgraph
problem. The algorithm has a time complexity of O(|V(N)||E(N)| log(|V(N)|²/|E(N)|)).
Their algorithm for the maximum degree density subgraph problem is based on the
framework used in [30]. The trick is their observation that if the series of min s-t
cut problems is solved using the push-relabel algorithm [41], the cuts can be computed
incrementally. They prove that, by taking advantage of this reuse, the overall time
cost is equivalent to solving a single min s-t cut problem with the push-relabel
algorithm. Solving the maximum degree density problem approximately has also drawn a
great deal of attention. In [15], a 2-approximation scheme is proposed; the algorithm
was later improved to a time complexity of O(|V(G)| + |E(G)|) by Khuller et al. [61].
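The greedy 2-approximation just mentioned admits a compact sketch: repeatedly peel a minimum-degree vertex and remember the densest intermediate subgraph. The version below is our own illustration (the function name and the lazy-deletion heap are assumptions, not the cited authors' code):

```python
import heapq

def densest_subgraph_2approx(vertices, edges):
    """Greedy peeling 2-approximation for maximum degree density."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(vertices)
    m = len(edges)
    heap = [(len(adj[v]), v) for v in alive]
    heapq.heapify(heap)
    best_density, best_size = m / len(alive), len(alive)
    removal_order = []
    while len(alive) > 1:
        # pop a vertex of current minimum degree (stale entries are skipped)
        while True:
            d, v = heapq.heappop(heap)
            if v in alive and d == len(adj[v]):
                break
        alive.discard(v)
        removal_order.append(v)
        m -= len(adj[v])
        for w in adj[v]:
            adj[w].discard(v)
            heapq.heappush(heap, (len(adj[w]), w))
        adj[v] = set()
        density = m / len(alive)
        if density > best_density:
            best_density, best_size = density, len(alive)
    # the best subgraph consists of the vertices not yet removed at that point
    removed = len(vertices) - best_size
    best_vertices = set(vertices) - set(removal_order[:removed])
    return best_vertices, best_density
```

For instance, on a 4-clique with one pendant vertex, the peeling removes the pendant first and reports the 4-clique with density 6/4.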
The discussed techniques were adapted to solve the maximum density subgraph
problem in directed graphs in [61]. Recently, Tsourakakis [97] also showed that the
maximum triangle density subgraph problem can be solved using the discussed frameworks
for the maximum density subgraph problem. Although these frameworks can be applied to
problems with different density functions in [61] and [97], the adaptations are
challenging and non-trivial.
If we add a size constraint to the maximum degree density subgraph problem, the
new problem becomes intractable. The densest at-least-size-l [7], densest
at-most-size-l [7], and densest size-l subgraph problems [36, 60] are all NP-complete,
which implies that the problems of finding maximal and minimal maximum degree density
subgraphs are NP-hard. As such, approximation algorithms running in polynomial time
have been proposed. The densest at-least-size-l subgraph problem has a 3-approximation
algorithm proposed in [7] and a 2-approximation algorithm proposed by Andersen [6].
On the other hand, the densest at-most-size-l and densest size-l subgraph problems are
unlikely to admit error-bounded approximation algorithms running in polynomial time [7].
Most recently, the concept of the locally dense subgraph has drawn attention
[80, 86, 95]. A locally dense subgraph in general combines the degree (triangle)
density with a constraint on the minimum number of edges (triangles) each vertex is
involved in. A subgraph is called locally dense if its degree (triangle) density and
its minimum degree (triangle) constraint both exceed a given threshold.
2.1.3 Clique
2.1.3.1 Cliques in graphs
The clique, as the most cohesive subgraph, was first introduced by Erdős et al. [32],
and the term clique was coined by Luce et al. [76]. Given a graph, a clique is a
complete subgraph in which every pair of vertices is joined by an edge. A clique is
maximal if no super-graph of it is also a complete graph. A clique is maximum if no
other complete subgraph has a larger size. The definition of a clique is strict. Luce
introduced the concept of the r-clique to relax the pairwise structural distance,
defined by the length of the shortest path between two vertices, from 1 to a given
integer r; i.e., an r-clique is a subgraph in which the shortest path between any two
vertices is no greater than r.
The maximal clique enumeration problem has been studied extensively. The most popular
algorithms are based on a backtracking framework proposed in [10]. A major
optimisation used for clique enumeration is proposed in [4]. The optimisation is based
on the observation that the vertices of the maximal cliques containing a vertex must
be neighbours of that vertex. As such, during the enumeration, clique enumeration on a
whole graph can be divided into a set of small, result-disjoint problems, which can be
solved separately. The question of how to divide the problem so that clique
enumeration achieves optimal running time has been answered in [96]. The authors
propose a greedy strategy in which, at each recursion state, they always divide the
current problem into the fewest number of result-disjoint subproblems. They also prove
that this strategy makes the algorithm proposed in [10] run in O(3^{|V(G)|/3}), which
is worst-case optimal, since a graph with |V(G)| vertices can have up to 3^{|V(G)|/3}
maximal cliques.
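The backtracking framework of [10], combined with the greedy pivoting strategy of [96], can be sketched as follows (an illustrative Python version with an adjacency-dictionary interface of our own choosing):

```python
def bron_kerbosch_pivot(adj):
    """Enumerate all maximal cliques via backtracking with pivoting.

    R is the growing clique, P the candidates, X the already-explored vertices.
    """
    cliques = []

    def expand(R, P, X):
        if not P and not X:
            cliques.append(R)          # R cannot be extended: maximal clique
            return
        # pivot u: vertex of P ∪ X covering the most candidates,
        # so that the fewest branches remain
        u = max(P | X, key=lambda w: len(adj[w] & P))
        for v in list(P - adj[u]):     # only branch on non-neighbours of u
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(adj), set())
    return cliques
```

On a triangle {0, 1, 2} with an extra edge (2, 3), the algorithm reports exactly the two maximal cliques {0, 1, 2} and {2, 3}.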
Recently, Wang et al. proposed a new clique enumeration algorithm based on computing
a summary consisting of a set of redundancy-aware maximal cliques. Since computing
redundancy-aware maximal cliques is efficient and these cliques can be extended to the
maximal cliques of the graph, the enumeration algorithm based on them is shown to be
more effective than backtracking-based clique enumeration.
In addition, maximal clique enumeration in sparse graphs is studied in [31], where
Eppstein et al. propose an optimisation that greedily selects vertices of lowest
degeneracy, making clique enumeration in a sparse graph run in O(d(|V(G)| − d)3^{d/3}),
where d is the degeneracy of the graph. Besides, a truly polynomial-delay maximal
clique enumeration algorithm for sparse graphs is proposed by Chang et al. [14].
2.1.3.2 Cliques in unit disk graph
A unit disk graph (UDG) is a set of vertices embedded in 2D space, where any two
vertices are joined by an edge if their Euclidean distance is no greater than a given
distance threshold. A clique in a UDG therefore consists of vertices that all have
pairwise spatial distance no greater than the given threshold.
Interestingly, finding a maximum clique in a UDG is polynomially solvable [1, 25],
for the following major reasons. Firstly, given two vertices joined by an edge in a
UDG, the vertices of a maximum clique containing this edge must be located in the
intersection of the two circles centred at the two vertices with radius equal to the
given distance threshold. Secondly, the subgraph induced by the two vertices and the
vertices in the intersection is the complement of a bipartite graph. Thirdly, finding
a maximum independent set in this bipartite graph is equivalent to finding a maximum
clique in the induced subgraph, and a maximum independent set in a bipartite graph can
be found in polynomial time, by Kőnig's theorem, using the Hopcroft–Karp algorithm
[48]. As such, by trying all edges, a maximum clique in a UDG can be found in
O(|E(G)|) times the complexity of finding a maximum independent set in a bipartite
graph.
Although a maximum clique in a UDG is polynomially solvable, the hardness of finding
all maximal cliques in a UDG is unknown and remains an open problem. Gupta et al. [45]
report that the total number of maximal cliques could grow exponentially with the size
of a UDG and propose polynomial algorithms that generate near-maximal cliques. Exact
maximal clique enumeration for the neighbourhood graph is also studied in [94], where
the algorithm is still based on [10] but uses a new problem division strategy
exploiting geometric properties.
2.1.4 k-core
A maximal connected subgraph in which every vertex has degree at least k is called a
k-core [89]. Regarding the k-core, there are two classic problems known as core
decomposition and core maintenance. Finding the k-cores of a graph for all possible k
is known as core decomposition. The well-known O(|E(G)| + |V(G)|) in-memory core
decomposition algorithm progressively removes vertices of minimum degree while
efficiently maintaining the vertices in non-increasing order of their most recent
degrees [8]. I/O-efficient core decomposition has been studied in [101, 20] for
massive graphs that cannot be held in main memory. When a graph is dynamically
updated, incrementally computing the new core numbers of the affected vertices is
known as core maintenance, which has been studied extensively in [5, 72, 87, 110, 100].
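The peeling idea behind core decomposition can be sketched as below. For clarity, this version rescans the buckets for the minimum degree; the implementation of [8] instead maintains a sorted vertex array to achieve the stated O(|E(G)| + |V(G)|) bound. Names are our own:

```python
def core_decomposition(adj):
    """Peeling core decomposition: returns the core number of each vertex."""
    degree = {v: len(ns) for v, ns in adj.items()}
    # buckets[d] holds the unprocessed vertices whose current degree is d
    max_deg = max(degree.values(), default=0)
    buckets = [set() for _ in range(max_deg + 1)]
    for v, d in degree.items():
        buckets[d].add(v)
    core = {}
    k = 0
    for _ in range(len(adj)):
        # take a vertex of minimum current degree; core numbers never decrease
        d = min(i for i, b in enumerate(buckets) if b)
        k = max(k, d)
        v = buckets[d].pop()
        core[v] = k
        for w in adj[v]:
            if w not in core:          # w is still unprocessed: degree drops by 1
                buckets[degree[w]].discard(w)
                degree[w] -= 1
                buckets[degree[w]].add(w)
    return core
```

On a 4-clique with one pendant vertex, the pendant receives core number 1 and the clique vertices receive 3.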
By removing the maximality constraint of the k-core, we obtain the concept of the
minimum-k-degree-constrained subgraph, denoted the δ(k)-subgraph. The δ(k)-subgraph
was first introduced for community search in [93] and has since been widely used for
modelling communities, e.g., in [70, 71, 69, 109]. Its major advantages are as
follows. Firstly, when modelling communities, the meaning of δ(k) is intuitive to
interpret: every member of the community has at least k friends. Secondly,
δ(k)-subgraphs are easy to compute, which can be done in linear time w.r.t. the size
of the input graph.
In addition to community modelling, the δ(k)-subgraph (or k-core) has nice properties
for other applications as well. Firstly, a clique of size k, if one exists, must be
contained in a δ(k−1)-subgraph. This property can be used to compute an upper bound on
the clique size of a graph, or to prune the graph when finding size-k cliques
[109, 75]. Secondly, the k-core can be used to approximate the densest subgraph by
degree density with an approximation factor of 2 [95].
2.2 Community Discovery
Community discovery, which has been extensively studied, is an important application
of cohesive subgraph search. Most existing community discovery works can be
categorised into community detection and community search. Community detection in
general has no explicit search criteria from the user, whereas community search finds
communities based on certain given criteria. In the following, we first introduce
community discovery works in spatial attributed and textual attributed networks, and
then discuss other community discovery works according to this categorisation.
2.2.1 Community Discovery in Spatial Attributed Networks
Community discovery in spatial attributed networks aims to find communities whose
members are densely connected and spatially close.
Community detection. In [17], Chen et al. modify the objective function of the fast
modularity maximisation algorithm (known as CNM) to make the detected communities
sensitive to spatial distance. More precisely, each edge (u, v) in a graph G is
assigned a distance decay, generating a weighted matrix. Applying the CNM algorithm to
the weighted matrix, they maximise geo-modularity (previously, modularity was
maximised on an adjacency matrix). Such a spatial community model only captures
spatial distance information over socially adjacent vertices. In reality, non-adjacent
vertices in communities detected by this model could be spatially very far from each
other; thus, the spatial feature captured by the model is limited and the quality of
the communities is compromised. In [33], a similar model to [17] is used; however,
rather than maximising modularity, the authors predefine a score for a size-l
community and evaluate the quality of a detected size-l community by the difference
between its real score and the predefined score. Such ranking somewhat mitigates the
shortcoming of the model proposed in [17]. In [44], a two-step method is devised to
cluster vertices, considering both the spatial and structural features of a large
network. First, vertices are clustered under contiguity constraints, resulting in a
spatially contiguous tree. Next, the tree is further partitioned to generate
structurally dense subgraphs.
Community search. Fang et al. [34] propose a community model with three constraints:
1) the community contains the query vertex; 2) all vertices in the community lie
within an enclosing spatial circle; 3) the community is structurally a connected
k-core. Their algorithms are designed to find a community that not only meets the
model constraints but also minimises the diameter of the enclosing circle. The
shortcoming of the model is that the smallest connected k-core containing the query
vertex may be inherently spatially sparse, and the model is not applicable to querying
communities with a set of input vertices.
In this thesis, we consider users' spatial information in k-truss search and propose
a novel community model, named the co-located community. It can unveil maximum-size
communities given certain spatial and social cohesiveness thresholds.
2.2.2 Community Discovery in Textual Attributed Networks
Community discovery in textual attributed networks aims to find communities that
are structurally cohesive and textually correlated.
Community detection. Several works [77, 73] consider the LDA topic model together
with graph structure to detect attributed communities. In [77], the proposed
Pairwise-Link-LDA can be adapted to detect communities in attributed graphs by
replacing directed edges with undirected edges. In [73], Liu et al. propose a refined
LDA model, merging a graphical model into Topic-LDA, which can be used to detect
attributed communities. Unified distance is also considered for detecting attributed
communities. In [111], attributed communities are detected using proposed
structural/attribute clustering methods, in which structural distance is unified via
attribute-weighted edges. Cheng et al. [19] propose a better algorithm for detecting
communities using the method in [111]. Other attributed community detection methods
are as follows. Xu et al. [106] propose a Bayesian-based model. Ruan et al. [85]
propose an attributed community detection method that links edges and content, filters
out edges that are loosely connected from a content perspective, and partitions the
remaining graph into attributed communities. Huang et al. [52] propose a method based
on an entropy-based model [18]. In [22], attributed communities are detected by
finding structurally cohesive subgraphs whose common attributes exceed a threshold and
merging the detected cohesive subgraphs if necessary. Recently, Wu et al. propose an
attributed community model [102] based on an attribute-refined fitness model [62].
Community search. Li et al. [67] propose keyword-based correlated network computation
over large social media. They first find small r-cliques containing the query
keywords, and merge small r-cliques if their mutual similarities are greater than a
threshold. In [53], a community model sensitive to textual information is proposed.
Given a set of keywords, a set of query vertices, an integer k that measures
structural cohesiveness, and an integer d that measures communication cost (with the
same definition as [53]), the model is defined as follows: 1) the community contains
all query vertices; 2) the community is a connected k-truss with query distance no
greater than d; 3) the community is textually most related to the set of keywords.
In [109], the authors find (k, r)-core communities such that, socially, the vertices
form a k-core and, from the similarity perspective, every pairwise vertex similarity
is greater than a threshold r. Recently, Li et al. propose a skyline community model
for searching communities in attributed graphs [69]. Their model intends to find
communities with diversified attributes.
In this thesis, we will propose a contextual community (CC) model. Compared to
existing models, CC is designed to be a general framework. Firstly, CC employs
subgraph density as a parameter-free cohesiveness measure, so that novice users need
not specify k as in k-core, k-truss, etc. Secondly, CC finds multifarious communities.
If the user-provided query context is close to the attributes contained in a
ground-truth community, CC search indeed finds near-ground-truth communities. Further,
if the query covers attributes in multiple ground-truth communities, unlike other
models that prioritise structural cohesiveness first and context match second, which
might return nothing (for a somewhat large k) or community members mostly irrelevant
to the query context (after lowering k), CC simultaneously models structural
cohesiveness with subgraph density and context match with scores/weights when
computing weighted density, so that CC search flexibly finds the community that is
most cohesive relative to the query context.
2.2.3 Community Discovery in Networks without Attributes
Community detection. A number of community detection methods have been
surveyed in [37]. Mancoridis et al. propose a community detection method based
on graph partitioning, whose objective is to maximise the difference between the
internal edge ratio and the inter-cluster edge ratio. In [39, 78], Girvan et al. use
betweenness to detect community structure: they find the edges that lie most between
communities and progressively remove such edges from the remaining graph until no
such edge remains. Rezvani et al. [83] propose a fitness-metric-based objective function
and find communities maximising the fitness metric.
Community search. Li et al. [70] consider the k-clique as the structural cohesiveness
metric and rank the cliques by an outer influence score. In [68], k-core is used to model
the social cohesiveness and an internal influence score is used to rank the communities.
In general, the goal of local community search is to find a community that contains
vertices near a set of query vertices. In [51, 3], maximal triangle-connected k-trusses
containing a query vertex are considered as communities. In [54], a community model
on a query vertex set is defined as follows: (1) the community contains all vertices
in the query vertex set; (2) the community is structurally a connected k-truss;
(3) the longest shortest path from a vertex not in the query vertex set to the query
vertex set is minimised. Cui et al. [28] search for local optimal communities modelled
as connected subgraphs, containing the query vertices and maximising the minimum
degree of their vertices. A k-clique based model is proposed in [108], in which a
community is defined as a maximal k-clique adjacency-connected subgraph, named a
k-clique percolation community. They study the problem of searching for the densest
clique percolation community with the maximum k containing all query vertices.
Adapting the minimal Steiner tree subgraph, Hu et al. [50] propose algorithms for
searching a community that is a minimal connected Steiner tree containing all query
vertices while maximising the cardinality.
2.3 Keyword Search on Graph Data
Keyword search has been extensively studied in the field of databases, especially on
graph data. Existing works can be categorised into: (1) proposing search semantics
while finding results for individual queries, and (2) efficiently answering multiple
queries in batch.
2.3.1 Keyword Search Semantics
The existing approaches aim at finding either Steiner tree based answers or subgraph
based answers. Steiner tree based answers [9, 47, 57] generate trees that cover all
the search keywords, and the weight of a result tree is defined as the total weight of
the edges in the tree. Under this semantics, finding the result tree with the smallest
weight is a well-known NP-complete problem. The graph-based methods generate
subgraphs such as r-radius graph [66], r-community [81] and r-cliques [58]. In an r-
radius graph, there exists a central node that can reach all the nodes containing search
keywords whose distance is less than r. In an r-community, there are some centre
nodes. There exists at least one path between each centre node and each content
node such that the distance is less than r. Different from r-radius and r-community,
the r-clique semantics studied in this thesis is more compact and does not require
the existence of a central node. It refers to a set of graph nodes which contain the
search keywords such that, between any two nodes that contain keywords, we can
find a path with a distance less than r.
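As a minimal illustration (a sketch, not code from the cited works), the r-clique condition reduces to a pairwise distance check; `dist` is an assumed shortest-path distance function supplied by the caller:

```python
# Illustrative r-clique check: a set of keyword-containing nodes is an
# r-clique if every pair of them is within distance r (no central node
# is required). `is_r_clique` and `dist` are hypothetical names.
from itertools import combinations

def is_r_clique(nodes, dist, r):
    """True iff every pair of content nodes is connected within distance < r."""
    return all(dist(u, v) < r for u, v in combinations(nodes, 2))

# Toy pairwise distances between three keyword nodes:
d = {frozenset({'a', 'b'}): 2, frozenset({'b', 'c'}): 3, frozenset({'a', 'c'}): 4}
dist = lambda u, v: d[frozenset({u, v})]
print(is_r_clique(['a', 'b', 'c'], dist, r=5))  # -> True
print(is_r_clique(['a', 'b', 'c'], dist, r=4))  # -> False, dist(a, c) = 4 is not < 4
```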
Nevertheless, all of the existing works focus on single query processing rather than
multiple query processing. Although the indices proposed in [47, 58, 66, 81] can be
used for more than one query, the evaluation techniques are exclusively designed for
one query at a time. In other words, none of the existing works has studied utilising
possibly reusable computations over multiple keyword queries and leveraging shared
computations to improve the performance of multiple keyword query evaluation at
run time. In this thesis, we study these problems.
2.3.2 Batch Query Processing
On XML data, Yao et al. [107] propose a log-based query evaluation algorithm to find
the optimal plan to compute multiple keyword queries under SLCA semantics [105].
Recently, multiple keyword query optimisation over relational databases (rather than
native graphs) has also been studied [55]. This work assumes all the keyword queries
have been transformed into candidate networks (which are similar to SQL query
plans), and multiple SQL query optimisation techniques are then used, i.e., common
SQL query operations (or subqueries) in the candidate networks are considered.
Multiple query optimisation in databases. On relational data, multiple SQL
query optimisation has been studied in the early works [84, 91, 92], whose main focus
is to smartly handle shared operations among SQL queries. These works decompose
complex SQL queries into subqueries and consider reusing common subqueries based
on cost analysis. Recently, Kathuria et al. [59] propose an approximation algorithm
within the Volcano optimisation framework [43], solving the multiple SQL query
optimisation problem with theoretical guarantees.
Different from the above works, our problem focuses on native graph data, where
data does not have to be stored in relational tables and query results are modelled
using the r-clique semantics. The solutions of the previous works cannot be applied
to our problem, because both the data model and the query semantics are different.
Chapter 3
Maximum Co-located
Communities Search
The problem of k-truss search has been well defined and investigated for finding
highly correlated user groups in social networks. However, no previous study has
considered the constraint of users' spatial information in k-truss search, denoted as
co-located community search in this chapter. Co-located communities can serve many
real applications. To search for the maximum co-located communities efficiently, we
first develop an efficient exact algorithm with several pruning techniques. After that,
we further develop an approximation algorithm with adjustable accuracy guarantees
and explore more effective pruning rules, which can reduce the computational cost
significantly. To accelerate real-time efficiency, we also devise a novel quadtree-based
index to support the efficient retrieval of users in a region and to optimise the search
regions with regard to the given query region. Finally, we verify the performance of
our proposed algorithms and index using five real datasets.
Chapter map. In Section 3.1, we give an overall introduction to the problem of
maximum co-located community search. Section 3.2 presents the proposed co-located
community model and formally defines the maximum co-located community search
problem. Section 3.3 presents the baseline, efficient exact algorithms and effective
techniques for pruning before and during the search. Section 3.4 presents a spa-
tial approximation algorithm that offers a variable spatial error ratio ranging from
27
2√
2 + ε to√
2 + ε′(ε and +ε′ are constant error factors), the proposed effective prun-
ing techniques, and index further speeding up the spatial approximation algorithm.
Experimental results are shown in Section 3.5, followed by the chapter summary in
Section 3.6.
3.1 Introduction
We study the co-located community search problem that reveals the maximum
communities with high social and spatial cohesiveness, denoted as (k,d)-MCCs search.
The social cohesiveness is defined using the minimum truss value k [26] and the spatial
cohesiveness is parameterised by a user-specified distance value d. As such, our
proposed (k,d)-MCCs search problem allows users to easily affirm the quality of
the resultant communities, which also fills the research gap on the type of GCS
with spatial constraints.
Given a social network G and two parameters k and d, a straightforward approach
is to enumerate all possible subgraphs of G meeting the minimum truss value k, where
the number of subgraphs could be as large as O(2^n). It then filters out the candidates
having a node pair whose distance exceeds the spatial closeness threshold d. The
time complexity of this approach is therefore at least O(2^n), where n is the number
of vertices in G. Obviously, it is infeasible to use this approach to support online
(k,d)-MCCs search, particularly for large-scale social networks. Thus, this chapter
focuses on devising efficient algorithms that achieve real-time response with
theoretical guarantees.
To address the efficiency challenge, we first develop an exact (k,d)-MCCs search
algorithm by proposing novel pruning techniques. During the search, we explore
techniques that prune the search space significantly by considering upper-bound-based
early termination, a heuristic search order, and conditions for reusing pruning
computation. Before searching, we also propose pre-pruning techniques that reduce
the magnitude of the input data. To design polynomial algorithms, we develop a novel
approximation scheme with spatial accuracy guarantees. Note that our proposed
approximation scheme can provide adjustable spatial error ratios based on the user's
requirement on the spatial
[Figure 3-1 (image): three panels over vertices a–u — (a) Graph data; (b) Spatial DIST; (c) Spatial network.]
Figure 3-1: Spatial attributed graph
accuracy. To further improve the performance of the approximation algorithm, we
propose more pruning techniques and also design the novel index TQ-tree.
3.2 Problem Definition
We consider a social network graph G = (V,E), an undirected graph with vertex
set V(G) and edge set E(G), where vertices represent social users and edges denote
their friendships. Each vertex v ∈ V(G) has a spatial attribute (v.x, v.y), where v.x
and v.y denote its spatial positions along the x- and y-axes in a two-dimensional
space.
Co-located community. A co-located community is a subgraph J ⊆ G satisfying:
(1) connectivity: J is connected, (2) structural cohesiveness: all vertices in J are
connected intensively, and (3) spatial cohesiveness: all vertices in J are spatially
close to each other.
Structural cohesiveness. We consider truss as the metric to measure the structural
cohesiveness of a co-located community. Truss is based on the number of triangles
in which each edge of a graph is involved. Given J, we denote a triangle involving
vertices u, v, w ∈ V(J) as △uvw. The support of an edge e(u, v) ∈ E(J), denoted by
sup(e, J), is the number of triangles containing e, i.e., sup(e, J) = |{△uvw : w ∈
N(v, J) ∩ N(u, J)}|, where N(v, J) and N(u, J) are the neighbours of v and u in J,
respectively. Next, we define the truss of a co-located community J as follows:
Definition 3.2.1. Subgraph truss. The truss of J ⊆ G, where |V (J)| ≥ 2, is the
minimum support of an edge in J plus 2, i.e., τ(J) = 2 + min_{e∈E(J)} {sup(e, J)}.
J is a connected k-truss if it is both connected and τ(J) ≥ k. Intuitively, a k-truss
is a subgraph in which the endpoints of each edge (u, v) have at least k − 2 common
neighbours. A k-truss with a large value of k indicates strong internal connections
among members. In a k-truss, each node has degree at least k − 1, implying that a
k-truss must be a (k − 1)-core. A connected k-truss is also (k − 1)-edge-connected.
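As a small illustration of these definitions (a plain-Python sketch, not code from this thesis; `support` and `truss` are illustrative names), the edge support and the truss number τ(J) of Definition 3.2.1 can be computed directly from an adjacency map:

```python
# Sketch: computing sup(e, J) and tau(J) on a small undirected graph
# represented as a dict of neighbour sets.

def support(adj, u, v):
    """Number of triangles containing edge (u, v): common neighbours of u and v."""
    return len(adj[u] & adj[v])

def truss(adj):
    """tau(J) = 2 + minimum edge support over all edges of J (|V(J)| >= 2)."""
    edges = {(u, v) for u in adj for v in adj[u] if u < v}
    return 2 + min(support(adj, u, v) for u, v in edges)

# A 4-clique: every edge lies in 2 triangles, so tau = 2 + 2 = 4.
adj = {'a': {'b', 'c', 'd'}, 'b': {'a', 'c', 'd'},
       'c': {'a', 'b', 'd'}, 'd': {'a', 'b', 'c'}}
print(truss(adj))  # -> 4
```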
Spatial cohesiveness. Let ed(u, v) denote the spatial distance between vertices u
and v. We first introduce the concept of spatial co-location to measure the spatial
cohesiveness. Then we define co-located community formally.
Definition 3.2.2. Spatial co-location. Given a distance
threshold d, a subgraph J ⊆ G is a spatial co-location graph if for every pair u, v ∈
V (J), ed(u, v) ≤ d holds.
Definition 3.2.3. Co-located community. Given a graph G, a positive integer
k, and a spatial distance d, J is a co-located community, if J satisfies the following
constraints:
• Structural cohesiveness. J is connected, τ(J) ≥ k.
• Spatial cohesiveness. J is a spatial co-location graph w.r.t. a spatial distance
d.
In general, when searching a community, users may want to maximise the mem-
bers contained in the community once they fix the spatial and social cohesiveness
parameters. Therefore, in this chapter, given a graph G, we study finding the max-
imum co-located communities, denoted as (k,d)-MCCs where k stands for k-truss, d
for spatial distance, M for maximum and CC for co-located community. Now we
formally define the problem of (k,d)-MCCs search.
Problem 3.2.1. (k,d)-MCCs search. Given a graph G, positive integer k and
number d, return any of those maximum co-located communities J ⊆ G, satisfying
constraints:
• J is a co-located community.
• There is no other co-located community J′ such that |V(J′)| > |V(J)|.
For example, in Figure 3-1(b), vertices in dark blue coloured areas are co-located.
Similarly, in Figure 3-1(a), three possible co-located communities are in blue coloured
areas with k = 4. The (4,d)-MCC here is the subgraph containing vertices {d, e, f, g, h, i}
with cardinality 6, as it is the maximum.
We may find (k,d)-MCCs from G by inspecting the whole graph. However, to
improve the search performance, we only want to search the parts of G that may
contain (k,d)-MCCs. To achieve that, we introduce the theorem below:
Theorem 3.2.1. (k,d)-MCCs of a graph G can be found from one of the maximal
connected k-trusses of G if they exist.
The proof is trivial since vertices that are not part of a maximal connected k-truss
clearly cannot meet the structural cohesiveness requirement in Definition 3.2.3.
By Theorem 3.2.1, the intuitive steps to find (k,d)-MCCs in G are: (1) compute
the maximal connected k-trusses (note: these k-trusses are non-overlapping), (2)
search for the local (k,d)-MCCs in each of these k-trusses, and (3) find the global
(k,d)-MCCs from the local ones by comparing cardinalities.
Analysis. A k-truss index can be built in O(|E(G)|^{3/2}) time for a graph G. The
k-truss index for G is essentially a list of edges associated with their edge trusses,
defined by τ(e, G) = max_{H⊆G ∧ e∈E(H)} {τ(H)} [51]. With the k-truss index, given
a k, we can retrieve all maximal connected k-trusses in G in polynomial time. However,
it is still challenging to find local (k,d)-MCCs within a maximal connected k-truss
because: (1) the total number of spatial co-location subgraphs in the k-truss could be
exponential [45], and (2) there is no guaranteed monotonic relationship between the
size of a co-location subgraph and the size of its co-located communities.

From now on, we focus on finding (k,d)-MCCs in a maximal connected k-truss T ⊆
G. The notations used frequently in the following sections are summarised in Table 3.1.
Table 3.1: Notations

Notation     | Definition
T            | initially a maximal connected k-truss graph
T′           | spatial neighbourhood network for T
T′0          | a connected component of T′, T′0 ⊆ T′
u, v, w      | individual vertices
ed(u, v)     | spatial distance between u and v
gd(u, v, T)  | distance between u and v in T
deg(u, G)    | degree of u in G
N(u, G)      | neighbours of u in G
τ(G)         | the minimum truss of G
A, A         | a maximal clique, a set of maximal cliques
R, P, X      | vertex sets
T(R), T′(R)  | subgraphs of T and T′ induced by the vertices in R
c            | a square spatial cell with width w
m, M         | a landmark cell and a set of landmark cells
r            | a square spatial region consisting of cells
ζ            | an integer, denoting a number of cells
Vr           | the set of vertices located in a region r
p            | an error-bounded search bound region
K(r)         | the k-truss in a region r
3.3 Finding Exact Results
We first introduce a definition as follows:
Definition 3.3.1. Spatial neighbourhood network. Given a T and a distance
d, a spatial neighbourhood network for T is a graph T ′, which is an undirected graph
with V (T ′)=V (T ) and E(T ′)={(u, v)|ed(u, v) < d ∧ u, v ∈ V (T )}.
Finding a (k,d)-MCC is equivalent to finding an unextendable vertex set R such
that the R-induced subgraph T ′(R) of T ′ is a clique while the R-induced subgraph
T (R) of T contains a connected k-truss GR = (R,ER) where ER ⊆ E(T (R)).
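Definition 3.3.1 can be sketched directly (an illustrative fragment assuming 2-D Euclidean coordinates; `spatial_network` is a hypothetical name, not from this thesis):

```python
# Sketch of building the spatial neighbourhood network T' from vertex
# coordinates and a distance threshold d, following Definition 3.3.1:
# V(T') = V(T) and (u, v) is an edge iff ed(u, v) < d.
from itertools import combinations
from math import hypot

def spatial_network(coords, d):
    adj = {v: set() for v in coords}
    for u, v in combinations(coords, 2):
        (ux, uy), (vx, vy) = coords[u], coords[v]
        if hypot(ux - vx, uy - vy) < d:  # ed(u, v) < d, as in Definition 3.3.1
            adj[u].add(v)
            adj[v].add(u)
    return adj

coords = {'a': (0, 0), 'b': (1, 0), 'c': (5, 5)}
tp = spatial_network(coords, d=2.0)
print(tp)  # 'a' and 'b' are linked; 'c' is spatially isolated
```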
Next, we show the baseline algorithm to find (k,d)-MCCs.
3.3.1 Baseline Algorithm
Given T and T′, the baseline algorithm finds all the maximal cliques contained
in T′ and checks the sorted maximal cliques one by one. For each maximal clique
A (the set of vertices in a maximal clique), we compute the local (k,d)-MCCs in
T(A). After all maximal cliques have been checked, we compare the cardinalities of
the local (k,d)-MCCs and obtain the global (k,d)-MCCs.
Algorithm 1: baseline(T, T′, b = 0)
1:  R ← mccSearch(T, T′, b);
2:  return R;
3:  Procedure mccSearch(T, T′, b)
4:    A ← bkp(∅, V(T′), ∅);
5:    sort A in descending order by clique cardinality;
6:    for each A ∈ A do
7:      if |A| > b then
8:        R′ ← maximum k-trusses in T(A);
9:        b ← kdmccCollect(R, R′, b);
10: Procedure kdmccCollect(R, R′, b)    /* R′[0]: the first element of R′ */
11:   if |V(R′[0])| == b then
12:     collect R′ into R;
13:   if |V(R′[0])| > b then
14:     b ← |V(R′[0])|;
15:     replace R by R′;
Baseline algorithm. The baseline algorithm is presented in Algorithm 1. It ensures
correctness by giving every maximal clique A ∈ A a chance. To improve search
efficiency, Algorithm 1 uses a heuristic rule and a bound to prune small maximal
spatial cliques. The heuristic rule assumes that the larger the size of a spatial clique,
the larger the size of the contained (k,d)-MCCs may be; it is implemented by sorting
the generated maximal cliques (line 5). The bound b is initialised to 0 and is
continuously updated to the maximum size of the (k,d)-MCCs found so far. A
maximal clique is pruned if its size is no larger than b.
Collect candidate results. In Algorithm 1, the procedure kdmccCollect collects
candidate results. It checks the maximality of the currently found (k,d)-MCCs in R′
and determines whether they should be added to the previously found results in R,
replace R, or be discarded (lines 11 to 15 in Algorithm 1). During the process, the
bound b is updated if necessary.
Avoid duplication. Since (k,d)-MCCs are contained in spatial-clique-induced
subgraphs of T, multiple spatial cliques may contain the same (k,d)-MCC. To avoid
duplication, we assign a unique key to each (k,d)-MCC based on the vertices it
contains. Before a new (k,d)-MCC is collected into the result R, duplication is
checked by verifying whether its key already exists.
Example. We show an example of using Algorithm 1 to find (4,d)-MCCs. The input
social graph is the 4-truss in Figure 3-1(a) and its spatial network is in Figure 3-1(c).
Firstly, the maximal cliques are obtained and sorted by size (see Table 3.2). The
bound history and the corresponding (k,d)-MCCs after each iteration are displayed in
Table 3.3. The iteration stops when the bound b = 6 exceeds the sizes of all the
remaining cliques.
Table 3.2: Maximal cliques contained in Figure 3-1(c)

Card. | Cliques
8     | {a, b, c, d, e, h, i, j}
6     | {d, e, f, g, h, i}
4     | {r, s, p, q}, {m, n, k, l}
2     | {p, o}, {o, l}, {t, u}

Table 3.3: Enumeration trace

Iter. | Clique                   | Bound | R
0     | NULL                     | b = 0 | ∅
1     | {a, b, c, d, e, h, i, j} | b = 4 | {{a, c, b, j}, {d, e, h, i}}
2     | {d, e, f, g, h, i}       | b = 6 | {{d, e, h, g, f, i}}
3     | {r, s, p, q}             | b = 6 | {{d, e, h, g, f, i}}
Time complexity. The dominating part of Algorithm 1 is listing all maximal cliques,
which is O(3^{|V(T)|/3}) using the state-of-the-art algorithm bkp in [96]. The other
part is finding the k-trusses with maximum cardinality in T(A), where A is the vertex
set contained by a maximal clique of T′. To compute the maximum k-trusses in T(A),
we use the method in [99], whose time complexity is bounded by O(|E(T)|^{3/2}).
Therefore, the complexity of Algorithm 1 is O(3^{|V(T)|/3} + Σ_{A∈A} |E(T(A))|^{3/2}),
where A is the set of maximal spatial cliques contained in T′.
3.3.2 Efficient (k,d)-MCC Search
The baseline method finds (k,d)-MCCs in two steps. Firstly, it generates all vertex
sets meeting the requirement of spatial cohesiveness, i.e., spatial cliques. Secondly,
it verifies social cohesiveness for each generated spatial clique, finds the (k,d)-MCCs
in each clique, and selects the maximum ones.

However, a valid observation is that if we check social cohesiveness right after
a clique is generated, i.e., find the candidate (k,d)-MCC(s) in a found maximal
clique before enumerating all the remaining cliques, we can use the size of the largest
candidate (k,d)-MCC as a bound to stop generating unpromising cliques, i.e., cliques
that cannot contain larger (k,d)-MCCs. Moreover, as the size of the candidate
(k,d)-MCC(s) becomes larger, the pruning also becomes more effective.
As a result, in this section, we develop an efficient (k,d)-MCC search algorithm. It
differs from the baseline in two respects. Firstly, after a spatial clique is generated,
we search for the (k,d)-MCCs in the clique-induced social graph immediately, and
the bound is updated to the largest size of the (k,d)-MCCs found so far. Secondly,
before generating a clique, we check whether the current clique search branch is able
to generate candidate (k,d)-MCCs with sizes greater than the current bound; if not,
we terminate the branch.

The (k,d)-MCC search algorithm is shown in Algorithm 2. It is based on the
maximal clique enumeration algorithm [96], which is briefly reviewed in
Section 3.3.2.1, with four non-trivial modifications: (1) finding candidate maximum
(k,d)-MCCs immediately after generating a maximal clique (line 7); (2) terminating
a search branch if no larger (k,d)-MCCs can exist, based on four pruning conditions
(line 5), Section 3.3.2.2; (3) a heuristic rule to find larger (k,d)-MCCs at early stages
by carefully selecting promising vertices to expand the candidates (line 10),
Section 3.3.2.3; (4) reducing the cost of computing pruning conditions by reusing
previous results where possible (related to line 5), Section 3.3.2.4.
Algorithm 2: effiMCCSearch(T, T′)
1:  b ← 0, R ← ∅;
2:  mccbkp(∅, V(T′), ∅);
3:  return R;
4:  Procedure mccbkp(R, P, X)
5:    terminate this branch based on the termination conditions;
6:    if P ∪ X == ∅ then
7:      R′ ← find the maximum connected k-truss in T(R);
8:      b ← kdmccCollect(R, R′, b);
9:    u ← select a pivot from P;
10:   for each v ∈ P \ N(u, T′) do
11:     mccbkp(R ∪ {v}, P ∩ N(v, T′), X ∩ N(v, T′));
12:     P ← P \ {v};
13:     X ← X ∪ {v};
3.3.2.1 Revisit of Maximal Clique Enumeration
Maximal clique enumeration. bkp [96] works on three vertex sets R, P and X and
finds all the maximal cliques in T′. In each recursion state, R records the clique found
so far, P contains the vertices that may still be added to R, and X contains the
vertices that were previously added to R and are now explicitly excluded. P and X
are disjoint, and together they contain all the vertices that are adjacent to every
vertex in R. Initially, R and X are empty and P is V(T′). From P, bkp picks a
v ∈ P, adds v to R and removes v's non-neighbours from P and X, i.e., P ← P ∩
N(v, T′) and X ← X ∩ N(v, T′). Then bkp recursively calls itself and performs the
same operation on the newly generated R, P and X until P becomes empty. It then
reports a maximal clique if the current X is empty. The reason is that if X ≠ ∅, R is
not maximal, because a vertex in X could be added to R to form a larger clique.
After finishing the recursive search branch of adding v to R, bkp restores R, removes
v from P, adds v to X, and then expands R with the next vertex in P.
Pruning search branches with pivots. Given a search state R, P and X, let
u ∈ P (in fact, u can be chosen from P ∪ X). The intuition is that cliques generated
by expanding R with a vertex in P ∩ N(u, T′) can always be further expanded by
adding u subsequently. Therefore, it
is safe to expand R with P \N(u, T ′) only. To pursue the maximum pruning power,
a vertex u maximising |P ∩N(u, T ′)| shall be chosen, called a pivot.
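The pivoted enumeration described above can be sketched as follows (a minimal illustration over plain adjacency sets, not the thesis's implementation):

```python
# Bron-Kerbosch with pivoting (bkp) on sets R, P, X over an adjacency map.
# The pivot u maximises |P ∩ N(u)| (chosen from P ∪ X, as the footnote
# allows), and R is only expanded with vertices in P \ N(u).

def bkp(adj, R, P, X, out):
    if not P and not X:
        out.append(set(R))  # R is maximal: no vertex can extend it
        return
    u = max(P | X, key=lambda w: len(P & adj[w]))
    for v in list(P - adj[u]):
        bkp(adj, R | {v}, P & adj[v], X & adj[v], out)
        P.remove(v)         # v's cliques are fully explored; exclude it
        X.add(v)

adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
cliques = []
bkp(adj, set(), set(adj), set(), cliques)
print(cliques)  # the maximal cliques {a, b, c} and {c, d}, in some order
```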
Clearly, once a maximal clique is generated, we can search for (k,d)-MCCs
immediately (line 7, Algorithm 2). The largest size of the candidate (k,d)-MCCs
found so far is used as a bound. In the following, we focus on how the prunings, the
order heuristic, and computation reuse are implemented, respectively.
3.3.2.2 Terminating Unpromising Branches Earlier
The idea is to estimate an upper bound on the size of the (k,d)-MCCs in the current
search branch. If the upper bound is smaller than the current bound b, we terminate
the branch. There are four upper bounds. (1) If |R ∪ P| < b, we can terminate the
branch: if the largest possible clique is already smaller than b, any contained
(k,d)-MCC is also smaller than b. (2) Let K(R ∪ P) be the maximum connected
k-truss in the induced graph T(R ∪ P); then |V(K(R ∪ P))| is an upper bound. This
ignores the spatial constraints in T′. (3) The largest possible truss number within the
induced subgraph T′(R ∪ P) is an upper bound on the maximum clique size in T′.
This ignores the social constraints in T. (4) Combining (2) and (3), we can obtain a
tighter bound, defined based on the (k, k′)-truss below:
Definition 3.3.2. (k, k′)-truss. Given T , T ′ and a vertex set S such that S ⊆
V (T ) ∧ S ⊆ V (T ′), if T (S) is a connected k-truss in T and T ′(S) is a connected
k′-truss in T ′, we say (T (S), T ′(S)) is a (k, k′)-truss. For ease of discussion, we also
call S a (k, k′)-truss.
Let k′max be the largest possible truss number such that a (k, k′max)-truss is
contained in T(R ∪ P) and T′(R ∪ P); then k′max is a tight upper bound on the size
of the (k,d)-MCCs in the current recursion branch.
The above bounds are applied one after another in the order discussed. This is
because their computation costs increase accordingly, and we want to terminate an
unpromising branch as early as possible. If a pruning with a loose bound suffices,
we can avoid computing a tighter bound expensively.
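The cheapest-first ordering can be sketched as a short-circuiting check (illustrative only; `ktruss_bound`, `clique_bound` and `kk_bound` are hypothetical stand-ins for the routines computing bounds (2), (3) and (4)):

```python
# Apply the four termination bounds from cheapest to most expensive, so a
# loose bound can short-circuit the costlier ones.

def should_terminate(R, P, b, ktruss_bound, clique_bound, kk_bound):
    cand = R | P
    if len(cand) < b:            # (1) even |R ∪ P| itself is too small
        return True
    if ktruss_bound(cand) < b:   # (2) max connected k-truss in T(R ∪ P)
        return True
    if clique_bound(cand) < b:   # (3) max truss number of T'(R ∪ P)
        return True
    return kk_bound(cand) < b    # (4) tight (k, k'_max)-truss bound

# With b = 5 and a candidate set of size 3, check (1) fires and the
# (deliberately exploding, never-evaluated) expensive bounds are skipped:
boom = lambda s: 1 / 0  # would raise ZeroDivisionError if ever called
print(should_terminate({'a'}, {'b', 'c'}, 5, boom, boom, boom))  # -> True
```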
3.3.2.3 Search Order
Given that a larger (k,d)-MCC size helps the algorithm terminate earlier, we design
a heuristic rule that aims to obtain large (k,d)-MCCs first. The rule is as follows:
given a search state with R and P, when we need to select which vertex in
P \ N(u, T′) should be added to R first (line 10, Algorithm 2), we preferentially
choose from the vertices in the (k, k′max)-truss contained in T(R ∪ P) and
T′(R ∪ P), because adding such vertices is likely to generate larger (k,d)-MCCs.
Among the vertices in the (k, k′max)-truss, we add the vertex v with the largest
deg(v, T′), breaking ties arbitrarily.
3.3.2.4 Computation Reuse for Pruning
Computing the upper bounds in cases (2), (3) and (4) of Section 3.3.2.2 may not be
cheap, even though truss decomposition takes polynomial time [51]. However, a nice
observation is that a search state (R, P, X) and its child state (Rc, Pc, Xc) are likely
to have similar truss results. Suppose Rc = R ∪ {v} and Pc = P ∩ N(v, T′); it is
easy to see that Rc ∪ Pc ⊆ R ∪ P. As a result, the maximum k-trusses in
T(Rc ∪ Pc) are subgraphs of the maximum k-trusses in T(R ∪ P), so the
computation can be done incrementally using truss maintenance techniques [51], by
passing the existing T(R ∪ P), T′(R ∪ P) and truss indices to the child recursions.
Similarly, the (k, k′max)-truss can be computed incrementally as well.
On the other hand, there are some special cases where we can cheaply determine
that a child state cannot be pruned: (1) if |R ∪ P| = |Rc ∪ Pc|, the child state's
upper bounds are the same as the parent's; (2) let K be the maximum k-truss in
T(R ∪ P); if V(K) ⊆ Rc ∪ Pc, the child state cannot be pruned; (3) let S be the
(k, k′max)-truss in T(R ∪ P) and T′(R ∪ P); if S ⊆ Rc ∪ Pc, the child state cannot
be pruned. Proofs are omitted as the correctness is obvious.
3.3.2.5 Example and Discussion
Example. We show an example of using Algorithm 2 to search for (k,d)-MCCs,
given the T and T′ in Figures 3-1(a) and (c). Initially, R = ∅ and P = {a, . . . , u}.
Algorithm 2 tries to terminate the recursions by computing all four upper bounds,
firstly producing a (4,6)-truss with vertices {d, e, f, g, h, i}. Then pivot h is selected,
so Algorithm 2 only needs to expand R from P \ N(h, T′) = {h, r, s, q, p, o, m, n, k,
l, t, u} rather than from P. Next, based on the order heuristic, h is added to R,
reducing P to {a, b, c, d, e, f, g, i, j}. Such recursions continue until the first
(k,d)-MCC, {d, e, h, g, f, i}, is discovered, while the bound computation can be
reused from the step where d is added to R = {h}. After the first result is produced,
b is updated to 6. Using this bound, when Algorithm 2 backtracks to the recursion
state in which r is added to R, with R = {r}, P = {s, p, q} and X = ∅, the first
upper bound pruning condition terminates this branch because |R ∪ P| < 6. Other
search branches that would find the cliques in Table 3.2 are also pruned by the
proposed termination conditions.
Discussion. In [109], Zhang et al. proposed the (k, r)-core, which uses k-core instead
of k-truss to represent social cohesiveness. For comparison, we adapt the AdvMax
algorithm proposed in [109] to finding (k,d)-MCCs and denote it as KRM. KRM may
have a smaller search space, because the social constraint on T is also checked during
clique enumeration. However, after incorporating the social constraint check along
the way, the powerful pivot-based pruning for clique enumeration cannot be used,
because the classic pivot pruning works only for the structural part (which might be
why [109] used binary search rather than bkp). On the other hand, we have studied
adapting the pivot idea to consider both structural and social constraints.
Unfortunately, determining such pivots is very complicated and their pruning power
cannot be guaranteed. An experimental performance comparison between our
algorithms and KRM can be found in Section 3.5.
3.3.3 Prunings before (k,d)-MCCs Enumeration
In practice, a maximal connected k-truss T and its corresponding spatial neighbour-
hood network T ′ can be pruned before (k,d)-MCC enumeration. The aim is to reduce
the size of the input as much as possible so that (k,d)-MCC enumeration can be more
efficient.
Pruning vertices in T ′ (I). We introduce a k-truss property first, followed by the
explanation and the pruning rule.
Property 3.3.1. Every vertex v in a k-truss graph T has deg(v, T) ≥ k − 1 [26].

Intuitively, if T(V(T′)) contains a k-truss, then each v ∈ V(T′) should have at
least k − 1 neighbours in T. Accordingly, v should also have at least k − 1 neighbours
in T′.

Pruning Rule 3.3.1. For each v ∈ V(T′), if deg(v, T′) < k − 1, v can be pruned
from T′.
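Pruning Rule 3.3.1, applied repeatedly until it stabilises, can be sketched as a degree-peeling loop (illustrative, not the thesis's code; `prune_low_degree` is an assumed name):

```python
# Iteratively delete vertices with deg(v, T') < k - 1. Removing a vertex
# lowers its neighbours' degrees, which may trigger further removals (part
# of the cascading pruning effect described in this section).

def prune_low_degree(adj, k):
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    queue = [v for v in adj if len(adj[v]) < k - 1]
    while queue:
        v = queue.pop()
        if v not in adj:
            continue
        for u in adj.pop(v):
            adj[u].discard(v)
            if len(adj[u]) < k - 1:
                queue.append(u)
    return adj

# A triangle with a pendant vertex: for k = 3, 'd' (degree 1) is pruned, and
# the triangle a-b-c survives since every remaining degree stays >= 2.
adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
print(sorted(prune_low_degree(adj, 3)))  # -> ['a', 'b', 'c']
```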
Pruning edges in T′ (II). Next, we show another k-truss property, which can be
used to prune edges in the spatial network T′. The idea is that if two vertices are so
far from each other in T that they cannot be in the same k-truss, then even though
they are spatially close in T′, their link in T′ can be discarded when enumerating
(k,d)-MCCs.

Property 3.3.2. The structural diameter of a connected k-truss T with |V(T)|
vertices is no more than ⌊(2|V(T)| − 2)/k⌋ [26].

Pruning Rule 3.3.2. Given T and T′, let T′0 be a connected component of T′.
An edge e(u, v) ∈ E(T′0) can be pruned if gd(u, v, T(V(T′0))) > ⌊(2|V(T′0)| − 2)/k⌋,
where gd(u, v, T(V(T′0))) denotes the distance between u and v in T(V(T′0)).
Pruning Rule 3.3.2 is correct: suppose vertices u, v co-exist in a connected k-truss
K ⊆ T(V(T′_0)); then gd(u, v, T(V(T′_0))) ≤ gd(u, v, K) ≤ ⌊(2|K| − 2)/k⌋ ≤
⌊(2|V(T′_0)| − 2)/k⌋, which contradicts the pruning condition.
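Pruning Rule 3.3.2 can be sketched as a BFS distance check against the diameter bound ⌊(2n − 2)/k⌋. The helper names and adjacency representation below are illustrative assumptions, not the thesis implementation.

```python
from math import inf

def bfs_dist(adj, src, dst):
    """Hop distance between src and dst in the structural graph."""
    if src == dst:
        return 0
    seen, frontier, d = {src}, [src], 0
    while frontier:
        d += 1
        nxt = []
        for v in frontier:
            for u in adj.get(v, ()):
                if u == dst:
                    return d
                if u not in seen:
                    seen.add(u)
                    nxt.append(u)
        frontier = nxt
    return inf

def edge_prunable(u, v, struct_adj, n, k):
    # Pruning Rule 3.3.2: an edge (u, v) of T'_0 can be dropped when the
    # structural distance of u and v in T(V(T'_0)) exceeds the k-truss
    # diameter bound floor((2n - 2) / k), with n = |V(T'_0)|.
    return bfs_dist(struct_adj, u, v) > (2 * n - 2) // k
```

For example, on a structural path a-b-c-d with n = 4 and k = 3 the bound is 2, so a spatial edge between a and d (structural distance 3) is prunable while one between a and c is not.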
Pruning vertices in T (III). Let D be a set of vertices pruned from T ′, obviously
they should also be removed from T . After removing D from T , another set of vertices
D′ in T may be further removed due to truss maintenance [51]. D′ will need to be
pruned from T ′.
Cascading pruning effect. We summarise the cascading pruning effect here:
(1) Pruning I triggers Prunings II and III, because after pruning vertices in T′,
gd(u, v, T(V(T′_0))) becomes larger and ⌊(2|V(T′_0)| − 2)/k⌋ becomes smaller, so more
edges may be further pruned from T′; also, the vertices pruned from T′ should be
removed from T. (2) Pruning II triggers Pruning I, because after some edges are
pruned, some vertex degrees decrease, which may lead to new vertices being pruned
from T′. (3) Pruning III triggers Pruning I, as explained above.
In implementation, vertex Prunings I and III are prioritised as they are cheap. A
shortest-path index [104] is maintained to support Pruning II. Pruning stops when no
more changes occur.
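The cascading effect can be organised as a generic fixpoint driver that restarts from the cheapest rule whenever any rule removes something. This is only a sketch of the control flow; the toy rules in the usage below stand in for Prunings I to III.

```python
def cascade(rules, state):
    """Run pruning rules to a fixpoint, cheapest rules first.

    rules: callables ordered by increasing cost; each mutates `state`
    and returns True iff it removed something.  Whenever a rule fires,
    we restart from the first (cheapest) rule, mirroring the cascading
    effect of Prunings I to III described above.  The loop terminates
    because every firing strictly shrinks the state.
    """
    i = 0
    while i < len(rules):
        i = 0 if rules[i](state) else i + 1
    return state
```

A toy usage: with rules that each delete one specific element, `cascade([rm(2), rm(3)], {1, 2, 3})` repeatedly restarts until neither rule fires, leaving `{1}`.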
3.4 Finding Spatial Approximate Result
Since all the exact algorithms proposed in Section 3.3 can take exponential time in the
worst case, we aim to design a polynomial-time algorithm by relaxing the spatial
constraint. The polynomial algorithm approximately finds co-located communities,
which are still k-trusses but allow vertices within longer distances, and the spatial
distances can be theoretically bounded. We firstly discuss how a (k,d)-MCC should be
approximated. Then we propose three types of search bound regions: the upper bound
region, the tight bound region and the error-bounded region. Using these bound
regions, we design an algorithm that finds approximate results meeting a
user-specified spatial error ratio requirement.
3.4.1 How to Approximate (k,d)-MCCs
It is desirable to have efficient algorithms for finding approximate results.
Accordingly, several questions are interesting: How should approximation be defined?
What are good approximation results? Can users specify their own approximation
preference, i.e., to what extent the discovered results are approximate? We answer
these questions in this section.
Define approximation. Firstly, a (k,d)-MCC is considered cohesive both struc-
turally and spatially. In general, both structural and spatial constraints can be re-
laxed, however, since the exponential number of exact (k,d)-MCCs comes from check-
ing spatial constraints, we only study the approximate results with spatial constraints
relaxed. Let us define an α-approximation of a (k,d)-MCC below:
Definition 3.4.1. Approximate (k,d)-MCC. Let J be a (k,d)-MCC and J′ be a (k,d′)-CC
satisfying J ⊆ J′ and d ≤ d′. We consider J′ an α-approximation of J with
spatial error ratio α = d′/d, where α ∈ [1, +∞).
Here, J ′ is a k-truss with the maximum distance between vertices in V (J ′) no
more than d′. Technically, α can be less than 1, but this is not desired.
Reasonable approximation. From the definition, the α-approximation of a (k,d)-MCC
is not unique: the maximal α-approximation of a (k,d)-MCC is a (k,αd)-MCC, while
the minimal α-approximation of a (k,d)-MCC is the (k,d)-MCC itself. Both the maximal
and minimal α-approximations lead to an exponential number of co-located communities.
As a result, a polynomial algorithm that can discover any α-approximation should
suffice. However, superiority does exist among approximations, e.g., let J′1, J′2
be two α-approximations of a particular (k,d)-MCC; if J′1 ⊆ J′2, then J′1 is
considered better than J′2, because J′1 is “cleaner”.
Specify error ratio. Ideally, users should be able to specify their preferred spatial
error ratios, because different users may have different requirements. With a given
spatial error ratio α, the approximate algorithm finds α-approximation results
accordingly. In the next section, we introduce how a spatial index is used to
guarantee the error ratio.
Figure 3-2: Rectangular regions. (a) Upper bound; (b) Tight bound; (c) Vertex residence; (d) Truss residence.
3.4.2 Spatial Index and Search Bounds
The idea of searching in polynomial time is to delegate spatial constraint checking
to a spatial index. The outcome is that, with the index, we can cheaply locate a
region or a (limited) number of regions within which we only need to check structural
constraints when searching for k-trusses, because the user-specified spatial error
ratio is guaranteed for the k-truss results discovered within the located regions. In
the following, we first introduce the spatial index, and then elaborate on how the
two typical bound regions and the error-bounded search regions are identified.
Space division. We consider the space divided into equal-sized cells, each a w × w
square; w is fixed once the space is divided. Vertices are distributed into the
cells. If a cell is not empty, we call it a landmark cell. Rectangle regions and
square regions, used later, are defined below.
Definition 3.4.2. Rectangle region. A rectangle region is a subspace of the entire
space, with a rectangle shape containing only complete cells. Square region is defined
similarly.
Now the problem is, for each landmark cell m, to identify proper square bound
regions from which k-trusses should be discovered. Two types of bound regions are
interesting: (a) the upper bound region is a big bound region that can cover all the
exact (k,d)-MCCs; (b) the tight bound regions are a set of regions, each of which
covers some exact (k,d)-MCCs and they together cover all the exact (k,d)-MCCs. The
tight bound regions can provide the best possible error ratio among all the bound
regions covering the exact (k,d)-MCCs. In the following, we introduce them in detail.
Upper bound region. Given a landmark cell m and a distance d, the upper bound
region rm identified by m is an area covering all possible vertices whose distances to
every vertex in m are no greater than d. Apparently, the theoretical upper bound
region is irregular. However, for easy computation, we define square upper bound
region as the minimal square region that covers it, formally: let ζ be the integer
such that (ζ − 1)w < d ≤ ζw; the square region centred at m with side size (2ζ + 1)w
is the square upper bound region. In later discussions, upper bound region is used
as short for square upper bound region, denoted as rm.
In Figure 3-2 (a), we show two landmark cells m1 and m2 and their upper bound
regions in red and blue. Next, we show the spatial error ratio of the upper bound
region.
Lemma 3.4.1. The spatial error ratio of the upper bound region is 2√2 + ε, where
ε = 3√2·w/d.

Proof sketch. The upper bound region rm is a square with side size (2ζ + 1)w, so
the longest distance d_rm within rm is bounded by the diagonal, d_rm ≤ √2(2ζ + 1)w =
2√2·ζw + √2·w. Combining this with ζw < d + w (from (ζ − 1)w < d), we obtain
d_rm ≤ 2√2·d + 3√2·w, and therefore d_rm/d ≤ 2√2 + ε, where ε = 3√2·w/d.
The 2√2 + ε error ratio may be too loose for most applications, because spatial
closeness has been relaxed to nearly 3d (2√2 + ε ≈ 3). In the following, we
introduce tight bound regions, which bound approximate results within a √2 + ε′
error ratio.
Tight bound region. The upper bound region has side size (2ζ+1)w > 2d. On the
other hand, we observed that an exact (k,d)-MCC must be able to fit into a d-square
(a square with side size d). This motivates us to search for the approximate results
from square regions as small as possible while still not losing any exact (k,d)-MCCs.
To this end, we define the tight bound region formally as follows: let ζ be the
integer such that (ζ − 1)w < d ≤ ζw; a square region containing m with side size
(ζ + 1)w is a square tight bound region. Again, we use tight bound region as short
for square tight bound region. Note that a landmark cell m identifies ζ² tight
bound regions, because there are ζ² squares of side size (ζ + 1)w within a square
of side size (2ζ + 1)w.
Lemma 3.4.2. The spatial error ratio of a tight bound region is √2 + ε′, where
ε′ = 2√2·w/d.

The proof is similar to that of Lemma 3.4.1.
In Figure 3-2 (b), we show three possible tight bound regions (dashed squares) for
the landmark cell m2, supposing ζ has been identified as 2.

The bound regions discussed so far are typical cases providing dedicated spatial
error ratios. Next, we discuss how to identify bound regions satisfying a
user-specified error ratio.
Error-bounded region. Given a landmark cell m and a distance d, let α be a
user-given error ratio; then αd is the maximum spatial distance allowed. Let ζ′ be
an integer; an error-bounded region is a square region containing the landmark cell
m with side size (ζ′ + 1)w, where ζ′ = argmax{ζ′ ∈ Z | √2(ζ′ + 1)w ≤ αd}. With
w, d, α given, ζ′ can be determined as ⌊αd/(√2·w)⌋ − 1. After that, all the square
regions containing m with side size (ζ′ + 1)w, taken from the square region centred
at m with side size (2ζ′ + 1)w, are retrieved as α-error-bounded regions.
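The choice of ζ′ can be computed directly; a small sketch verifying that ⌊αd/(√2·w)⌋ − 1 is indeed the largest ζ′ with √2(ζ′ + 1)w ≤ αd:

```python
import math

def zeta_prime(w, d, alpha):
    """Largest integer z satisfying sqrt(2) * (z + 1) * w <= alpha * d,
    i.e. z = floor(alpha * d / (sqrt(2) * w)) - 1."""
    return math.floor(alpha * d / (math.sqrt(2) * w)) - 1
```

For instance, with w = 100, d = 2000 and α = 2, we get ζ′ = 27: the diagonal of a 28-cell square is about 3960 m ≤ 4000 m, while a 29-cell square would exceed αd.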
3.4.3 Prunings
Inspecting error-bounded regions for all landmark cells is costly. If we can somehow
know that the k-trusses covered by one error-bounded region r1 are a subset of those
covered by another error-bounded region r2, then r1 can be discarded. Next, we
introduce how such prunings are supported by checking containment between vertex
residence regions and truss residence regions.
Definition 3.4.3. Vertex residence region. Given a square region r, the vertex
residence region Rev(r) is the minimum rectangle subspace of r containing all the
vertices in r.
For example, Figure 3-2 (c) shows the vertex residence regions of the two upper
bound regions from Figure 3-2 (a) (assuming all the vertices are in the same k-truss).
Pruning Rule 3.4.1. Pruning vertex residence region. Let r1, r2 be two
bound regions, Rev(r1), Rev(r2) be two vertex residence regions, r1 can be pruned
if Rev(r1) ⊆ Rev(r2).
For example, in Figure 3-2 (d), since the vertex residence region within the red
rectangle contains the vertex residence region within the blue, the inner blue rectangle
can be pruned.
Defining the vertex residence region as a rectangle region makes computation easy.
An alternative is to consider only the non-empty cells as the vertex residence
region; although this provides better pruning power, checking containment between
irregular regions is more expensive.
Similarly, a more powerful pruning condition is to consider only those k-truss
vertices.
Definition 3.4.4. Truss residence region. Given a region r, let K(r) be the
k-truss contained in r; the truss residence region Ret(r) is the minimum rectangle
subspace of r that contains all the vertices V(K(r)). (K(r) may be a disconnected
graph; it may have a set of connected components, each of which is a connected k-truss.)
Pruning Rule 3.4.2. Pruning truss residence region. Let r1, r2 be two bound
regions, Ret(r1), Ret(r2) be two truss residence regions, if Ret(r1) ⊆ Ret(r2), r1 can
be pruned.
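Both residence-region pruning rules reduce to an axis-aligned rectangle containment test; a minimal sketch (the tuple encoding of rectangles is an assumption made for illustration):

```python
def contains(outer, inner):
    """Axis-aligned containment of residence rectangles, each given as
    (x_min, y_min, x_max, y_max).  Pruning Rules 3.4.1 / 3.4.2 discard
    a bound region whose (vertex or truss) residence rectangle lies
    entirely inside another region's residence rectangle."""
    ox1, oy1, ox2, oy2 = outer
    ix1, iy1, ix2, iy2 = inner
    return ox1 <= ix1 and oy1 <= iy1 and ix2 <= ox2 and iy2 <= oy2
```

The test is four comparisons per pair, which is why rectangles are preferred over irregular cell unions despite their weaker pruning power.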
3.4.4 Error-bounded Approximation Algorithm
In this section, we introduce the entire error-bounded approximation algorithm.
Similar to the exact algorithms, we start with a maximal connected k-truss T from the
structural point of view. Recall that T does not necessarily satisfy the spatial
constraint that every two vertices are within distance d. The key idea of the
approximate (k,d)-MCC search is as follows. Firstly, we retrieve the landmark cells
(denoted as M) which together hold all the vertices of T, i.e. V(T) ⊆ V(M).
Secondly, for each landmark cell m ∈ M, the algorithm computes the α-error-bounded
regions of m. Thirdly, for each α-error-bounded region, local maximum structural
k-trusses are identified. Lastly, the final maximums are selected among the local
maximums and returned as the approximate (k,d)-MCCs with the guaranteed spatial
error ratio α.
Algorithm 3: apxSearch(T, M, α)
1:  b ← 0, R ← ∅;
2:  prune each landmark cell m ∈ M by applying Pruning Rules 3.4.1 and 3.4.2 on the upper bound region rm of m;
3:  sort all m ∈ M according to the size of K(rm) in descending order;
4:  for m ∈ M do
5:      if |V(K(rm))| ≥ b then
6:          P ← generate error-bounded regions with ζ′ = ⌊αd/(√2·w)⌋ − 1;
7:          prune P based on Pruning Rule 3.4.1;
8:          sort all p ∈ P according to |Vp| in descending order;
9:          for p ∈ P do
10:             if |Vp| ≥ b then
11:                 R′ ← maximum connected k-trusses in T(Vp);
12:                 b ← kdmccCollect(R′, R, b);
13: return R;
Algorithm 3 shows how to search approximate (k,d)-MCCs given a spatial error
ratio α. It firstly prunes landmark cells using Pruning Rules 3.4.1 and 3.4.2.
Figure 3-3: TQ-tree
Table 3.4: Truss ids and union of truss-to-vertex descriptions

Truss ID | k | Vertices
t1       | 2 | a . . . u and 1
t1.1     | 4 | a . . . u
t1.1.1   | 5 | d, e, h, i, f

Vertex | Truss ID
a      | t1, t1.1
b      | t1, t1.1
...    | ...
Then, for each m, it generates error-bounded regions and prunes them according to
Pruning Rule 3.4.1 (lines 6 and 7). For each search region, it then computes the
local approximate results in T(Vp). Rather than computing k-trusses from scratch, we
compute them incrementally using K(rm); in this way, duplicated computation is
avoided. To terminate the algorithm early, the size of the best approximate result
found so far is used as a bound (lines 5 and 10), and the search bound regions are
sorted (lines 3 and 8).
Time complexity. The time complexity of Algorithm 3 is O(|V(T)| · (⌊αd/(√2·w)⌋ − 1)² ·
|E(T)|^{3/2}). The running time of Algorithm 3 is dominated by the nested loop. The
outer loop is bounded by |V(T)|, since there are at most |V(T)| landmark cells. The
inner loop is bounded by (⌊αd/(√2·w)⌋ − 1)², the number of error-bounded regions.
The computation inside the nested loop is dominated by truss maintenance, bounded
by |E(T)|^{3/2}.
3.4.5 Truss Attributed Quadtree Index
In this section, we show how to index the divided cells associated with useful truss
pre-computations using the proposed truss attributed quad tree, denoted by TQ-tree.
Table 3.5: Description files for a branch

q1:       t1.1: q1.1, q1.2, q1.3;  t1.1.1: q1.2
q1.2:     t1.1: q1.2.3, q1.2.4;  t1.1.1: q1.2.3, q1.2.4
q1.2.3:   t1.1: q1.2.3.2, q1.2.3.4;  t1.1.1: q1.2.3.2, q1.2.3.4
q1.2.3.4: t1.1: e, f, i;  t1.1.1: e, f, i
It speeds up the operation of retrieving the vertices of a k-truss contained in a
region. Besides, it offers multiple choices of cell size, helping locate the
error-bounded regions efficiently.
TQ-tree components. We firstly introduce several components of the index: truss
list, vertex to truss description, truss to quad description and truss vertex to leaf
quad description below:
A truss list contains all connected trusses in a graph G. For each truss t, an
identifier is assigned, denoted as t.id. Since a truss with a small k may contain
trusses with larger k’s, the id we assign to a truss is similar to Dewey Decimal, which
explicitly expresses the containment relationships. For instance, in Figure 3-1 (a), we
have truss ids for different k’s in Table 3.4.
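With Dewey-style ids, checking whether one truss contains another reduces to a dotted-prefix test on the ids, e.g.:

```python
def truss_contains(outer_id, inner_id):
    """Dewey-style truss ids encode containment as a prefix relation:
    t1.1 lies inside t1, and t1.1.1 inside both."""
    outer = outer_id.split('.')
    return inner_id.split('.')[:len(outer)] == outer
```

Splitting on dots (rather than comparing raw string prefixes) avoids false positives such as treating t1.12 as contained in t1.1.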
Given a vertex v, the vertex-to-truss description returns all identifiers of the
connected trusses containing v. Using a hash table, given a vertex v and a k-truss t,
we can check whether t contains v in constant time.
Given a truss identifier tid and a non-leaf quad q of the TQ-tree, the truss-to-quad
description for q returns the direct children of q that contain at least one vertex
of the truss identified by tid.

Given a truss identifier tid and a leaf quad ql of the TQ-tree, the
truss-vertex-to-leaf-quad description returns all vertices of the truss identified
by tid located in ql.
TQ-tree. A TQ-tree is a quadtree indexing all divided cells, in which the divided
cells are the leaf quads. Each non-leaf quad, in addition to its spatial quaternary
information, is attached a list of truss-to-quad descriptions. Similarly, each leaf
quad, in addition to its bounding quad information, also contains a list of
truss-vertex-to-leaf-quad descriptions.
For example, the TQ-tree for the dataset displayed in Figure 3-1 is shown in
Figure 3-3 (we only show the partial quadtree for the 4-truss). The ids for all
trusses, together with the union of the vertex-to-truss descriptions for the
vertices in t1.1, are shown in Table 3.4. The description files for the branch q1
to q1.2.3.4 are displayed in Table 3.5.
Retrieving vertices from a region. In this section, we show that, given T and a
query region r, the running time of retrieving the vertices of T in r is bounded by
O(|Vr|) using the TQ-tree, where |Vr| is the number of vertices of T in r. The idea
is as follows: we use the boundaries provided by r and the description files attached
to the TQ-tree to explore a limited number of branches, and obtain the content quads
via depth-first traversal. More specifically, we start from the root of the TQ-tree
and collect all quads such that (1) they contain vertices of T, and (2) they lie
within the spatial space covered by r. Given a quad q in the TQ-tree, checking
whether a child of q is within the boundaries of r and whether the child contains T
can each be done in constant time, so reaching a vertex of T in r depends only on the
height of the tree, which is fixed once the TQ-tree is created. Therefore, the
running time of retrieving the content quads is proportional to |Vr|.
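The traversal can be sketched as follows; the dict-based quad layout (`box`, `children`, `truss_quads`, `truss_vertices`) is a hypothetical stand-in for the TQ-tree node structure described above.

```python
def intersects(a, b):
    """Overlap test for axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def collect(quad, region, tid, out):
    """Depth-first TQ-tree walk.  Only branches that intersect `region`
    AND (per the truss-to-quad descriptions) contain vertices of truss
    `tid` are explored, so the cost is proportional to the number of
    matching vertices plus the tree height."""
    if not intersects(quad['box'], region):
        return
    if not quad['children']:                      # leaf quad
        out.extend(quad['truss_vertices'].get(tid, []))
        return
    for i in quad['truss_quads'].get(tid, ()):    # promising children only
        collect(quad['children'][i], region, tid, out)
```

Pruning on both conditions at every level is what keeps the output-sensitive O(|Vr|) bound: a branch holding no vertex of the truss, or lying outside r, is never descended into.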
Searching granularity. Intuitively, different layers of TQ-tree divide the whole
space into cells with different sizes. When the height of TQ-tree increases, the space
is divided into cells with smaller sizes. Therefore, according to the query distance,
we may vary the search granularity, that is, we can select the minimum cells that
construct the bound regions discussed in Section 3.4.2. Given a TQ-tree with height
h, the number of possible search granularities equals h. The possible sizes of the
search granularity are {2^x · q.w | 0 ≤ x ≤ h}, where q.w is the size of the leaf
quads in the TQ-tree.
Selecting the search granularity affects the search efficiency and effectiveness. For
instance, if we select a search granularity as small as a dozen meters while the
query distance is over a thousand meters, constructing an error-bounded region from
such small leaf quads would be inefficient. On the other hand, selecting a search
Table 3.6: Implemented algorithms

Name      | Algorithm
Exact     | Algorithm 1 + pre-pruning
EffiExact | Algorithm 2 + pre-pruning
AdvMax    | Algorithm in [109] for searching a maximum (k,r)-core
AdvMaxAll | Adapted AdvMax to find all maximum (k,r)-cores
KRM       | Adapted AdvMax to find all (k,d)-MCCs + pre-pruning
Apx1      | Algorithm 3 with α = 2√2 + ε and index
Apx2      | Algorithm 3 with α = √2 + ε′ and index
Apx1Ini   | Apx1 without index
Apx2Ini   | Apx2 without index
SAC       | Algorithm Exact+ in [34]
GeoModu   | Algorithm in [17]
Table 3.7: Statistic information in datasets

Dataset             | #vertices | #edges     | #checkins  | kmax
Gowalla [23]        | 196,591   | 950,327    | 6,442,890  | 29
Brightkite [23]     | 58,228    | 214,078    | 4,491,143  | 43
Foursquare [88, 64] | 4,899,219 | 28,484,755 | 1,021,970  | 16
Weibo [65]          | 1,019,055 | 32,981,833 | 32,981,833 | 11
Twitter [65]        | 554,372   | 2,402,720  | 554,372    | 16
granularity much larger than the query distance gets results very quickly, but the
error ratio would be large, i.e., the effectiveness would be low. We discuss choices
of search granularity that balance search efficiency and effectiveness in the
experimental studies.
3.5 Experimental Results
In this section, we test all algorithms in Table 3.6 on a Mac with an Intel
i7-4870HQ (3.7GHz) CPU and 32GB of main memory.
Datasets. We conducted the experiments over five real social network datasets:
Gowalla, Brightkite, Foursquare, Weibo and Twitter. Each social user has some
check-in locations. Table 3.7 presents the statistics of all datasets. Since we only
need one check-in per vertex, we select the most frequent check-in as the spatial
coordinate of a vertex that has multiple check-ins.
Table 3.8: Parameter settings

Parameter | Range                                        | Default value
k         | Gowalla, Brightkite: 5, 7, 9, 11, 13, 15, 17 | 11
k         | Weibo: 5, 6, 7, 8, 9, 10, 11                 | 9
k         | Twitter, Foursquare: 3, 5, 7, 9, 11, 13, 15  | 11
d         | 500, 1000, 1500, 2000, 2500, 3000            | 2000
q.w       | 100, 200, 400, 800, 1600                     | 400
n         | 20%, 40%, 60%, 80%, 100%                     | 100%
Table 3.9: TQ-tree construction

Dataset    | Time (Sec) | Space (MB)
Gowalla    | 101        | 31
Brightkite | 75         | 17
Foursquare | 5102       | 1680
Weibo      | 4812       | 1423
Twitter    | 473        | 152
Parameter settings. The experiments are evaluated using different settings of query
parameters: k (the minimum truss number) and d (the distance threshold, in meters)
as well as different settings of dataset parameters: q.w (the search granularity) and
n (the percentage of vertices). The ranges of parameters and their default values are
shown in Table 3.8, in which we select reasonable k based on datasets. Furthermore,
when we vary the value of a parameter for evaluation, all the other parameters are
set as their default values.
Index construction. Space division: the width of minimum cells in TQ-tree for the
whole space is set to be 100 meters. The index construction time and size for each
dataset are displayed in Table 3.9.
3.5.1 Efficiency Evaluation
Scalability. To verify the scalability of our algorithms, AdvMaxAll (adapted Ad-
vMax [109] for searching all maximum (k′, r)-cores) and KRM (searching all (k,d)-
MCCs), we choose different sizes of sub-datasets by selecting different percentages
of vertices in each dataset. For AdvMaxAll, we set k′ as k-1, where k is the default
value for the corresponding dataset. We implemented KRM by adapting the problem
Figure 3-4: Scalability. (a) Gowalla; (b) Brightkite; (c) Foursquare; (d) Twitter; (e) Weibo. (Running time vs. percentage of vertices for Exact, EffiExact, Apx1, Apx2, KRM and AdvMaxAll.)
setting of AdvMax from k-core to k-truss. From the results in Figures 3-4 (a) to (e),
we can see that the exact algorithms run much slower when the data size is equal
to or larger than 80%. For the approximate algorithms, however, the time costs
increase almost linearly with the data size on all datasets. On average, our
EffiExact outperforms KRM by 30%, and AdvMaxAll outperforms EffiExact by only 10%.
Surprisingly, AdvMaxAll does not outperform EffiExact significantly. This is mainly
because the set of candidate vertices for searching (k,r)-cores is larger than that
for searching (k,d)-MCCs on all real datasets. In the following experiments, we
focus only on our algorithms (i.e., excluding KRM and AdvMaxAll).
Effect of k. Figures 3-5 (a) to (e) evaluate the performance of the algorithms when
we vary the value of k. In general, all algorithms take less time as k increases,
because increasing k decreases the sizes of the k-trusses. EffiExact runs
consistently faster than Exact, especially when k is large. In addition, the study
shows that the approximate algorithms greatly outperform the exact algorithms: the
performance is improved by two orders of magnitude on average over all datasets.
Compared with Apx2, Apx1 is much faster due to its looser accuracy guarantee.
Although Apx2 is slower, it provides more effective results, which will be discussed
in Section 3.5.2.
Effect of d. Figures 3-6 (a) to (e) show the time cost when we vary the distance d
from 500 to 3000. As d increases, the time cost of the exact algorithms grows
exponentially, because increasing d requires the algorithms to explore a larger
spatial neighbourhood network. The experimental results also confirm our theoretical
analysis that the time complexities of the exact algorithms are exponential in the
size of the spatial neighbourhood network. In our experiments, EffiExact is faster
than Exact by 3 to 5 times. Unlike the exact algorithms, the time cost of the
approximate algorithms increases slowly on all datasets. In most cases, the
approximate algorithms answer a (k,d)-MCC search within 10 seconds, which supports
real-time search; only on Foursquare does a (k,d)-MCC search take about 40 seconds.
Effect of granularity. Figures 3-7 (a) to (e) demonstrate the time cost when we vary
Figure 3-5: Effect of k. (a) Gowalla; (b) Brightkite; (c) Foursquare; (d) Twitter; (e) Weibo. (Running time vs. k for Exact, EffiExact, Apx1 and Apx2.)
Figure 3-6: Effect of d. (a) Gowalla; (b) Brightkite; (c) Foursquare; (d) Twitter; (e) Weibo. (Running time vs. d for Exact, EffiExact, Apx1 and Apx2.)
Figure 3-7: Effect of search granularity. (a) Gowalla; (b) Brightkite; (c) Foursquare; (d) Twitter; (e) Weibo. (Running time vs. search granularity for Apx1, Apx2, Apx1Ini and Apx2Ini.)
Figure 3-8: Exact algorithm pruning effectiveness. (a) Varying k; (b) Varying d. (Vertices filtering ratio for Gowalla, Brightkite and Foursquare.)
the search granularity. To show the power of the index, we also implemented Apx1 and
Apx2 without index support, denoted Apx1Ini and Apx2Ini respectively. Overall, the
time cost of both algorithms decreases as the search granularity increases, because
the space is divided into fewer cells under a larger search granularity, and the
time complexity is proportional to the number of cells. Notably, Apx2Ini is very
sensitive to the granularity, since its time complexity is proportional to ζ⁴.
3.5.2 Effectiveness Evaluation
3.5.2.1 Exact Algorithms
We present the trends of the pre-pruning effectiveness in the exact algorithms when
the parameters k and d vary. To show the effectiveness, we introduce two metrics below.

Metrics. Let T be the union of maximal connected k-trusses in G, and let G′ be the
graph after pruning. The vertex pruning ratio is measured by |V(G′)|/|V(T)|. Let t1
and t2 be the running times of an algorithm with and without applying the pruning
rules, respectively. The time saved ratio is defined as (t2 − t1)/t2.
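The two metrics are simple ratios; a sketch, with the formulas taken as defined above:

```python
def vertex_pruning_ratio(v_after, v_truss):
    # |V(G')| / |V(T)| as defined in the text
    return v_after / v_truss

def time_saved_ratio(t_pruned, t_plain):
    # (t2 - t1) / t2, where t1 applies the pruning rules and t2 does not
    return (t_plain - t_pruned) / t_plain
```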
Effect of k. Figure 3-8 (a) reports the vertex pruning ratios as we change the k value.
For datasets Gowalla and Brightkite, their pruning effectiveness becomes higher as
k value increases. Interestingly, for the datasets Weibo, Foursquare and Twitter,
their pruning effectiveness becomes higher at the beginning and then decreases when
Figure 3-9: Effectiveness of pruning rules. (a) Pruning rules 1 and 2; (b) Pruning rules 3 and 4. (Pruning ratios on the left scale, time saved ratios on the right scale, for Gowalla, Brightkite, Weibo, Twitter and Foursquare.)
the value of k increases further. The main reason is that, in the three datasets,
their vertices have good spatial and social distributions, i.e., the vertices with higher
social cohesiveness also tend to have spatial closeness with each other. Therefore, the
pruning effectiveness becomes less significant when k is high. Actually, similar trends
occur in Gowalla and Brightkite if we further increase the value of k.
Effect of d. Figure 3-8 (b) shows the vertex pruning effectiveness when we change
the value of d. For all datasets, the pruning effectiveness decreases as d increases,
because a larger d relaxes the spatial cohesiveness constraint, so the explored
spatial network becomes larger and fewer vertices are filtered. However, when d is
in the interval of 1500 to 2000 meters, our pruning technique prunes more than 50%
of the vertices on average, which makes EffiExact run much faster than Exact in all
configurations.
Effect of pruning rules. Figure 3-9 (a) reports the vertex pruning ratios with the
left scale and time saved ratios with the right scale for pruning vertices (P1) and
edges (P2) in T ′, and pruning vertices in T (Truss) individually, or applying these
rules interchangeably by cascading pruning (CAS). The coloured bars correspond to
pruning ratios while the bars with hatches correspond to time saved ratios when
applying pruning rules in EffiExact. For all datasets, it shows that applying rules
individually has limited pruning effectiveness with less than 15% vertices pruned, and
less than 12% of time saved by P1(P1TS) and P2(P2TS), though a bit improved by
Truss. However, applying these rules interchangeably can prune much more vertices
59
and save much more time. For Gowalla and Weibo, over 60% of vertices can be
filtered out and more than 50% of time can be saved (shown by CAS(CASTS)).
3.5.2.2 Approximate Algorithms
Region pruning ratio. The region pruning ratio is defined as the ratio of the number
of regions pruned over the total number of candidate regions.
Figure 3-9 (b) shows the region pruning ratios on the left scale (time saved ratios
on the right scale) when applying Pruning Rules 3 and 4. The bars P31 (P31TS) and
P32 (P32TS) show the pruning ratios (time saved ratios) when applying Pruning Rule 3
in Apx1 and Apx2. On all datasets, Pruning Rule 3 is more effective when pruning
tight bound regions than when pruning upper bound regions. Moreover, the pruning
ratio of Pruning Rule 4 (shown by P4) outperforms Rule 3 when pruning tight bound
regions on all datasets, because Rule 4 considers truss residence regions.
Figure 3-10 (a) shows the upper bound region pruning ratios when varying the search
granularity for algorithm Apx1. Overall, the pruning ratio decreases as the search
granularity increases for all datasets, because a smaller search granularity makes
the regions more compact, i.e., regions consisting of small cells are closer to the
minimum bounding box. The pruning ratio for Foursquare is slightly worse than the
others, but the average pruning ratio is still over 0.5, i.e., more than 50% of the
regions are pruned by the proposed pruning techniques. Figure 3-10 (b) shows the
tight bound region pruning ratios when varying the search granularity for algorithm
Apx2; the trend is similar to that of upper bound region pruning.
Effect of granularity on approximation ratio. We demonstrate the correlation
between the theoretical and actual approximation ratios when we run the algorithms
Apx1 and Apx2 over all datasets, and show the trend when the search granularity
changes. They are plotted in Figures 3-11 (a) and (b). The x-axis in the figures
represents the calculated theoretical approximation ratios according to search gran-
ularity. For Apx1, the theoretical approximation ratios are calculated as 2.9, 2.97,
3.11, 3.4, and 3.96. For Apx2, the theoretical approximation ratios are calculated
[Figure: region filtering ratio (y-axis) vs. search granularity in metres (x-axis, 100 to 1600) for Gowalla, Brightkite and Foursquare; panel (a) Apx1, panel (b) Apx2]
Figure 3-10: Region pruning ratio
[Figure: actual approximation ratio (y-axis) vs. theoretical approximation ratio (x-axis) for Gowalla, Brightkite and Foursquare; panel (a) Apx1, panel (b) Apx2]
Figure 3-11: Approximation ratio
as 1.48, 1.56, 1.70, 1.98, and 2.55. From the experimental results, we can see that
the real approximation ratios of Apx1 and Apx2 are much smaller than theoretical
approximation ratios. For both algorithms, their real approximation ratios increase
as the search granularity increases. The approximation ratio of Apx1 increases almost
linearly as the theoretical approximation ratio increases. For Apx2, we can observe a
sudden increase for all datasets once the search granularity goes beyond 400 metres. Within
400 metres, the actual approximation ratios are all less than 1.4. Therefore,
in real applications we may need to tune the search granularity to balance search
efficiency and effectiveness.
[Figure: normalised spatial density (y-axis) on Gowalla, Brightkite, Foursquare, Weibo and Twitter, comparing Exact, Apx1, Apx2, AdvMax, SAC and GeoModu]
Figure 3-12: Density study
3.5.2.3 Spatial Density
We verify the spatial closeness of the maximum co-located community found by our
algorithms Exact, Apx1 and Apx2 by comparing with the state-of-the-art spatial-aware
community models: SAC [34], GeoModu [17] and AdvMax [109].
SAC. It finds a k-core G′ containing a query vertex, whose vertices are spatially
contained in a minimum covering circle with the smallest radius.
GeoModu. It refines the weight of each edge e_{u,v} in graph G as 1/d_{u,v}^η, where
d_{u,v} is the normalised spatial distance from u to v and η is a decay factor. It then detects
the communities using modularity maximisation.
AdvMax. It finds the maximum (k, r)-core, which is a k-core whose vertices are pairwise
within spatial distance r, maximising the cardinality.
We refine the structural cohesiveness k-core in SAC using k-truss. For GeoModu,
we set η = 1 and select the community with the highest spatial density, defined as
follows.
Spatial density. Given a community J with spatial diameter d, the spatial density
of J is defined as (∑_{u,v∈V(J)} e^{−d(u,v)}) / (2|V(J)|).
For SAC, we randomly select 200 query vertices, set up the structural cohesiveness
as 11-truss, generate exact communities, compute spatial density and get the average
spatial density for each dataset. For GeoModu, we set η=1, detect all communities
and compute the average spatial density for each dataset. For Exact, Apx1, Apx2
(a) (4,1.5km)-MCC (b) (9,5km)-MCC
Figure 3-13: Case study
and AdvMax, we set k=11 (k=10 for AdvMax), randomly select 200 different query
distances between 500 and 2000 metres, generate maximum co-located communities,
compute spatial density and get the average spatial density for each dataset. All
results are normalised by (D − D_min)/(D_max − D_min), where D is an average spatial density and D_max and
D_min are the extremes.
Figure 3-12 shows that Exact performs the best and outperforms AdvMax due to
its structural tightness. AdvMax also performs reasonably well because both Exact
and AdvMax tend to enlarge the cardinality as much as possible for the given distance
threshold. As expected, the approximate algorithms Apx1 and Apx2 perform worse than
Exact but remain acceptable. As shown in the figure, Apx2 performs better than
AdvMax on all datasets except Weibo. Apx2 also performs better than Apx1 on
all datasets due to its lower error ratio. Compared with the above algorithms, SAC and
GeoModu perform much worse, mainly because they do not intend to include as many
vertices as possible. In particular, GeoModu produces the communities with the lowest spatial density
on all datasets and thus appears as 0 after normalisation.
3.5.2.4 Case Study
We conducted two case studies on Gowalla to demonstrate the effectiveness of (k,d)-
MCC. In contrast to connected k-truss, our models can ensure the spatial closeness
over community members.
Figure 3-13 (a) shows a community with k=4 and d=1.5km. All the members
are around Gothenburg University in Sweden. The community members are good
candidates for some meetup activities since (1) they have strong social relationships,
i.e., each member has 3 friends in the community and members who are not friends
are connected by their mutual friends; and (2) the longest distance between them is
bounded by 1.5km.
Figure 3-13 (b) shows a community with k=9 and d=5km. We can see that some
members around downtown Austin are very close to each other, while some members
on the outskirts are relatively distant from members in the downtown area. Removing
any of these relatively distant members makes the community collapse from a social
cohesiveness perspective. Although the query has d = 5km, the actual distance
between most members is less than 3.3km.
3.6 Conclusion
In this chapter, we study the maximum co-located community search problem in
large scale social networks. We propose a novel community model, co-located com-
munity, considering both social and spatial cohesiveness. We develop efficient exact
algorithms to find all maximum co-located communities. We design approximation
algorithms with guaranteed spatial error ratios. We further improve the performance
using the proposed TQ-tree index. We conduct extensive experiments on large real-world
networks, and the results demonstrate the effectiveness and efficiency of the proposed
algorithms.
Chapter 4
Contextual Community Search
In this chapter, we propose a novel parameter-free community model called contextual
community, for searching a community in an attributed graph. The proposed model
only requires a query context providing a set of keywords describing the desired com-
munity context, where the returned community is designed to be both structure and
attribute cohesive w.r.t. the query context. We show that both exact and approxi-
mate contextual community can be searched in polynomial time. The best proposed
exact algorithm bounds the running time by a cubic factor or better using an elegant
parametric maximum flow technique. The proposed 1/3-approximation algorithm
significantly improves the search efficiency. In the experiment, we use six real networks
with ground-truth communities to evaluate the effectiveness of our contextual com-
munity model. Experimental results demonstrate that the proposed model can find
near ground-truth communities. We test both our exact and approximate algorithms
using twelve large real networks to demonstrate the high efficiency of the proposed
algorithms.
Chapter map. In Section 4.1, we give an overall introduction to the problem of
contextual community search. Section 4.2 presents the contextual community (CC)
model and CC search problem. Sections 4.3, 4.4 and 4.5 present two network flow
based exact algorithms that are designed to solve CC search in polynomial time. Section 4.6 presents an approximate solution (with an approximation ratio of 1/3) that significantly improves over the runtimes of the exact algorithms.
Experimental results are shown in Section 4.8, followed by the chapter summary in
Section 4.9.
4.1 Introduction
We propose a novel parameter-free community model, namely the contextual community, that only requires a query to provide a set of keywords describing an application/user context (e.g. location and preference). Users can thus search for desired
communities without detailed information about them, in contrast to existing
community search models, which additionally require a set of known query vertices as well as
community cohesiveness parameters (e.g. k as the minimum vertex degree).
Still, the returned contextual community is designed to be both structure
and context cohesive.
Given the query context, the most popular cohesive measurements like k-core and
k-truss are often unsuitable. On one hand, there exists an inherent contradiction that
a larger k value may imply a smaller community to be found that is potentially in-
sufficient to match the provided context. On the other hand, when the context match
is emphasised, vertices (edges) may very likely fail to meet the minimum requirement
on the number of neighbours (common neighbours) of k-core (k-truss). More-
over, imposing the k bound on the community can be deemed to be inflexible and
deciding the best k that satisfies both context and structure requirements is chal-
lenging. Therefore, for seeking a proper contextual community we instead adopt the
notion of relative subgraph density, which is parameter-free and relates the number
of considered motifs/units to the number of their induced vertices. The search goal
is then reduced to finding a subgraph having the highest density. However, as shown
in [97] if the considered motifs or community signatures are simply edges, the found
community might be large and not absolutely cohesive. On the other hand, [97] shows
that triangle motifs would be better signatures to produce a truly cohesive subgraph,
but in this case size of the returned community might drop dramatically, thereby
affecting the desirable vertex coverage. To alleviate such shortcomings, we instead
account for both involved triangle and edge motifs as a unified density measure to
suitably explain the structure cohesiveness of a contextual community.
As discussed previously, in real applications, it would be desirable that, by sim-
ply accepting a set of keywords about the desired community context, a community
search system is able to find a community that is highly relevant to the provided con-
text. Intuitively, this means that vertices with attributes close to the context shall
be considered as members of the desired community in an attributed graph. However, overemphasising the correlation between found vertex attributes and the query
context may cause the search to return a small and loosely connected subgraph.
This actually goes against the structural requirement of being a community and becomes another popular research topic, keyword search [9, 47, 57, 66, 81, 58]. To avoid
such a pessimistic situation, we can relax context cohesiveness by tolerating community vertices that are themselves less relevant to the query context but whose
surrounding vertices are more relevant. As shown in Section 4.2.2, this relaxation
is naturally achieved with triangle and edge (the aforementioned subgraph density
motif) contextual scores/weights aggregated from nodal contextual relevance. Notice
that our weighted motif (triangle and edge) measure ensures relaxed but strong in-
ternal context cohesiveness in a community since all the involved units are matched
against the query context.
Contextual community. Based on the desiderata of contextual community search,
we propose a weighted density based contextual community model. First, a contex-
tual weight is assigned to each small motif of a subgraph. It measures the context
prevalence of a motif. Then, the context weighted density of a subgraph is calculated
as the aggregated weight over all motifs divided by the total number of vertices
in the subgraph. Finally, the subgraph with the highest contextual weighted density
w.r.t. the query context is returned as the best community. Notice that the intuition
behind our contextual community model is: every designated community member
should be involved in many structurally overlapped edge and triangle motifs which
are themselves prevalent in the specified query context. In real life, these units are
analogous to mutual friendships and family circles.
Although our proposed community model is based on weighted densest subgraph,
existing exact and approximation algorithms running in polynomial time only work
on separate density functions, i.e. weighted/unweighted degree density function or
unweighted triangle density function. For our more complicated contextual commu-
nity search, building on the theoretical frameworks of flow networks and approximation
algorithms, we confirm that efficient algorithms indeed exist, both in theory and in
practice.
More precisely, given a graph G and a set of query attributes Q, our first approach
carefully constructs a flow network N that guides the community search. Together
with binary search probing, the approach in total runs in time O(|V (N)|3 log(|V (G)|))
where V (.) and E(.) define vertex and edge sets respectively and N is a constructed
flow network. By formulating the contextual community search as an optimisa-
tion problem, we then construct a different flow network N with parameters help-
ing a monotonic search of contextual community. Along this second approach, we
manage to avoid a pitfall implementation running in O(|V(G)||V(N)|³) with an elegant parametric maximum flow technique. This technique eventually turns the
runtime into O(|V(N)|³) or better. Note that the aforementioned runtime com-
plexities are worst-case guarantees while in practice they are also very much reduced
with the consideration of query context. To achieve even extra runtime scalability,
we also propose a fast 1/3-approximation algorithm. The algorithm can run in time
O(|V(G)| log(|V(G)|) + |E(G)| log(|V(G)|) + |Tri(G)|) with simple degree and triangle
indices.
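To give a feel for the approximation side, the following is a minimal sketch of greedy peeling in the spirit of Charikar's densest-subgraph heuristic, here applied to edge weights only. This is an illustration rather than the thesis's 1/3-approximation algorithm, which additionally accounts for triangle weights and uses degree and triangle indices; the function name and data layout are our own.

```python
# A sketch of greedy peeling: repeatedly remove the vertex with the smallest
# incident (context) weight and remember the intermediate vertex set with the
# best weighted edge density. Function names and data layout are ours.
def greedy_peel(vertices, edges, w):
    """edges: set of frozenset pairs; w: edge -> non-negative context weight."""
    adj = {v: set() for v in vertices}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)

    def incident(x, alive):
        return sum(w[frozenset((x, u))] for u in adj[x] if u in alive)

    alive = set(vertices)
    total = sum(w[e] for e in edges)           # total weight of surviving edges
    best_set, best_density = set(alive), total / len(alive)
    while len(alive) > 1:
        v = min(alive, key=lambda x: incident(x, alive))
        total -= incident(v, alive)
        alive.remove(v)
        density = total / len(alive)
        if density > best_density:
            best_set, best_density = set(alive), density
    return best_set, best_density
```

For the plain weighted edge-density objective, this style of peeling is known to give a 1/2-approximation; the thesis's combined edge-and-triangle objective requires a more careful analysis, yielding the 1/3 bound mentioned above.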
4.2 Problem Definition
4.2.1 Preliminary
Graph with attributes. An attributed graph is denoted as G = (V,E,A), where
V(G), E(G) and A denote the set of vertices in G, the set of edges in G, and the set of
attributes, respectively. Each vertex v ∈ V(G) is attached with a set of attributes A(v) ⊆ A.
[Figure: (a) an attributed graph of 14 vertices, each labelled with an attribute set drawn from {k1, k2, k3}; (b) the contextual community found for Q = {k1, k2, k3}; (c) the community found by the triangle density model; (d) the community found by the degree density model]
Figure 4-1: Example
Given v ∈ V (G), deg(v,G) denotes the degree of v in G and N(v,G) denotes the
neighbours of v in G.
Triangles in graphs. A triangle in G is a cycle of length 3. A triangle induced on
vertices u, v, w ∈ V(G) is denoted as △_{uvw}; when these vertices are not specified
we omit the subscript. Given a subgraph H ⊆ G, Tri(H) denotes the set of triangles
in H.
Degree density. Given a subgraph H ⊆ G, the degree density of H is defined as
ρ(H) = |E(H)| / |V(H)|. We call it degree density because |E(H)| = (∑_{v∈V(H)} deg(v,H)) / 2.

Triangle density. Given a subgraph H ⊆ G, the triangle density of H is defined as
ρ△(H) = |Tri(H)| / |V(H)|.
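The two density measures can be sketched in a few lines of Python. The naive triangle enumeration below is only intended for small illustrative graphs; the function names are our own.

```python
from itertools import combinations

def degree_density(vertices, edges):
    """rho(H) = |E(H)| / |V(H)|."""
    return len(edges) / len(vertices)

def triangles(vertices, edges):
    """Naively enumerate triangles: every vertex triple whose three pairs
    are all edges. Fine for small examples, cubic in |V| in general."""
    eset = {frozenset(e) for e in edges}
    return [t for t in combinations(sorted(vertices), 3)
            if all(frozenset(p) in eset for p in combinations(t, 2))]

def triangle_density(vertices, edges):
    """rho_triangle(H) = |Tri(H)| / |V(H)|."""
    return len(triangles(vertices, edges)) / len(vertices)
```

For example, the complete graph K4 has 6 edges and 4 triangles on 4 vertices, giving degree density 1.5 and triangle density 1.0.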
4.2.2 Problem definition
As discussed previously, the found CC shall be large, cohesive and relevant to the query
context. We address these perspectives by proposing a context weighted density
function.
Our CC model is inspired by observations of real communities. A community in
real world consists of small units like family circles and mutual relationships. Overlapping small units together form a bigger community, where the overlaps can be both
structural and contextual/attribute-based. Edge and triangle motifs in a social network are employed as our basic units. To capture both kinds of overlap and ensure that the
obtained community satisfies a given query context, these basic units are assigned
with context scores measuring their prevalence of the query context. A group of users is
considered a contextual community if, on average, its members are involved in many basic
units that are rich in the given query context.
In the following, we formally define the CC model and search problem starting
from defining context scores of the basic motifs, i.e. scores of edge and triangle motifs
in a subgraph H ⊆ G.
Definition 4.2.1. Context score. Given motifs (u, v) ∈ E(H) and (u, v, w) ∈ Tri(H),
and a set of attributes Q as the query context, the context score functions are as follows.

• Edge context score: w(e(u, v)) = |Q ∩ A(u)| + |Q ∩ A(v)|

• Triangle context score: w((u, v, w)) = ∑_{e ∈ {(u,v),(u,w),(v,w)}} w(e)
Intuitively, the score should reward motifs that in union cover more of the query context
and that contain vertices carrying more of the query context. The defined context score
satisfies both intuitions. For the first intuition, an example is as follows. Given query
Q = {k1, k2, k3}, triangle (2, 8, 9) in Figure 4-1 (a) is superior over triangle (3, 4, 11)
since all query contexts are covered by (2, 8, 9), which can be differentiated by the
defined context score. On the other hand, an example for the second intuition is:
given the same query Q, the defined score also rewards triangle {2, 8, 9} a higher
score compared to triangle {8, 12, 14}, which makes sense since all vertices in {2, 8, 9}
cover all attributes whereas in {8, 12, 14} only vertex 8 covers all attributes. An
alternative score function, |(∪_{v∈motif} A(v)) ∩ Q|, only encourages motifs covering more of the
query context but fails to reward motifs containing more vertices that have query
context; it cannot differentiate triangles such as {2, 8, 9} and {8, 12, 14}.
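Definition 4.2.1 is straightforward to compute; the following is a minimal sketch, where the function names and the attribute dictionary layout are our own.

```python
def edge_score(attrs, Q, u, v):
    """Edge context score: w(e(u, v)) = |Q ∩ A(u)| + |Q ∩ A(v)|."""
    return len(Q & attrs[u]) + len(Q & attrs[v])

def triangle_score(attrs, Q, u, v, w):
    """Triangle context score: the sum of the scores of its three edges."""
    return (edge_score(attrs, Q, u, v)
            + edge_score(attrs, Q, u, w)
            + edge_score(attrs, Q, v, w))
```

Since each vertex of a triangle appears in two of its edges, the triangle score equals twice the sum of the per-vertex matches |Q ∩ A(·)|, which is exactly why a triangle of fully matching vertices outscores one with a single matching vertex.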
Next, we define the context weighted density of a subgraph H ⊆ G based on context
scores.
Definition 4.2.2. Context weighted density. Given a set of query attributes, the
attribute score function, and H ⊆ G with Tri(H) containing the set of triangles in H,
the context weighted density of H, denoted by AD(H), is defined as:

AD(H) = ( ∑_{△∈Tri(H)} w(△) + ∑_{e∈E(H)} w(e) ) / |V(H)|.    (4.1)
The context weighted density mixes weighted triangle density and weighted edge
density. Its advantages are twofold:
Size versus cohesiveness. Given a subgraph H, the density function considers both
edges and triangles in H. If edges are considered only, it is likely to fail to detect
cohesive communities according to the observations in [97]. If triangles are considered
only, we may find very dense subgraphs while missing vertices that are only involved
in edges, or vertices that are involved in a relatively small number of triangles, which
in consequence may penalise the size of the found H. This may contradict the
purpose of community search.
High adaptability. Given a query context, vertices that are related to the query con-
text may be loosely or intensively connected. An ideal cohesive measurement for CC
shall be adaptable to these situations. When vertices are loosely connected, edge
density part of the context weighted density will favour more related edges being
included so that density AD(H) can increase. When vertices are intensively con-
nected, the triangle score part will help to incorporate the dense parts and ensure the
effectiveness.
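Putting the pieces together, Eq. (4.1) can be evaluated naively as follows. This is a brute-force sketch meant only for small examples; the function names and data layout are our own.

```python
from itertools import combinations

def context_weighted_density(vertices, edges, attrs, Q):
    """AD(H) from Eq. (4.1): (sum of triangle scores + sum of edge scores)
    divided by |V(H)|. Triangle enumeration is brute force."""
    score = lambda u, v: len(Q & attrs[u]) + len(Q & attrs[v])
    eset = {frozenset(e) for e in edges}
    edge_part = sum(score(*tuple(e)) for e in eset)
    tri_part = 0
    for u, v, w in combinations(sorted(vertices), 3):
        if all(frozenset(p) in eset for p in combinations((u, v, w), 2)):
            # triangle score is the sum of its three edge scores
            tri_part += score(u, v) + score(u, w) + score(v, w)
    return (tri_part + edge_part) / len(vertices)
```

For a single triangle whose three vertices each carry the one queried attribute, every edge scores 2, the triangle scores 6, and AD = (6 + 6) / 3 = 4.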
Problem 4.2.1. CC search. Given a social network graph G and a set of query
attributes Q, return a CC modelled as a subgraph H ⊆ G satisfying:
• H is connected.
• There is no connected H ′ ⊆ G such that AD(H ′) > AD(H).
Example. An attributed graph is shown in Figure 4-1 (a), which is adopted from [51].
Given Q = {k1, k2, k3}, applying CC search we get the CC shown in Figure 4-1 (b).
The vertices containing all query contexts are included in the found CC. Compared
to CC, if using weighted triangle density only, we get a smaller community, as shown
in Figure 4-1 (c), missing vertex 9 that covers all query attributes. On the other hand,
if using degree density only, the found community is shown in Figure 4-1 (d), which
includes vertex 3 that covers only one query attribute.
4.2.3 Why contextual community search
Compared to existing works, our CC model and search are designed to be a general
framework with several advantages, as follows.
• CC employs subgraph density as a parameter-free cohesiveness measure, which
avoids specifying k as in k-core, k-truss etc., and is thus easy to use.

• CC balances the cohesiveness of the found community by considering both triangles and edges. Triangles reflect the denser nature of the community, similar to
k-truss, while edges allow flexibility, somewhat similar to k-core.

• CC search avoids certain bad query effects [53], i.e., the search returning an empty set
or a very loosely connected subgraph because of inappropriate query input. That
is because CC simultaneously models structural and contextual cohesiveness.

• CC search indeed finds near ground-truth communities, as shown in the experimental studies, if the user-provided query context is close to the attributes contained in
a ground-truth community.
4.2.4 Pre-prunings
Before describing the main CC search algorithms, we first show a simple yet effective
pruning rule, which helps us quickly exclude vertices that are irrelevant to CC.
Pruning Rule 4.2.1. Given a vertex v ∈ V(G) and a query context Q, v can
be pruned if the following two conditions are satisfied simultaneously.

• A(v) ∩ Q = ∅.

• For each u ∈ N(v,G), A(u) ∩ Q = ∅.
The correctness of Pruning Rule 4.2.1 is immediate: given any H, removing
vertices conforming to the pruning rule will not decrease the numerator of AD(H)
but will reduce its denominator.
Applying the pruning rule, we can easily filter out irrelevant vertices and, at the same
time, compute the contextual scores. As a result, the input graph may be divided into
disjoint subgraphs; nevertheless, the algorithms discussed in the following can still
be applied. Then, after the pruning, the global optimal CC can be derived much more
efficiently.
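A sketch of Pruning Rule 4.2.1 in Python follows; the function name and data layout are our own, with edges given as unordered pairs.

```python
def prune_irrelevant(vertices, edges, attrs, Q):
    """Pruning Rule 4.2.1: drop v when neither v nor any neighbour of v
    carries a query attribute; such a vertex can only lower AD of any
    subgraph containing it. Returns the kept vertices and edges."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    keep = {v for v in vertices
            if attrs[v] & Q or any(attrs[u] & Q for u in adj[v])}
    kept_edges = {(u, v) for u, v in edges if u in keep and v in keep}
    return keep, kept_edges
```

On a path 1-2-3-4 where only vertex 1 carries a query attribute, vertex 2 survives through its neighbour 1 while vertices 3 and 4 are pruned.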
4.3 A Flow Network Based Approach
In this section, we first present a polynomial time exact algorithm for finding CC with
a carefully constructed auxiliary flow network. Before introducing the algorithm, we
revisit the preliminaries about flow networks.
4.3.1 Flow Network Preliminaries
Directed flow network. The flow network considered in this chapter is a directed
graph N = (V,E) with a set of nodes V and directed edge set E, having a unique
source node s, a unique sink node t, and a non-negative capacity c(u, v) for every
directed edge (u, v). Note that we prefer using the term node when discussing about
73
flow network, and vertex in the context of social network. Following the flow net-
work convention, we extend the capacity function to arbitrary node pairs by defining
c(u, v) = 0, if (u, v) /∈ E(N), which implies f(u, v) = 0 if (u, v) /∈ E(N).
Flow. A flow f on N is a real-valued function on node pairs satisfying three constraints.
First, the capacity constraint: f(u, v) ≤ c(u, v) for every (u, v) ∈ V(N) × V(N).
Second, the anti-symmetry constraint: f(u, v) = −f(v, u) for every (u, v) ∈ V(N) × V(N).
Third, the conservation constraint: ∑_{u∈V(N)} f(u, v) = 0 for every v ∈
V(N) \ {s, t}. The value of the flow f is defined as ∑_{v∈V(N)} f(v, t). A maximum
flow is a flow of maximum value.
s-t cut. If S and T are two disjoint node subsets such that S ∪ T = V(N) and
S ∩ T = ∅, then the capacity or cut value across the cut (S, T) is c(S, T) =
∑_{u∈S, v∈T} c(u, v). (S, T) is an s-t cut if the source node s ∈ S and the sink node
t ∈ T. A minimum or min s-t cut is one with the minimum cut value c(S, T).
Algorithm 4: Binary search based algorithm

Data: G
1  low ← AD(G), Ho ← ∅;
2  high ← |Q| × (6·C(|V(N)|, 3) + 2|V(N)|²) / |V(N)|;
3  while high − low ≥ min_interval do
4      mid ← (high + low) / 2;
5      H′o ← tryOpt(N, mid);
6      if H′o ≠ ∅ then
7          Ho ← H′o;
8          low ← mid;
9      else
10         high ← mid;
11 return Ho;
Maximum flow and min s-t cut. If f is a flow of N, the flow across an s-t cut
(S, T) in N is f(S, T) = ∑_{u∈S, v∈T} f(u, v). The conservation constraint indicates that
the flow across any cut is equal to the flow value of f. The capacity constraint indicates
that for any flow f and cut (S, T), f(S, T) ≤ c(S, T) shall hold, which implies that the
value of a maximum flow is equal to the capacity of a minimum cut, i.e. the max-flow
min-cut theorem.
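For concreteness, here is a compact Edmonds-Karp sketch that returns both the max-flow value and the S side of a min s-t cut, obtained as the nodes reachable from s in the final residual graph. This is a generic textbook routine for illustration, not the min-cut solver used in the thesis.

```python
from collections import defaultdict, deque

def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp max flow. `cap` maps u -> {v: capacity}. Returns the
    max-flow value and the S side of a min s-t cut (nodes reachable from s
    in the final residual graph), by the max-flow min-cut theorem."""
    res = defaultdict(dict)                      # residual capacities
    for u in cap:
        for v, c in cap[u].items():
            res[u][v] = res[u].get(v, 0.0) + c
            res[v].setdefault(u, 0.0)            # reverse residual edge
    flow = 0.0
    while True:
        parent = {s: None}                       # BFS: shortest augmenting path
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, r in res[u].items():
                if v not in parent and r > 1e-12:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        path, v = [], t                          # recover path and bottleneck
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= aug
            res[v][u] += aug
        flow += aug
    S, q = {s}, deque([s])                       # residual reachability from s
    while q:
        u = q.popleft()
        for v, r in res[u].items():
            if v not in S and r > 1e-12:
                S.add(v)
                q.append(v)
    return flow, S
```

The returned S is exactly the part of the cut that, in the constructions below, is used to read off a candidate CC.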
4.3.2 Algorithm Framework
Intuition. We design an auxiliary flow network such that 1) part S of a min s-t cut
(S, T ) contains a candidate CC and 2) the min s-t cut can guide us how a guessed
context/attribute weighted density of CC eventually arrives at the optimal context
weighted density. With such auxiliary flow network, by solving a sequence of min s-t
cut problems, we approach the optimal density by iteratively guessing its value with
a half-interval method that is analogous to binary search. We carefully study the
stop condition and correctness of the algorithm so that the candidate CC from the
last computed min s-t cut is guaranteed to be CC.
Major steps. Algorithm 4 shows a binary search based framework for searching CC.
In each iteration (lines 4 to 11), the algorithm guesses a context weighted density,
denoted by mid, in a binary search manner, and tries to find a candidate CC with
a closer density to the optimum. Such candidate CC is found by computing the
minimum s-t cut on the flow network to be introduced later.
Function tryOpt. This function takes the carefully designed flow network N and
a newly guessed attribute-weighted density mid as parameters. It first updates the
flow network so that the previously guessed context weighted density is replaced by
mid. The update operation will be discussed in great detail later. The function
then calculates the min s-t cut for the updated flow network, generates a candidate
CC from nodes in part S, and returns a new candidate CC H′o, which is the S-induced
subgraph of G. The binary search proceeds depending on whether H′o is empty (we will
prove its correctness later). At this moment, we treat the min s-t cut solver as a black
box since we are not modifying it. We will revisit the min cut algorithm for our improved
approaches later.
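The control flow of Algorithm 4 can be sketched as follows. Here tryOpt is replaced by a toy stand-in that searches a fixed set of candidates with known densities, whereas the real function updates the flow network N and computes a min s-t cut; the names and layout are our own.

```python
def binary_search_cc(candidates, densities, min_interval=1e-6):
    """Skeleton of Algorithm 4 with a toy tryOpt: it returns the densest
    candidate whose density exceeds the guess `mid`, or None. In the thesis,
    tryOpt instead updates the flow network N and computes a min s-t cut."""
    def try_opt(mid):
        above = [c for c in candidates if densities[c] > mid]
        return max(above, key=lambda c: densities[c]) if above else None

    low, high = min(densities.values()), max(densities.values()) + 1.0
    best = None
    while high - low >= min_interval:
        mid = (high + low) / 2
        h = try_opt(mid)
        if h is not None:
            best, low = h, mid     # a better candidate exists: raise the guess
        else:
            high = mid             # guess too high: lower it
    return best
```

The interval halves each iteration, so the loop terminates after O(log((high − low)/min_interval)) calls to tryOpt, mirroring the binary-search bound used in the complexity analysis.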
Similar ideas to ours have been applied in graph mining [42, 97, 61] for finding
degree densest subgraph, triangle densest subgraph, directed densest subgraph etc.
However, adopting flow theory for CC search is non-trivial, where the challenges are
1) designing an appropriate flow network that satisfies the desired intuition of CC and
binary search (correctness), and 2) bounding the algorithm's runtime complexity
[Figure: (a) a sample network of six vertices labelled with attribute sets drawn from {k1, k2, k3}; (b) the degree-based flow network with a dashed line marking the cut S = {s, 1, 3, 4, 6}, T = {t, 2, 5}; (c) the triangle-based flow network with black and red dashed lines marking candidate cuts]
Figure 4-2: Warm-up flow network illustrations
according to CC definitions (efficiency). In the following subsections, we present the
proposed auxiliary flow network in detail, prove the correctness of Algorithm 4 and
analyze the algorithm’s time complexity.
4.3.3 Warm-up for Flow Network Construction
Consider relaxing the objective of flow network N to take into account either edge
or triangle weights, and let S of the min s-t cut always contain a candidate CC. To
simplify the discussion, given a cut (S, T) for N, let X_S (X_T) denote the set of nodes
in N representing vertices in G belonging to S (T), and let Y_S (Y_T) denote the set of
nodes in N representing triangles in G contained in S (T). The relaxed objective AD
can be expressed as AD(G(X_S)) = ∑_{e∈G(X_S)} w(e) or ∑_{△∈Tri(G(X_S))} w(△), where
(S, T) is the min cut of N. In the following, we establish the connection between the cut
value c(S, T) and each relaxation of AD.
The first relaxation is as follows: given a graph G, construct a flow network
such that any cut (S, T) with S \ {s} ≠ ∅ has cut value c(S, T) tied to the objective value
∑_{e∈G(X_S)} w(e). Intuitively, ∑_{e∈G(X_S)} w(e) equals the difference between
∑_{v∈X_S} ∑_{u∈N(v,G)} w((v,u))/2 and ∑_{v∈X_S} ∑_{u∈X_T} w((v,u))/2, in which
w((v,u)) = 0 if (v,u) ∉ E(G).
To relate cut and objective values, we construct the network as follows. For each
vertex in G we create a node for N and we add a single s and t to N as well.
W.l.o.g., the discussion in this subsection assumes there is an edge from s to every
node in N having capacity large enough that it does not affect the min cut
discussed below. Adding legitimate capacities to these outgoing edges from the source
node s will be discussed in the next subsection. To ensure the expression of c(S, T)
contains ∑_{v∈X_S} ∑_{u∈N(v,G)} w((v,u))/2, for each node v we create a directed edge to
t with capacity of −∑_{u∈N(v,G)} w((v,u))/2. To ensure c(S, T) contains ∑_{v∈X_S} ∑_{u∈X_T} w((v,u))/2,
for each edge in G with involved vertices u and v, we create
two directed edges (u, v) and (v, u) with the same capacity of w((u,v))/2. As a result,
given any cut (S, T), c(S, T) will be expressed as −∑_{v∈X_S} ∑_{u∈N(v,G)} w((v,u))/2 +
∑_{v∈X_S} ∑_{u∈X_T} w((v,u))/2, which is the negation of ∑_{e∈G(X_S)} w(e). We demonstrate such
a constructed flow network in Figure 4-2 (b) for the sample graph in Figure 4-2 (a), given
a query Q = {k1, k2, k3}. The cut denoted by the dashed line partitions the network into
S = {s, 1, 3, 4, 6} and T = {t, 2, 5} and breaks the edges of interest {(4, 2), (4, 5)} that
are directed from S to T. The X_S-induced subgraph of G contains V(G(X_S)) = {1, 3, 4, 6},
which has objective value ∑_{e∈G(X_S)} w(e) = 24. The cut value c(S, T) = −25 + 1 =
−24, which is the negation of 24.
The second, more complicated relaxation involves triangles. Given
a graph G, construct a flow network such that its c(S, T), where (S, T) is a min
cut and S \ {s} ≠ ∅, is associated with ∑_{△∈Tri(G(X_S))} w(△). Similar to before,
∑_{△∈Tri(G(X_S))} w(△) can be expressed as the difference between ∑_{v∈X_S} ∑_{△∈Tri(v,G)} w(△)/3
and ∑_{u∈X_S} ∑_{v,w∈X_T} w((u,v,w))/3 + ∑_{u,v∈X_S} ∑_{w∈X_T} 2w((u,v,w))/3, where w((u,v,w)) = 0 if (u,v,w) ∉
Tri(G). To achieve that we construct the network as follows. For
each vertex in G we create a vertex node for N, for each triangle in G we create a
triangle node for N, and we create s and t as well. Similar to the first relaxation,
we create edges from s to each vertex node and assume their capacities are large
enough. To ensure c(S, T) contains ∑_{v∈X_S} ∑_{△∈Tri(v,G)} w(△)/3, for each vertex node
v we create a directed edge to t with capacity of −∑_{△∈Tri(v,G)} w(△)/3. To ensure
c(S, T) contains ∑_{u∈X_S} ∑_{v,w∈X_T} w((u,v,w))/3, we create directed edges as follows. For
each triangle node we create three directed edges, one to each vertex node involved in the
triangle, with capacity of 2w(△)/3. For each vertex node, we create a directed edge with
capacity of w(△)/3 to each triangle node it is involved in. Notice that with such a configuration,
only the minimum cut (S, T) ensures that c(S, T) contains ∑_{u∈X_S} ∑_{v,w∈X_T} w((u,v,w))/3 +
∑_{u,v∈X_S} ∑_{w∈X_T} 2w((u,v,w))/3. This property will be formally proven in Lemma 4.3.1.
As an example, Figure 4-2(c) is alternatively constructed for the sample graph in Figure 4-2(a). The cut containing vertex nodes $\{1, 3, 4, 6\}$ could be the one indicated by either the black or the red dashed line. Clearly, the cost of the black dashed line is lower by $\frac{2}{3}$, and in fact it is the min cut, which correctly reflects that node $4$ loses $\frac{2}{3}$ weight because the cut only breaks one count of the triangle $\{4, 2, 5\}$ aggregated by these vertices in the sample graph. The $\{1, 3, 4, 6\}$-induced subgraph of $G$ has $\sum_{\triangle \in Tri(G(X_S))} w(\triangle) = 28$. The min cut of $N$, $(S, T)$, denoted by the black dashed line, has $c(S, T) = -\frac{56}{3} - 10 + \frac{2}{3} = -28$, which is the negation of $\sum_{\triangle \in Tri(G(X_S))} w(\triangle)$.
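The triangle-weight decomposition used in this relaxation can be sanity-checked numerically. The sketch below (Python; the weighted triangle set is hypothetical and all names are illustrative) verifies that summing per-vertex thirds and subtracting the cut-crossing terms recovers the triangle weight of the induced subgraph:

```python
# Hypothetical tiny instance: triangles as weighted vertex triples.
triangles = {frozenset({1, 2, 3}): 3.0, frozenset({2, 3, 4}): 6.0,
             frozenset({1, 3, 4}): 9.0}

def tri_weight_induced(xs):
    # Direct sum of weights of triangles fully inside xs.
    return sum(w for t, w in triangles.items() if t <= xs)

def tri_weight_decomposed(xs):
    # Per-vertex thirds minus the terms broken by the cut.
    per_vertex = sum(w / 3 for t, w in triangles.items()
                     for v in t if v in xs)
    one_in = sum(w / 3 for t, w in triangles.items()
                 if len(t & xs) == 1)        # one vertex kept in xs
    two_in = sum(2 * w / 3 for t, w in triangles.items()
                 if len(t & xs) == 2)        # two vertices kept in xs
    return per_vertex - one_in - two_in

assert abs(tri_weight_induced({1, 2, 3})
           - tri_weight_decomposed({1, 2, 3})) < 1e-9
```

The same identity holds for any vertex subset, which is exactly what the capacities $\frac{w(\triangle)}{3}$ and $\frac{2w(\triangle)}{3}$ encode in the network.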
Through combining the above constructed networks, i.e., creating a union of edges and nodes and combining the capacities on the edges from vertex nodes to $t$, we are able to get a flow network whose min cut $(S,T)$ has $c(S,T)$ containing $\sum_{e \in E(G(X_S))} w(e)$ and $\sum_{\triangle \in Tri(G(X_S))} w(\triangle)$, which is very close to $AD(G(X_S))$. One challenge remains: most existing min cut algorithms do not allow negative capacities. To make $N$ contain only positive capacities while its min cut still involves $AD(G(X_S))$, we modify the construction as follows. For the edge from each vertex node of $N$ to $t$, we add/enlarge the previously combined capacity by $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) + g$, where $g$ acts as our guessed optimal contextual weighted density. By doing so, $c(S,T)$ must contain the expression $|X_S|(g - AD(G(X_S)))$, whereas previously $c(S,T) = -|X_S| \cdot AD(G(X_S))$; either way it involves the contextual weighted density of the $X_S$-induced subgraph of $G$.

Until now, we have successfully constructed a flow network whose min cut $(S,T)$ can be used to derive a candidate CC $G(X_S)$. Next, we show the complete construction in Algorithm 5 by adding capacities from $s$ to vertex nodes, and explain how the obtained min cut values direct the binary search.
4.3.4 CC Auxiliary Flow Network
In this subsection, following the warm-up intuitions and constructions we show how
the complete auxiliary flow network $N$ for CC is carefully constructed and tuned for directing the CC search.

Algorithm 5: Auxiliary flow network construction
Data: G
1  $V(N) \leftarrow \emptyset$, $E(N) \leftarrow \emptyset$, $C(N) \leftarrow \emptyset$;
2  create $s$ and $t$, $V(N) \leftarrow V(N) \cup \{s\} \cup \{t\}$;
3  foreach $\triangle \in Tri(G)$ or $v \in V(G)$ do create a node $n$, $V(N) \leftarrow V(N) \cup \{n\}$;
4  foreach $v \in V(N)$ do
5      if $v$ is a vertex in $V(G)$ then
6          create an edge to every triangle node $\triangle$ that $v$ participates in, with capacity $\frac{w(\triangle)}{3}$;
7          create an edge from $s$ to $v$ with capacity $\sum_{e \in E(G)} w(e) + \sum_{\triangle \in Tri(G)} w(\triangle)$;
8          create an edge from $v$ to $t$ with capacity $\sum_{e \in E(G)} w(e) + \sum_{\triangle \in Tri(G)} w(\triangle) + g - \sum_{u \in N(v,G)} \frac{w((v,u))}{2} - \sum_{\triangle \in Tri(v,G)} \frac{w(\triangle)}{3}$;
9          create an edge from $v$ to each node in $N(v)$ with capacity $\frac{w((u,v))}{2}$;
10     if $v$ is a triangle $\triangle \in Tri(G)$ then
11         create an edge to every vertex node of the triangle with capacity $\frac{2w(\triangle)}{3}$;
12 return $N$;
The whole construction of $N$ is displayed in Algorithm 5. In addition to a source node $s$ and a sink node $t$, the node set of $N$ contains two types of nodes, triangle nodes and vertex nodes, which correspond to the triangles and vertices of a social network $G$ (lines 2 to 3). Lines 4 to 11 assign capacities to the created directed edges. The capacity of every edge from $s$ to a vertex node is set to $\sum_{e \in E(G)} w(e) + \sum_{\triangle \in Tri(G)} w(\triangle)$. The reason is that for different guessed and tuned $g$, we want the cost of the min cut to always contain the term $|V(G)|(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e))$, which is important for proving the correctness of the binary search. The purposes of the other assigned capacities have been explained in the warm-up constructions. Next, we formally introduce and prove the min cut value of the constructed $N$.
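As a concreteness check, the per-vertex sink capacity of line 8 in Algorithm 5 can be sketched in Python (a hedged sketch: `edge_w` maps vertex pairs to weights, `tri_w` maps vertex triples to weights, and `g` is the current density guess; all names are illustrative):

```python
def vertex_to_sink_capacity(v, edge_w, tri_w, g):
    """Capacity of the edge (v, t) per line 8 of Algorithm 5:
    total graph weight + g - (v's incident edge weight)/2
                           - (v's incident triangle weight)/3."""
    total = sum(edge_w.values()) + sum(tri_w.values())
    inc_e = sum(w for e, w in edge_w.items() if v in e)
    inc_t = sum(w for t, w in tri_w.items() if v in t)
    return total + g - inc_e / 2 - inc_t / 3
```

Because the incident-weight terms are subtracted from the full graph weight plus $g$, the resulting capacity is never negative, which is exactly the purpose of the modified construction.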
Min s-t cut. We show it with a lemma as follows.
Lemma 4.3.1. The min s-t cut of $N$, denoted by $(S_m, T_m)$, must be of the form:

$$c(S_m, T_m) = |V(G)|\Big(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e)\Big) + |X_{S_m}|\big(g - AD(G(X_{S_m}))\big) \quad (4.2)$$
Proof sketch. The proof is conducted by showing the correctness of two auxiliary lemmas.

Firstly, we show how to express the min s-t cut of $N$. For the sake of simplicity, let $X$ denote the set of nodes in $N$ corresponding to vertices in $V(G)$, and let $Y$ denote the set of nodes in $N$ corresponding to triangles in $Tri(G)$. Accordingly, we use $X_S$ ($Y_S$) and $X_T$ ($Y_T$) to denote the sets of nodes of $X$ ($Y$) in the $S$ and $T$ parts after applying an s-t cut to $N$. Let $Tri_i(X_S)$ denote the set of triangles having exactly $i$ of their vertices in $X_S$. The lemma below shows the expression of the min s-t cut.
Lemma 4.3.2. Let $(S, T)$ be a minimum s-t cut of $N$; the capacity $c(S, T)$ must be of the form:

$$c(S,T) = \sum_{v \in X_T} c(s,v) + \sum_{v \in X_S} c(v,t) + \sum_{v \in X_S}\sum_{u \in X_T} c(v,u) + \sum_{\substack{(u,v,w) \in Tri_1(X_S) \\ u \in X_S}} c(u,(u,v,w)) + \sum_{\substack{(u,v,w) \in Tri_2(X_S) \\ u,v \in X_S,\ (u,v,w) \in Y_T}} \big(c(u,(u,v,w)) + c(v,(u,v,w))\big) + \sum_{\substack{(u,v,w) \in Tri_2(X_S) \\ u,v \in X_S,\ (u,v,w) \in Y_S}} c((u,v,w), w) \quad (4.3)$$
Proof sketch. Case 1: $X_T = \{t\}$. Then $c(S,T)$ equals $\sum_{v \in X_S} c(v,t)$ and the correctness of Lemma 4.3.2 is clear.

Case 2: $X_S = \{s\}$. Then $c(S,T)$ equals $\sum_{v \in X_T} c(s,v)$ and the correctness of Lemma 4.3.2 is clear.

Case 3: $T \setminus \{t\} \neq \emptyset$ and $S \setminus \{s\} \neq \emptyset$. The first line of the equation in Lemma 4.3.2 is clearly correct. When breaking a triangle $\triangle$, two situations may happen, corresponding to the remaining terms of Equation 4.3. Firstly, only one vertex $u$ is in $X_S$. In this situation, to get the min cut, the triangle node must be in $Y_T$; if not, we can always get a cut with smaller capacity by moving the triangle node from $Y_S$ to $Y_T$, and this cut capacity correctly reflects that $u$ loses one count of the triangle. Secondly, two vertices $u, v$ of $(u,v,w)$ are in $X_S$. The triangle node could be in either $Y_T$ or $Y_S$, and in both situations the designed $N$ correctly reflects that two counts of the triangle are lost from $u$ and $v$'s perspective. If the triangle node is in $Y_T$, the min cut breaks the two directed edges $(u, (u,v,w))$ and $(v, (u,v,w))$, which correctly accounts for the two lost counts. If the triangle node is in $Y_S$, since $w$ is not in $X_S$, the min cut breaks the directed edge $((u,v,w), w)$, which also correctly accounts for the two lost counts, since the capacity of an edge from a triangle node to a vertex node is twice the capacity from a vertex node to a triangle node; either placement yields the same cut cost. We can conclude that the equation in Lemma 4.3.2 expresses the form of a min s-t cut of $N$.
Substituting the capacities into Equation 4.3, we obtain the following lemma.

Lemma 4.3.3. $c(S, T)$ can be organised as:

$$c(S,T) = |V(G)|\Big(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e)\Big) + |X_S|\Big(g - \frac{\sum_{\triangle \in Tri(G(X_S))} w(\triangle) + \sum_{e \in E(G(X_S))} w(e)}{|X_S|}\Big) \quad (4.4)$$
Proof sketch. Firstly, by substituting the capacities into Equation 4.3, we obtain:

$$c(S,T) = (|X_S| + |X_T|)\Big(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e)\Big) + \sum_{v \in X_S} \Big(g - \sum_{u \in N(v,G)} \frac{w((u,v))}{2} - \sum_{\triangle \in Tri(v,G)} \frac{w(\triangle)}{3}\Big) + \sum_{v \in X_S} \sum_{u \in X_T} \frac{w((u,v))}{2} + \sum_{\triangle \in Tri_1(X_S)} \frac{w(\triangle)}{3} + 2\sum_{\triangle \in Tri_2(X_S)} \frac{w(\triangle)}{3}. \quad (4.5)$$
Now we show two equivalences. Firstly, $\sum_{v \in X_S} \sum_{u \in N(v,G)} \frac{w((u,v))}{2} - \sum_{v \in X_S} \sum_{u \in X_T} \frac{w((u,v))}{2}$ is equivalent to $\sum_{e \in E(G(X_S))} w(e)$. Secondly, $\sum_{v \in X_S} \sum_{\triangle \in Tri(v,G)} \frac{w(\triangle)}{3} - \sum_{\triangle \in Tri_1(X_S)} \frac{w(\triangle)}{3} - 2\sum_{\triangle \in Tri_2(X_S)} \frac{w(\triangle)}{3}$ is equivalent to $\sum_{\triangle \in Tri(G(X_S))} w(\triangle)$.
Using the two equivalences, we can transform Equation 4.5 into

$$c(S,T) = (|X_S| + |X_T|)\Big(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e)\Big) + \sum_{v \in X_S} g - \sum_{e \in E(G(X_S))} w(e) - \sum_{\triangle \in Tri(G(X_S))} w(\triangle), \quad (4.6)$$

which can clearly be rearranged into the equation in Lemma 4.3.3.
The correctness of Lemma 4.3.3 immediately shows the correctness of Lemma 4.3.1. In particular, the obtained expression for $c(S_m, T_m)$ will help us direct the binary search, which is discussed in detail next.

Update flow network. $tryOpt$ in Algorithm 4 updates/tunes the created $N$. Essentially, in every iteration it replaces the value of $g$ with the updated $mid$ for the edges created in line 8 of Algorithm 5.
4.3.5 Correctness and Time Complexity
Why do min s-t cuts direct the search? After introducing some notations, we show with a lemma that the min s-t cut of $N$ involving $g$ can help us determine whether the currently guessed contextual weighted density is higher or lower than the optimal one.

Notations. $H_o$ denotes the CC, i.e., $H_o = \arg\max_H\{AD(H) \mid H \subseteq G\}$, and the corresponding optimal contextual weighted density is $AD(H_o)$. $H'_o$ denotes the candidate CC obtained from the updated $N$. $g$ denotes a guessed contextual weighted density.
Lemma 4.3.4. Given a min s-t cut $(S_m, T_m)$: if $S_m \setminus \{s\} \neq \emptyset$, then $g < AD(H'_o)$; otherwise, if $S_m \setminus \{s\} = \emptyset$, then $g \ge AD(H'_o)$.
Proof sketch. First of all, given the cut with $S = \{s\}$, the cost of the cut is $|V(G)|(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e))$. Since $(S_m, T_m)$ is the minimum cut, we have the inequality $c(S_m, T_m) \le |V(G)|(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e))$. To prove the lemma, for convenience we instead prove that its contrapositive holds, which is: 1) if $g \ge AD(H'_o)$, then $S_m \setminus \{s\} = \emptyset$; and 2) if $g \le AD(H'_o)$, then $S_m \setminus \{s\} \neq \emptyset$, where $H'_o = G(X_{S_m})$.

We prove 1) by contradiction. Assume $g < AD(G(X_{S_m}))$ while $S_m \setminus \{s\} = \emptyset$. But $S_m \setminus \{s\} = \emptyset$ only happens when $g \ge AD(G(X_{S_m}))$, as in this case $S = \{s\}$ and $c(S_m, T_m) = |V(G)|(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e))$, a contradiction. Hence $g \ge AD(G(X_{S_m}))$ holds.

Now for 2), assume there exists a $G(X_{S'})$ with contextual weighted density $ad'$ such that $g \ge ad'$ while $S' \setminus \{s\} \neq \emptyset$. The value $c(S'_m, T'_m)$ now becomes $|V(G)|(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e)) + |X_{S'}|(g - ad')$. Based on the derived inequality, $c(S'_m, T'_m)$ shall be no greater than $|V(G)|(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e))$, which holds if and only if $g \le ad'$. This contradicts our assumption that $g \ge ad'$. As such, the second part must hold.
How does the binary search stop? To ensure that the binary search converges to the CC within a finite number of steps, we have to show that there is a finite range of contextual weighted densities. The range of possible densities is:

$$\Big\{\frac{w}{n}\ \Big|\ 0 \le w \le |Q| \times \big(6 C_{|V(N)|}^{3} + 2|V(N)|^2\big),\ 1 < n \le |V(G)|\Big\} \quad (4.7)$$

We now show that the smallest search interval between different densities ($min\ interval$ in Algorithm 4) is $\frac{1}{n(n-1)}$. An interval between two densities can be expressed as $\frac{w_1}{n_1} - \frac{w_2}{n_2}$, i.e., $\frac{n_2 w_1 - n_1 w_2}{n_1 n_2}$. When $n_1 = n_2$, since the minimum difference between $w_1$ and $w_2$ is $1$, the minimum interval is $\frac{1}{n}$. When $n_1 \neq n_2$, we have $n_1 n_2 \le n(n-1)$ and $n_2 w_1 - n_1 w_2 \ge 1$, which together make the minimum interval $\frac{1}{n(n-1)}$. Further, Equation 4.7 can guide the initial search interval for Algorithm 4.
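The binary search skeleton can be sketched as follows (a hedged sketch in Python: `is_feasible(g)` stands in for the min s-t cut test of Lemma 4.3.4, reporting whether some subgraph has density strictly greater than the guess `g`; the oracle and names are illustrative):

```python
def binary_search_density(lo, hi, n, is_feasible):
    """Binary search over density guesses, stopping once the search
    interval drops below the smallest possible density gap 1/(n(n-1))."""
    min_interval = 1.0 / (n * (n - 1))
    while hi - lo >= min_interval:
        mid = (lo + hi) / 2
        if is_feasible(mid):
            lo = mid          # guess too low: some subgraph beats it
        else:
            hi = mid          # guess too high: no subgraph beats it
    return lo                 # within min_interval of the optimum
```

Since the optimum always stays inside `[lo, hi]` and the gap argument above bounds how close two distinct densities can be, the returned value identifies the optimal density after $O(\log(|V(G)||Q|))$ oracle calls.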
The termination condition of Algorithm 4 can now be determined from the minimum search interval and the conditions in Lemma 4.3.4; it is formally stated as follows.

Stop Condition 4.3.1. There is a community $H_o$ with density score $AD(H_o)$, while there is no community $H'_o$ whose score is greater than $AD(H_o)$ but less than $AD(H_o) + \frac{1}{n(n-1)}$.
Correctness. Now we are ready to present the correctness theorem.
Theorem 4.3.1. Algorithm 4 correctly finds the CC with the aid of the proposed auxiliary flow network.
The theorem is immediate from the correctness of Lemma 4.3.4 and Stop Con-
dition 4.3.1.
Time complexity. The time complexity of Algorithm 4 is $O(\log(|V(G)||Q|) \times |V(N)|^3)$. The time to construct $N$ is dominated by triangle counting, $O(|E(G)|^{3/2})$, and we only need to construct $N$ once and then tune the guessed weighted density. The total number of binary search iterations is bounded by $\log(|V(G)||Q|)$. For each iteration, we need to run a min s-t cut algorithm on $N$, which is bounded by $O(|V(N)|^3)$ using the first-in first-out (FIFO) version of the preflow algorithm proposed in [40].
Algorithm 6: Iterative optimization algorithm
Data: G
1 $H_o \leftarrow G$;
2 $ad_o \leftarrow AD(H_o)$;
3 $H'_o \leftarrow \arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$;
4 while $l(ad_o, H_o \leftarrow H'_o) \neq 0$ do
5     $ad_o \leftarrow AD(H_o)$;
6     $H'_o \leftarrow \arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$;
7 return $H_o$;
4.4 An Improved Approach
Mathematically, the objective function of the CC model maximises a fractional density function. Following this observation, we propose an algorithm that solves the CC search problem via an iterative optimization framework. A straightforward implementation of this framework based on network flows is easy to build, with time complexity $O(|V(G)| \times |V(N)|^3)$. However, as shown in the next section, we further achieve a more sophisticated implementation taking $O(|V(N)|^3)$ time via the technique of parametric maximum flow. In this section, we focus on showing how the optimization framework guarantees to find the CC, and in the next section we show its runtime improvement.
4.4.1 Optimization Framework
The overall idea of the algorithmic framework is as follows. We start off by considering the whole graph $G$ as a candidate CC and then use the stop condition in line 4 of Algorithm 6 (explained later) to check whether $G$ itself is a CC. If not, we generate a better solution by solving the subproblem $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$, in which $l(ad, H')$ is defined as:

$$l(ad, H') = \sum_{\triangle \in Tri(H')} w(\triangle) + \sum_{e \in E(H')} w(e) - ad \times |V(H')| \quad (4.8)$$

and check the optimality again. Improved solutions are repeatedly generated in this way until the stop condition is met, i.e., a CC is found. The detailed steps are displayed in Algorithm 6.
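The framework above can be made concrete with a brute-force miniature in Python: the min-cut subproblem is replaced by exhaustive enumeration over vertex subsets (feasible only on toy graphs), only edge weights are used (the triangle term of Equation 4.8 is omitted for brevity), and the example graph is hypothetical:

```python
from itertools import combinations

edges = {(1, 2): 4.0, (2, 3): 4.0, (1, 3): 4.0, (3, 4): 1.0}

def l_value(ad, vs):
    # l(ad, H') = weight inside the induced subgraph - ad * |V(H')|
    return sum(w for (u, v), w in edges.items()
               if u in vs and v in vs) - ad * len(vs)

def density(vs):
    return sum(w for (u, v), w in edges.items()
               if u in vs and v in vs) / len(vs)

def iterate(vertices):
    ho = set(vertices)                   # start from the whole graph
    while True:
        ad = density(ho)
        cands = [set(c) for r in range(1, len(vertices) + 1)
                 for c in combinations(vertices, r)]
        best = max(cands, key=lambda c: l_value(ad, c))
        if l_value(ad, best) <= 0:       # stop condition of line 4
            return ho
        ho = best                        # strictly denser candidate
```

On this toy graph the triangle $\{1, 2, 3\}$ has density $4$, beating the whole graph's $13/4$, and the loop converges to it in one improvement step.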
In fact, in theoretical computer science and network optimization research [38, 56], such a framework has been adopted to solve various optimization problems, including the degree-based densest subgraph problem. However, it is non-trivial to prove that the framework, together with our carefully designed Algorithm 7, actually solves the CC search problem involving contextually weighted edges and triangles. Further, how fast the problem can be solved with the framework is unclear. Specifically, our challenges are: 1) can lines 4 to 7 in Algorithm 6 ensure that $H_o$ converges to the optimal solution? 2) how can the subproblem $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$ be solved? and 3) how fast can Algorithm 6 run in the worst case? We answer questions 1) and 2) in the following subsections and 3) in the next section.
4.4.2 Algorithm Correctness
Algorithm 6 is clearly correct if $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$ (line 6) always returns an $H'_o$ having a higher density score, i.e., the current iteration generates a candidate community having the highest contextual weighted density so far. We formally prove this property with the lemmas below. Notice that in the following arguments we use the same symbols as in Algorithm 6.
Lemma 4.4.1. Let $ad' = AD(H'_o)$, where $H'_o$ is generated by Algorithm 6; then $ad' > ad_o$ always holds before the algorithm terminates.

Proof sketch. By definition, we have $l(ad', H'_o) = \sum_{\triangle \in Tri(H'_o)} w(\triangle) + \sum_{e \in E(H'_o)} w(e) - ad' \times |V(H'_o)| = 0$. After subtracting $ad_o \times |V(H'_o)|$ from both sides of the equation, we get $\sum_{\triangle \in Tri(H'_o)} w(\triangle) + \sum_{e \in E(H'_o)} w(e) - ad_o \times |V(H'_o)| = ad' \times |V(H'_o)| - ad_o \times |V(H'_o)|$. Therefore, if the left side of the equation is greater than $0$, then $ad' > ad_o$ must hold. To establish that $\sum_{\triangle \in Tri(H'_o)} w(\triangle) + \sum_{e \in E(H'_o)} w(e) - ad_o \times |V(H'_o)| > 0$, we prove an auxiliary lemma as follows.
Lemma 4.4.2. Let $ad_o = AD(H_o)$; then $l(ad_o, H'_o) > 0$ before Algorithm 6 terminates.

Proof sketch. Since $H'_o \leftarrow \arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$, by definition $\max_{H'} l(ad_o, H') > l(ad_o, H_o)$ must hold before the algorithm terminates. Since $l(ad_o, H_o) = 0$ by definition, the value of any subgraph returned by $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$ is strictly greater than $0$. Hence the correctness of the lemma is clear, which in turn ensures the correctness of Lemma 4.4.1.
Iteration stop condition. Next, we show the stop condition. Let $ad_o = AD(\arg\max_H\{AD(H) \mid H \subseteq G\})$, which will be reached after a certain number of iterations by Lemma 4.4.1; then $l(ad_o, H_o \leftarrow H'_o) = 0$ means that $H_o$ is the result of $\arg\max_H\{AD(H) \mid H \subseteq G\}$.

In conclusion, Lemmas 4.4.1 and 4.4.2 and the stop condition together guarantee the correctness of Algorithm 6.
4.4.3 Solving the Subproblem
Here we prove that the optimization problem $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$ can be solved by finding a min s-t cut in our carefully constructed flow network. In particular, we first show in Algorithm 7 the detailed construction of the flow network $N$, given the current $ad_o$ and the social network $G$. Note that this new construction is quite different from the previous one: edge nodes are also included, and the structural mapping $G \to N$ becomes more transparent with the added infinite capacities.

Algorithm 7: Flow network for solving a subproblem
Data: G
1  $V(N) \leftarrow \emptyset$, $E(N) \leftarrow \emptyset$, $C(N) \leftarrow \emptyset$;
2  create vertices $s$ and $t$, $V(N) \leftarrow V(N) \cup \{s\} \cup \{t\}$;
3  for $\triangle \in Tri(G)$ or $v \in V(G)$ or $e \in E(G)$ do
4      create a node $n$, $V(N) \leftarrow V(N) \cup \{n\}$;
5      if $n$ is a triangle $(u, v, w)$ then
6          create a directed edge from $s$ to $\triangle$ with capacity $w((u, v, w))$;
7          for each edge node $n'$ representing $(u, v)$, $(u, w)$, $(v, w)$ do
8              create a directed edge $((u, v, w), n')$ with infinite capacity;
9      if $n$ is an edge $(u, v)$ then
10         create a directed edge from $s$ to $e$ with capacity $w(e)$;
11         for each vertex node $n'$ representing $u$, $v$ do
12             create a directed edge $((u, v), n')$ with infinite capacity;
13     if $n$ is a vertex node then
14         create a directed edge $(n, t)$ with capacity $ad_o$;
15 return $N$;
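The construction in Algorithm 7 can be sketched as follows (a hedged sketch in Python: `edge_w` and `tri_w` map sorted vertex tuples to weights, `INF` stands for the infinite capacities, and the node-tagging scheme is illustrative):

```python
INF = float("inf")

def build_subproblem_network(vertices, edge_w, tri_w, ad_o):
    """Edges of Algorithm 7's flow network as (u, v, capacity) triples.
    Node names: 's', 't', ('v', x), ('e', x, y), ('t3', x, y, z);
    edge/triangle keys are assumed to use sorted endpoints."""
    net = []
    for (x, y, z), w in tri_w.items():
        net.append(('s', ('t3', x, y, z), w))           # line 6
        for a, b in ((x, y), (x, z), (y, z)):           # lines 7-8
            net.append((('t3', x, y, z), ('e', a, b), INF))
    for (x, y), w in edge_w.items():
        net.append(('s', ('e', x, y), w))               # line 10
        net.append((('e', x, y), ('v', x), INF))        # lines 11-12
        net.append((('e', x, y), ('v', y), INF))
    for x in vertices:
        net.append((('v', x), 't', ad_o))               # line 14
    return net
```

Because all infinite-capacity edges point "downward" from triangles to edges to vertices, any finite s-t cut that keeps a triangle node on the $s$ side must also keep its edge and vertex nodes there, which is exactly the validity property of Lemma 4.4.3.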
Next, we show the equivalence between the problems of solving $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$ and finding the min s-t cut in $N$. We achieve this by the following lemmas.
Lemma 4.4.3. Valid cut. For every s-t cut in $N$ that breaks no edges of infinite capacity, the nodes in $S$ correspond to a valid subgraph $H$ of $G$, i.e., (1) if an edge node is in $S$, the two involved vertex nodes are in $S$, and (2) if a triangle node is in $S$, the three involved vertex nodes and edge nodes are in $S$.

Lemma 4.4.3 is obvious from the construction in Algorithm 7.
Lemma 4.4.4. The min s-t cut of $N$ corresponds to a solution of $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$.

Proof sketch. Firstly, the min s-t cut must be a valid cut as defined by Lemma 4.4.3. Then we show the following correspondence. Let $(S_m, T_m)$ be the min s-t cut and let $(S, T)$ be any valid s-t cut. Since both are valid cuts, $c(S_m, T_m) = \sum_{v \in S_m} c(v,t) + \sum_{v \in T_m} c(s,v)$ and $c(S, T) = \sum_{v \in S} c(v,t) + \sum_{v \in T} c(s,v)$. Clearly, the inequality $c(S_m, T_m) \le c(S, T)$ holds, i.e., $\sum_{v \in S_m} c(v,t) + \sum_{v \in T_m} c(s,v) \le \sum_{v \in S} c(v,t) + \sum_{v \in T} c(s,v)$. We will show the detailed transformations of this inequality leading to Equation 4.8.

Now, by the flow network we constructed, $\sum_{v \in T_m} c(s,v)$ can be expressed as $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - (\sum_{\triangle \in T(S)} w(\triangle) + \sum_{e \in E(S)} w(e))$, where $T(S)$ denotes the triangles represented by triangle nodes in $S$ and $E(S)$ denotes the set of edges represented by edge nodes in $S$. Since a min s-t cut is a valid cut (defined by Lemma 4.4.3), the vertex, edge, and triangle nodes in $S \setminus \{s\}$ form some $H_o$, and $H_o$ contains the triangles and edges represented by $T(S)$ and $E(S)$. Therefore, we can rewrite the expression as $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - (\sum_{\triangle \in Tri(H_o)} w(\triangle) + \sum_{e \in E(H_o)} w(e))$.

Next, we alternatively express $\sum_{v \in S_m} c(v,t)$. Clearly, if $(v,t)$ exists in $E(N)$, $v$ must be a vertex node by construction, and the vertex nodes in $S_m$ must be contained in $H_o$. As a result, $\sum_{v \in S_m} c(v,t)$ can be expressed as $\sum_{v \in V(H_o)} ad_o$, i.e., $ad_o \times |V(H_o)|$.

The left part of the inequality now becomes $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - (\sum_{\triangle \in Tri(H_o)} w(\triangle) + \sum_{e \in E(H_o)} w(e)) + ad_o \times |V(H_o)|$. By the same approach, the right part of the inequality can be expressed as $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - (\sum_{\triangle \in Tri(H)} w(\triangle) + \sum_{e \in E(H)} w(e)) + ad_o \times |V(H)|$, in which $H$ is the subgraph consisting of the vertex nodes in $S$ together with all triangles and edges represented by the triangle and edge nodes in $S$. By negating both sides of the newly expressed inequality and simplifying, we get $\sum_{\triangle \in Tri(H_o)} w(\triangle) + \sum_{e \in E(H_o)} w(e) - ad_o \times |V(H_o)| \ge \sum_{\triangle \in Tri(H)} w(\triangle) + \sum_{e \in E(H)} w(e) - ad_o \times |V(H)|$. This inequality clearly shows that the min s-t cut is equivalent to finding the solution of $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$.
4.4.4 Analysing the Number of Iterations
At most $n$ iterations. We show that the number of iterations in Algorithm 6 is at most $n$, where $n = |V(G)|$, thereby implying a first runtime complexity of $O(|V(G)| \times |V(N)|^3)$. Note that in the next section we show how to avoid this factor of $n$ via an incremental parametric flow framework. The claim of a linear number of iterations is formally proved as follows.
Proof sketch. Given any three consecutive iterations, let $H_o$, $H'_o$ and $H''_o$ be the candidates generated by Algorithm 6. Accordingly, let $ad = AD(H_o)$, $ad' = AD(H'_o)$ and $ad'' = AD(H''_o)$. By definition, $H'_o$ is the result of $\arg\max_{H'}\{l(ad, H') \mid H' \subseteq G\}$. Now consider generating $H''_o$ from $H'_o$. The vertices contained in $H''_o$ can be categorised as follows. Firstly, it contains some vertices that are in $V(H'_o)$, denoted by $V_1 \subseteq V(H'_o)$. Secondly, it contains some vertices that are not in $V(H'_o)$, denoted by $V_2$. If we prove that $V_2$ is always $\emptyset$, we can conclude that for any consecutively generated $H'_o$ and $H''_o$, $H''_o \subset H'_o$ always holds.

We prove this by contradiction. Assume $V_2 \neq \emptyset$; we show that this implies $H'_o$ is not the optimum result of $\arg\max_H\{l(ad, H) \mid H \subseteq G\}$, which contradicts the precondition that $H'_o$ is that optimum, and thereby proves that $V_2$ must be $\emptyset$. The derivation is as follows.

By definition, $H''_o$ must be $\arg\max_H\{l(ad', H) \mid H \subseteq G\}$, which leads to the inequality $l(ad', H''_o) \ge l(ad', G(V_1))$. From this inequality we get $AD(G(V_2)) \ge ad'$, which is derived based on $V(H''_o) = V_1 \cup V_2$. Now we show that $AD(G(V_2)) \ge ad'$ implies $l(ad, G(V(H'_o) \cup V_2)) > l(ad, H'_o)$. Based on the fact that $G(V(H'_o) \cup V_2)$ contains $H'_o \cup G(V_2)$, the difference $l(ad, G(V(H'_o) \cup V_2)) - l(ad, H'_o)$ is at least $l(ad, G(V_2))$. Since by Lemma 4.4.1 $ad' > ad$, and $AD(G(V_2)) \ge ad'$ gives $\sum_{\triangle \in Tri(G(V_2))} w(\triangle) + \sum_{e \in E(G(V_2))} w(e) \ge |V_2|\, ad'$, while $l(ad, G(V_2)) = \sum_{\triangle \in Tri(G(V_2))} w(\triangle) + \sum_{e \in E(G(V_2))} w(e) - |V_2|\, ad$, we can see that $l(ad, G(V_2)) \ge |V_2|\, ad' - |V_2|\, ad > 0$. That is, $l(ad, G(V(H'_o) \cup V_2)) > l(ad, H'_o)$. The derived inequality shows that if $V_2 \neq \emptyset$, $H'_o$ is not the optimum solution of $\arg\max_H\{l(ad, H) \mid H \subseteq G\}$, whereas $H'_o$ should be optimum. Therefore, we conclude that for any consecutively generated $H'_o$ and $H''_o$, $H''_o \subset H'_o$ always holds.
4.5 The Incremental Parametric Maximum Flow
From the previous discussion, we can observe that Algorithm 6 solves a series of strongly correlated maximum flow problems to find the CC. In this section, we show that by modifying the proposed flow network, we can apply the faster parametric maximum flow algorithm, which incrementally updates the flow network instead of rebuilding it.
Intuition. Algorithm 6 successively computes maximum flow problems that are closely related, i.e., during the course of the iterations, the structure of the flow network barely changes except for the value of $ad_o$.

Indeed, in [38], Gallo et al. investigated exactly such situations and proposed the so-called parametric maximum flow technique, which keeps and incrementally updates the state of the successive flow networks. It turns out that successively solving the maximum flow problems is computationally equivalent to solving only one maximum flow problem, thereby dramatically speeding up the whole computation.

However, it is unclear whether this method can be adopted to solve the CC search problem much faster (avoiding $n$ rounds of flow computation in our case). In this section, we give an affirmative answer by slightly modifying the previously constructed flow network. We also prove that the new network indeed correctly finds the CC with the parametric maximum flow algorithm.
In the following, we first provide preliminaries about parametric maximum flow and then apply the technique to CC search.
4.5.1 Preliminaries
The parametric maximum flow algorithm is based on the preflow algorithm, which computes a maximum flow in a given flow network. Two concepts are essential before presenting the preflow algorithm: preflow and valid labelling.

Preflow. A preflow $f$ on $N$ is a real-valued function on node pairs satisfying the capacity and antisymmetry constraints discussed in Section 4.3.1, together with the following relaxation of the conservation constraint:

$$\sum_{u \in V(N)} f(u, v) \ge 0 \quad \text{for all } v \in V(N) \setminus \{s\}. \quad (4.9)$$

Given a preflow, the excess $e(v)$ of a node $v$ is defined as $\sum_{u \in V(N)} f(u, v)$ if $v \neq s$, or infinity if $v = s$. The value of the preflow is defined as $e(t)$. A node $v \in V(N) \setminus \{s, t\}$ is called active if $e(v) > 0$. A preflow is a flow if and only if $e(v) = 0$ for all $v \in V(N) \setminus \{s, t\}$. A node pair $(u, v)$ is a residual directed edge for $f$ if $f(u, v) < c(u, v)$, and the difference $c(u, v) - f(u, v)$ is the residual capacity of the directed edge. A pair $(u, v)$ that is not a residual directed edge is saturated.
Valid labelling. A valid labelling $d$ for a preflow $f$ is a function from the nodes to the non-negative integers or infinity, such that $d(t) = 0$, $d(s) = |V(N)|$, and $d(u) \le d(v) + 1$ for every residual directed edge $(u, v)$. The residual distance $d_f(u, v)$ from a node $u$ to a node $v$ is the minimum number of directed edges on a residual path from $u$ to $v$, or infinity if no such path exists. If $d$ is a valid labelling, then $d(v) \le \min\{d_f(v, t), d_f(v, s) + n\}$ for any node $v$.
Preflow algorithm. The algorithm repeats one of the following two operations on an active node $u$ until no active node remains. When the algorithm terminates, the preflow $f$ is a maximum flow.

Push operation. Let $v$ be a forward neighbour of $u$. If $d(u) = d(v) + 1$ and $f(u, v) < c(u, v)$, push a flow value of $\min\{e(u), c(u, v) - f(u, v)\}$ from $u$ to $v$.

Relabelling operation. If $d(u) \le d(v)$ for all forward neighbours $v$ of $u$, relabel $d(u)$ to $\min\{d(v) \mid v \in \text{forward neighbours of } u\} + 1$.
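The push and relabel operations above can be assembled into a compact (unoptimised) preflow algorithm. The sketch below is a hedged illustration in Python, not the FIFO or dynamic-tree variant analysed later: it uses a queue of active nodes, integer node ids $0..n-1$, and a dict of capacities; all names are illustrative:

```python
from collections import deque

def max_flow_push_relabel(n, cap, s, t):
    """Generic push-relabel max flow; cap is {(u, v): capacity}."""
    c = {}
    for (u, v), w in cap.items():          # add residual back-edges
        c[(u, v)] = c.get((u, v), 0) + w
        c.setdefault((v, u), 0)
    f = {e: 0 for e in c}
    excess, label = [0] * n, [0] * n
    label[s] = n                            # d(s) = |V(N)|
    adj = [[] for _ in range(n)]
    for (u, v) in c:
        adj[u].append(v)
    active = deque()
    for v in adj[s]:                        # saturate edges out of s
        d = c[(s, v)]
        if d > 0:
            f[(s, v)] += d; f[(v, s)] -= d
            excess[v] += d
            if v != t:
                active.append(v)
    while active:
        u = active.popleft()
        while excess[u] > 0:                # discharge u
            pushed = False
            for v in adj[u]:
                if c[(u, v)] - f[(u, v)] > 0 and label[u] == label[v] + 1:
                    d = min(excess[u], c[(u, v)] - f[(u, v)])
                    f[(u, v)] += d; f[(v, u)] -= d
                    excess[u] -= d; excess[v] += d
                    if v not in (s, t) and v not in active:
                        active.append(v)
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:                  # no admissible edge: relabel
                label[u] = 1 + min(label[v] for v in adj[u]
                                   if c[(u, v)] - f[(u, v)] > 0)
    return excess[t]                        # value of the max flow
```

When the queue empties, no active node remains, so the preflow is a flow, and its value is the excess accumulated at $t$.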
Computation of the min s-t cut. After running the preflow algorithm, a minimum cut can be found as follows. For each node $u \in V(N)$, replace $d(u)$ by $\min\{d_f(u, s) + |V(N)|, d_f(u, t)\}$. Then the cut $(S, T)$ defined by $S = \{u \mid d(u) \ge |V(N)|\}$ is a minimum cut whose sink partition $T$ is of minimum size. If desired, a cut $(S', T')$ with minimum-size $S'$ can be computed as follows: for each $u \in V(N)$, let $d'(u) = \min\{d_f(s, u), d_f(t, u) + |V(N)|\}$, and let $S' = \{u \mid d'(u) < |V(N)|\}$.
Algorithm 8: Parametric preflow algorithm
Data: G
1 $H_o \leftarrow G$, $ad_o \leftarrow AD(G)$;
2 construct the parametric flow network $N_R$ and set $\lambda = ad_o$; obtain $H'_o$ from the min s-t cut of $N_R$;
3 while $l(ad_o, H_o \leftarrow H'_o) \neq 0$ do
4     $ad_o \leftarrow AD(H_o)$, $\lambda \leftarrow ad_o$;
5     obtain $H'_o$ from the min s-t cut of $N_R$;
6 return $H_o$;
Time complexity. We note the bounds derived in [38] as follows.

Lemma 4.5.1. For any active node $v \in V(N)$, $d(v) \le 2|V(N)| - 1$. The value of $d(v)$ never decreases during the run of the algorithm. The total number of relabel steps is therefore $O(|V(N)|^2)$.

Lemma 4.5.2. The number of saturating push steps through any residual directed edge $(u, v)$ is at most one per value of $d(v)$. The total number of saturating push steps is therefore $O(|V(N)||E(N)|)$; there is an implementation in which each such step takes constant time.

Lemma 4.5.3. The total number of non-saturating push steps is $O(|V(N)|^2 |E(N)|)$; there is an implementation in which each such step takes constant time.

Lemma 4.5.4. There is a FIFO version of the preflow algorithm with time complexity $O(|V(N)|^3)$. There is a dynamic tree version of the preflow algorithm with time complexity $O(|V(N)||E(N)| \log(|V(N)|^2 / |E(N)|))$.
4.5.2 Parametric Flow Framework
Algorithm 8 shows how to find the CC using a tailored parametric preflow algorithm. It treats the progressively modified $ad_o$ as a parameterised capacity $\lambda$ in $N_R$, where $N_R$ will be discussed later in great detail. The overall structure of the algorithm is similar to Algorithm 6, i.e., it continuously generates $H_o$ with higher contextual weighted density until reaching the stop condition.
Incremental computation. During each iteration, the algorithm internally maintains the preflow labels by updating the labels computed in the previous iteration. Further, in order to compute $H'_o$, the preflow values and some edge capacities are updated according to the $H_o$ generated in the previous iteration. More details about the update procedure are discussed after introducing our newly designed parametric flow network $N_R$.
4.5.3 CC Parametric Flow Network
In this subsection, we present the parametric flow network $N_R$ used in Algorithm 8 and prove that it makes Algorithm 8 find the CC correctly.

Parametric flow network basics. A parametric flow network is a flow network in which the edge capacities are functions of a parameter $\lambda$ satisfying the following constraints. Firstly, $c_\lambda(s, v)$ is a nondecreasing function of $\lambda$ for all $v \neq t$. Secondly, $c_\lambda(v, t)$ is a nonincreasing function of $\lambda$ for all $v \neq s$. Thirdly, $c_\lambda(u, v)$ is constant for all $u \neq s$ and $v \neq t$. The maximum flow or minimum cut of a parametric network is the maximum flow or minimum cut w.r.t. a particular value of the parameter $\lambda$.
Parametric flow network for CC search. The $N_R$ in Algorithm 8 can be generated using Algorithm 7 by: (1) reversing the direction of all edges, (2) exchanging $s$ and $t$, and (3) keeping the same capacities for all edges except those directed from $s$. The capacities of the edges directed from $s$ are considered as a function of $\lambda$, i.e., $c_\lambda(s, v) = \lambda$, and $\lambda$ is updated to the newest generated $ad_o$. That is, in our designed $N_R$, $\lambda \in \{ad_0, \ldots, ad_o\}$, where $ad_0 = AD(G)$ and $ad_o = AD(H_o)$; when the iteration stops, $H_o$ is the CC. The designed $N_R$ clearly satisfies the capacity constraints of a parametric flow network.
To ensure the correctness of Algorithm 8, $N_R$ shall satisfy the two constraints below, which we prove with corresponding lemmas.

Objective invariant. The min s-t cut of $N_R$ shall be equivalent to $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$.

Valid preflow and labelling invariant. After replacing the capacities of all edges $\{(s, v)\}$ in $E(N_R)$ with the newly generated $ad_o$ and setting their flows to $ad_o$, the updated $N_R$ shall still maintain a valid preflow and labelling.
Lemma 4.5.5. The min s-t cut of $N_R$ for some $ad_o$ is equivalent to $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$.

Proof sketch. The overall proof idea is similar to that of Lemma 4.4.4. Firstly, for any min s-t cut, the vertex nodes in $T$ correspond to a valid subgraph $H_o$ of $G$; otherwise, the s-t cut must break some edges with infinite capacity.

Let $(S_m, T_m)$ be a min s-t cut; such a cut breaks no edges with infinite capacity. Therefore, $c(S_m, T_m)$ can be expressed as $\sum_{v \in S_m} c(v, t) + \sum_{v \in T_m} c(s, v)$. Then, for any other s-t cut $(S, T)$ that does not break edges with infinite capacity, the inequality $c(S_m, T_m) \le c(S, T)$ shall hold.

Now, $\sum_{v \in S_m} c(v, t) + \sum_{v \in T_m} c(s, v)$ equals $\sum_{\triangle \in Tri(S_m)} w(\triangle) + \sum_{e \in E(S_m)} w(e) + \sum_{v \in V(T_m)} ad_o$, in which $Tri(S_m)$ denotes the set of triangles represented by triangle nodes in $S_m$, $E(S_m)$ denotes the set of edges represented by edge nodes in $S_m$, and $V(T_m)$ denotes the set of vertices represented by vertex nodes in $T_m$. Since $\sum_{\triangle \in Tri(S_m)} w(\triangle) + \sum_{e \in E(S_m)} w(e)$ equals $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - \sum_{\triangle \in Tri(T_m)} w(\triangle) - \sum_{e \in E(T_m)} w(e)$, we can derive that $c(S_m, T_m) = \sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - \sum_{\triangle \in Tri(T_m)} w(\triangle) - \sum_{e \in E(T_m)} w(e) + \sum_{v \in V(T_m)} ad_o$.

Applying similar operations to $(S, T)$, we get the inequality: $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - \sum_{\triangle \in Tri(T_m)} w(\triangle) - \sum_{e \in E(T_m)} w(e) + \sum_{v \in V(T_m)} ad_o \le \sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - \sum_{\triangle \in Tri(T)} w(\triangle) - \sum_{e \in E(T)} w(e) + \sum_{v \in V(T)} ad_o$. After simplification we get: $\sum_{\triangle \in Tri(T_m)} w(\triangle) + \sum_{e \in E(T_m)} w(e) - \sum_{v \in V(T_m)} ad_o \ge \sum_{\triangle \in Tri(T)} w(\triangle) + \sum_{e \in E(T)} w(e) - \sum_{v \in V(T)} ad_o$. Considering $H_o$ as the subgraph induced by the vertex nodes in $T_m$, the equivalence between the min s-t cut of $N_R$ and $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$ is obvious.
Lemma 4.5.6. After replacing the capacities of all edges $\{(s, v)\}$ in $E(N_R)$ with the newly generated $ad_o$ and setting their flows to $ad_o$, the updated $N_R$ still maintains a preflow and a valid labelling.

Proof sketch. After calculating $ad_o$ with the push-relabel algorithm, the preflow in the previous $N_R$ has become a flow. The push-relabel algorithm maintains a valid labelling throughout its run, and replacing the capacities of all $\{(s, v)\}$ in $E(N_R)$ with the newly generated $ad_o$ does not modify the labels; therefore, the valid labelling is still maintained after updating the capacities.

On the other hand, after deriving the new $ad_o$, $N_R$ holds a maximum flow, i.e., $e(v) = 0$ shall hold for every node $v \in V(N_R) \setminus \{s, t\}$. After setting the flows of all $\{(s, v)\}$ in $E(N_R)$ to the newly generated $ad_o$, the flow in $N_R$ becomes a preflow, since the new $ad_o$ is greater than the previously generated $ad_o$.
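The capacity and flow update that Lemma 4.5.6 reasons about can be sketched as follows (a hedged, dict-based illustration; the state layout and names are hypothetical):

```python
def update_source_edges(cap, flow, excess, source_edges, new_ad):
    """Parametric update step: raise lambda on every (s, v) edge to the
    new density guess and saturate it, so the previous maximum flow
    becomes a preflow again while the labels stay untouched."""
    for (s, v) in source_edges:
        delta = new_ad - flow[(s, v)]   # non-negative: ad_o never decreases
        cap[(s, v)] = new_ad            # c_lambda(s, v) = lambda = new ad_o
        flow[(s, v)] = new_ad           # keep the source edge saturated
        excess[v] += delta              # v may become active again
```

Only the nodes adjacent to $s$ regain excess, so the subsequent discharge resumes from the previous labels rather than from scratch, which is the source of the parametric speed-up.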
The correctness of Lemmas 4.5.5 and 4.5.6 guarantees that the designed NR
makes Algorithm 8 output CC.
4.5.4 Time Complexity
The time complexity of Algorithm 8 is $O(|V(N)|^3)$. We briefly discuss the proof idea. Overall, even though Algorithm 8 progressively modifies the preflow, it never reduces $d(v)$ for any $v \in V(N) \setminus \{s, t\}$ during its run. Since $d(v)$ is also bounded by $2|V(N)| - 1$ (Lemma 4.5.1), Algorithm 8 can be considered as solving a single min s-t cut problem. A more detailed proof leading to this faster implementation can be found in [38].
4.6 Approximation Algorithm
We have demonstrated via exact algorithms that CC can be found in polynomial time. However, despite the solution optimality, the runtime bottleneck lies in finding a min s-t cut in a large flow network. For further scalability of CC search, we instead look for a greedy approximation algorithm that trades solution accuracy for speed. Our proposed algorithm is inspired by the peeling strategy in [97, 61]. In our case, the algorithm iteratively removes the vertex contributing the least to the context score. The main challenge is to derive a guaranteed approximation ratio and runtime. Next, we present this approximation algorithm as Algorithm 9, followed by its approximation ratio and time complexity.
Algorithm 9: Approximate CC
Data: G
1: Ho ← ∅;
2: d_o ← 0;
3: while V(G) ≠ ∅ do
4:   v′ ← argmin_{v∈V(G)} {∑_{△∈Tri(v,G)} w(△) + ∑_{e∈E(v,G)} w(e)};
5:   E(G) ← E(G) \ E(v′, G); V(G) ← V(G) \ {v′}; G ← (V(G), E(G));
6:   if AD(G) > d_o then
7:     Ho ← G; d_o ← AD(G);
8: return Ho;
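The peeling loop above can be sketched as a minimal runnable illustration (our own simplification, not the thesis implementation: weights are dropped, so `AD` here counts unweighted triangles and edges, and `d_o` tracks the best density seen):

```python
from itertools import combinations

def triangles(adj):
    """Enumerate triangles as sorted vertex triples."""
    tris = set()
    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj[v]:
                tris.add(tuple(sorted((u, v, w))))
    return tris

def approx_cc(adj):
    """Greedy peeling: repeatedly remove the vertex with the smallest
    triangle + edge contribution; keep the densest subgraph seen."""
    adj = {u: set(vs) for u, vs in adj.items()}
    best, d_o = None, 0.0
    while adj:
        tris = triangles(adj)
        edges = sum(len(vs) for vs in adj.values()) // 2
        ad = (len(tris) + edges) / len(adj)   # unweighted AD(G)
        if ad > d_o:
            best, d_o = set(adj), ad
        # score(v) = #triangles containing v + #edges incident to v
        score = {u: len(adj[u]) for u in adj}
        for t in tris:
            for u in t:
                score[u] += 1
        v = min(score, key=score.get)         # least contributing vertex
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return best, d_o
```

Recomputing all triangles in every iteration is done here only for clarity; the indexed implementation discussed below updates the affected scores incrementally.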
Theorem 4.6.1. Algorithm 9 is a 1/3-approximation for finding CC.
Proof sketch. Let Ho be the optimal community and, for each v ∈ V(Ho), let Tri(v, Ho) be the set of triangles in Tri(Ho) containing v. Further, let E(v, Ho) be the set of edges in E(Ho) containing v. Since AD(Ho) ≥ AD(H′) where H′ = G(V(Ho) \ {v}), we can derive the following inequality:

    ∑_{△∈Tri(v,Ho)} w(△) + ∑_{e∈E(v,Ho)} w(e) ≥ d(Ho)    (4.10)
Now, let H be the subgraph in the iteration when the first vertex v ∈ V(Ho) is removed. Clearly, Ho ⊆ H, and for each vertex u ∈ V(H), by the greediness of Algorithm 9 the following inequality must hold:

    ∑_{△∈Tri(u,H)} w(△) + ∑_{e∈E(u,H)} w(e)
    ≥ ∑_{△∈Tri(v,H)} w(△) + ∑_{e∈E(v,H)} w(e)
    ≥ ∑_{△∈Tri(v,Ho)} w(△) + ∑_{e∈E(v,Ho)} w(e)    (4.11)
Further, by inequalities (4.10) and (4.11), we can conclude that ∑_{△∈Tri(u,H)} w(△) + ∑_{e∈E(u,H)} w(e) ≥ d(Ho).
The total weight of triangles and edges in H can be expressed as

    AD(H) × |V(H)| = ∑_{u∈V(H)} ((1/3) ∑_{△∈Tri(u,H)} w(△) + (1/2) ∑_{e∈E(u,H)} w(e))    (4.12)
Multiplying both sides by 3, we get the following inequality:

    3 × AD(H) × |V(H)| = ∑_{u∈V(H)} (∑_{△∈Tri(u,H)} w(△) + (3/2) ∑_{e∈E(u,H)} w(e))
    ≥ ∑_{u∈V(H)} (∑_{△∈Tri(u,H)} w(△) + ∑_{e∈E(u,H)} w(e))
    ≥ ∑_{u∈V(H)} AD(Ho)    (4.13)
From (4.13), AD(H) ≥ (1/3) AD(Ho) follows immediately.
Time complexity. With simple index structures, the runtime of Algorithm 9 can be bounded by O(|V(G)| log(|V(G)|) + |E(G)| log(|V(G)|) + |Tri(G)|). Intuitively, this is because when removing a vertex v from G, we only need to update the edge and triangle changes for the affected vertices in N(v, G): we can update just the context scores of the affected vertices, and in practice the number of affected edges and triangles is significantly less than |V(G)|. Next, we show how pre-computed indices help us locate the affected vertices, via their involved edges and triangles, after removing a vertex.
For each vertex, we hash the triangles it is involved in. This index takes O(|Tri(G)|) space and helps us quickly identify the affected triangles and their vertices after removing a vertex. In addition, the graph adjacency list helps us find the affected edges and their vertices. As such, there is an implementation of Algorithm 9 that runs in time O(|V(G)| log(|V(G)|) + |E(G)| log(|V(G)|) + |Tri(G)|).
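The per-vertex triangle index can be sketched as follows (a simplified, unweighted illustration; the function names are ours):

```python
from itertools import combinations

def build_triangle_index(adj):
    """Map each vertex to the set of triangles it participates in."""
    index = {u: set() for u in adj}
    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj[v]:
                t = tuple(sorted((u, v, w)))
                for x in t:
                    index[x].add(t)
    return index

def remove_vertex(adj, index, v):
    """Delete v; return the vertices whose context scores must be
    updated: v's neighbours (lost edges) and the co-members of v's
    triangles (lost triangles)."""
    affected = set(adj[v])
    for t in index[v]:
        for x in t:
            if x != v:
                affected.add(x)
                index[x].discard(t)
    for u in adj[v]:
        adj[u].discard(v)
    del adj[v]
    del index[v]
    return affected
```

Building the index costs O(|Tri(G)|) space, and each removal touches only the triangles and edges incident to the removed vertex, matching the amortised bound above.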
4.7 Discussions
4.7.1 Finding Large and Connected CC
Algorithms 4, 6 and 9 are primarily designed for finding the community with the highest context weighted density. Since adding a size constraint makes Problem 4.2.1 NP-hard (so no polynomial-time algorithm is expected), we instead rely on heuristics for reporting a larger connected CC. Our first step is running a depth first search on the Ho found by Algorithm 4, 6 or 9 and outputting the largest connected component in Ho as the CC. Then, for the flow network Algorithms 4 and 6, we bias towards reporting a larger Ho in every iteration. We achieve this by first ranking the min-cuts found by the preflow algorithm and then selecting the min s-t cut partition containing the largest connected component.
Ranking min-cuts. As discussed in Section 4.5.1, the push-relabel algorithm can generate two types of min s-t cut once the maximum flow has been computed. The first type maximises the size of S, while the second type maximises the size of T. We denote the first type of cut as (SM, T) and the second type as (S, TM). We assign them scores as follows. Let XSM be the vertex nodes in SM; the score of SM is defined as the size of the largest connected subgraph of G(XSM), denoted by cs(SM). Scores for T, S, and TM are defined in the same way.
Heuristics. Greedily, for Algorithm 4, when finding a min s-t cut after computing the maximum flow in every iteration, we always select the cut having S′ = argmax_{S′} {cs(S′) | S′ ∈ {SM, S}}. Similarly, for Algorithm 6, when finding a min s-t cut after computing the maximum flow, we always select the cut having T′ = argmax_{T′} {cs(T′) | T′ ∈ {TM, T}}.
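The score cs(·) is simply the size of the largest connected component of the subgraph induced by the cut's vertex nodes, computable with a plain BFS (an illustrative sketch; mapping cut partitions to vertex nodes is assumed done elsewhere):

```python
from collections import deque

def cs(adj, nodes):
    """Size of the largest connected component of the subgraph
    of `adj` induced by `nodes`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for s in nodes:
        if s in seen:
            continue
        comp, q = 0, deque([s])
        seen.add(s)
        while q:                       # BFS one component
            u = q.popleft()
            comp += 1
            for v in adj.get(u, ()):
                if v in nodes and v not in seen:
                    seen.add(v)
                    q.append(v)
        best = max(best, comp)
    return best
```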
4.7.2 State-of-the-art Maximum Flow Algorithms
The essential part of finding the exact CC is solving a series of maximum flow problems. Finding a maximum flow has been studied extensively. Its time complexity has been improved from unbounded [82] to O(|V(N)||E(N)|) for any graph, and to O(|V(N)|²/log(|V(N)|)) when N satisfies |E(N)| = O(|V(N)|) [79].
Our implementation. We use the FIFO preflow-based algorithm with the heuristics discussed in [21] as the subroutine for CC search in Algorithms 4 and 6. The main reasons are: first, as mentioned in [21], this approach is practically faster than other state-of-the-art algorithms with similar runtime complexity; second, it is relatively easy to implement.
Implied time complexities. Algorithm 4 uses a maximum flow algorithm as a black box; therefore, the best known algorithm for finding a maximum flow can be applied directly. As such, the time complexity of Algorithm 4 can be as fast as O(|V(N)||E(N)| log(|V(G)|)), since the constructed flow network cannot guarantee that |E(N)| = O(|V(N)|). Algorithm 6 can be implemented as O(|V(N)|) times the cost of one maximum flow computation, and its constructed flow network satisfies |E(N)| = O(|V(N)|), so it can run as fast as O(|V(N)|³/log(|V(N)|)). Algorithm 8 relies on the preflow algorithm, and since |E(N)| = O(|V(N)|) is satisfied, the best known algorithm makes its time complexity O(|V(N)||E(N)| log |V(N)|).
4.8 Experimental Results
In this section, we conduct comprehensive experiments to verify the effectiveness and efficiency of the proposed contextual community model and algorithms. All the algorithms are implemented in Java and run on a Mac with an Intel Xeon (3.8 GHz) CPU and 128GB main memory.
4.8.1 Experimental Setup
Datasets. We use a list of real-life and synthetic datasets in the experiments, as shown in Table 4.1. These selected datasets are often used to evaluate existing methods addressing the community search, attributed community search, and keyword search problems. Facebook is an attributed network dataset with ground-truth communities. The second group of datasets, i.e., DBLP1, DBpedia and DBLP2, are attributed network datasets without ground-truth communities, which have
Table 4.1: Statistic information of datasets

Dataset      #vertices  #edges  #attributes  #avg. triangles (|A(v)| > 3)
Facebook     1.9K       8.9K    3064         43
DBLP1        977K       3.5M    34213        121
DBpedia      8M         72M     45328        79
DBLP2        1.5M       2.4M    13064        32
Amazon       335K       926K    1674         19
DBLP3        317K       1M      1584         133
Youtube      1.1M       3M      5327         42
LiveJournal  4M         35M     11104        213
Orkut        3.1M       117M    9926         55
Gowalla      197K       951K    3890         29
Brightkite   58K        214K    2143         43
Foursquare   5M         28M     4970         16
Weibo        1M         32M     1976         11
Twitter      554K       2M      2252         16
been used in previous works [35, 58].
We also generate two types of synthetic attributed network datasets. The datasets Amazon, DBLP3, Youtube, LiveJournal, and Orkut have ground-truth communities but do not include attribute information on the vertices. To enrich the attribute information of these network datasets, we first generate a keyword pool and then randomly select 7 keywords for each community from the pool. After that, we assign each vertex 0-7 of the keywords relating to the community containing the vertex. Among the community members, 80% must contain at least one of the selected keywords. This synthetic data generation is similar to that in [53]. In addition, we generate the second type of synthetic datasets from spatial social network data. The datasets Gowalla, Brightkite, Foursquare, Weibo, and Twitter have the structure information and the location information of the users. To enrich the attribute information of the users, we retrieve the location-sensitive words for each location by calling the Google Maps API and assign these words to the corresponding users based on their locations.
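The first type of attribute generation can be sketched as follows (function name and parameters are ours; the 7-keyword budget per community, the 0-7 keywords per vertex, and the 80% coverage constraint follow the text):

```python
import random

def attach_attributes(communities, pool, per_comm=7, max_per_vertex=7,
                      coverage=0.8, rng=None):
    """Give each community `per_comm` keywords from `pool`, then assign
    every member 0..max_per_vertex of them; at least a `coverage`
    fraction of members must keep one or more community keywords."""
    rng = rng or random.Random(42)
    attrs = {}
    for members in communities:
        kws = rng.sample(pool, per_comm)
        members = list(members)
        # members forced to carry at least one community keyword
        must = set(rng.sample(members, int(len(members) * coverage)))
        for v in members:
            k = rng.randint(1 if v in must else 0, max_per_vertex)
            attrs[v] = set(rng.sample(kws, k))
    return attrs
```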
Compared models. To verify the performance of our proposed algorithms and community model, we also implemented variants of our model and other state-of-the-art methods. All the compared models and algorithms are listed
Table 4.2: Implemented algorithms for different community models

Model  Algorithm  Detail
CC     BinCC      Binary search based algorithm for searching CC
CC     MonoCC     Algorithm 8 for searching CC
ACC    ApxCC      Algorithm 9 for searching approximate CC
ECC    MonoCC     Algorithm 8 for searching the maximum contextual degree density community
TCC    MonoCC     Algorithm 8 for searching the maximum contextual triangle density community
ATC    ATC        LocATC [53] for searching attributed truss
in Table 4.2.
Context-weighted degree density (ECC). This model is a variant of the contextual community model, which uses contextual weighted edges as the graph motif to identify the most cohesive community Ho satisfying Ho = argmax_H {∑_{e∈E(H)} w(e) / |V(H)| | H ⊆ G}.

Context-weighted triangle density (TCC). Only contextual weighted triangles are used as the graph motif to find the most cohesive community Ho satisfying Ho = argmax_H {∑_{△∈Tri(H)} w(△) / |V(H)| | H ⊆ G}.
Attributed truss (ATC in [53]). We also implemented the best algorithm proposed in [53], denoted by ATC in this work, to search for the attributed truss. Given a set of query vertices, a set of query attributes, and two integers k and d, ATC finds the (k, d)-truss satisfying two constraints: (1) every edge in the truss has at least (k − 2) common neighbours; and (2) the longest shortest path in the (k, d)-truss is no greater than d.
Test queries. For each dataset, we randomly pick and test 200 keyword queries in each experiment. Each keyword query Q may contain 1-7 terms; by default, a keyword query contains four terms. We report the average performance over the 200 keyword queries as the result in the experimental evaluation. To increase the quality of the selected keyword queries, we select the query terms from the attributed vertices appearing in the ground-truth communities if the dataset has ground-truth information. For the other datasets, we randomly select the query terms and generate the test keyword queries.
In addition, we also generate 200 sets of test queries by following the similar
query generation in [53] where the size of the query vertices and the size of the query
attributes are set to 2.
Metrics. We use three evaluation metrics to verify the effectiveness of the proposed
[Figure 4-3: F1 scores for Facebook — comparing CC, ECC, TCC, ATC, and ACC on the ground-truth communities f_1 to f_10]
contextual community model by simulating different query scenarios.
F1 score. We use the F1 score metric to evaluate the quality of the resulting communities if the dataset has ground-truth communities. Given a resulting community Ho and a targeted ground truth community Hg, the F1 score is defined as

    F1(Ho, Hg) = 2 × prec(Ho, Hg) × recall(Ho, Hg) / (prec(Ho, Hg) + recall(Ho, Hg)),

where prec(Ho, Hg) is the average of precV(Ho, Hg) and precE(Ho, Hg), while recall(Ho, Hg) is the average of recallV(Ho, Hg) and recallE(Ho, Hg), defined as follows. The precision for vertices (edges) is precV(Ho, Hg) = |V(Ho) ∩ V(Hg)| / |V(Ho)| (precE(Ho, Hg) = |E(Ho) ∩ E(Hg)| / |E(Ho)|). The recall for vertices (edges) is recallV(Ho, Hg) = |V(Ho) ∩ V(Hg)| / |V(Hg)| (recallE(Ho, Hg) = |E(Ho) ∩ E(Hg)| / |E(Hg)|).
Edge density and Jaccard similarity. For the other datasets without ground truth community information, we measure the average edge density and Jaccard similarity. The edge density measures the structural cohesiveness of a community H, defined by ed(H) = |E(H)| / |V(H)|². The Jaccard similarity reflects the attribute cohesiveness of pairwise vertices from the perspective of the query context Q, defined as J(u, v) = |Q ∩ A(u) ∩ A(v)| / |A(u) ∩ A(v)|.
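Spelled out as code, the F1 computation reads as follows (a direct transcription of the formulas above; inputs are plain vertex and edge sets):

```python
def f1_score(Ho_V, Ho_E, Hg_V, Hg_E):
    """F1 over the averaged vertex/edge precision and recall."""
    def ratio(other, denom):
        # |other ∩ denom| / |denom|, guarding the empty case
        return len(other & denom) / len(denom) if denom else 0.0
    prec = (ratio(Hg_V, Ho_V) + ratio(Hg_E, Ho_E)) / 2   # intersections over Ho
    rec = (ratio(Ho_V, Hg_V) + ratio(Ho_E, Hg_E)) / 2    # intersections over Hg
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```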
4.8.2 Effectiveness Evaluation
4.8.2.1 Dataset with both attributes and ground truth communities
The Facebook dataset has 10 ground-truth communities and their attribute information. Figure 4-3 shows the F1 scores for this dataset. From the
[Figure: (a) F1 scores on ground-truth networks (Amazon, DBLP, Youtube, LJ, Orkut); (b) Edge density (Gowalla, Brightkite, Weibo, Twitter, Foursquare); (c) Average spatial distance in km (Gowalla, Brightkite, Weibo, Twitter, Foursquare) — comparing CC, ECC, TCC, ATC, ACC]
Figure 4-4: Effectiveness evaluation
results, we can see that the communities found by the CC method are clearly superior to those of all the other models. For the ground truth communities f_4 and f_8, CC finds communities with about 90% accuracy, close to the ground-truth communities. The F1 scores of the communities returned by the ACC method are slightly lower than those of the communities found by CC. The reason is that ACC finds approximate CCs with lower context weighted density than the exact CC.

Based on the reported F1 scores, ACC performs better than ATC except for f_6. The major reasons for the lower performance of ATC are: (1) ATC finds approximate communities for its model without a theoretical guarantee; and (2) the structural cohesiveness measure of ATC may discard many edges that are contained in the ground truth communities but do not satisfy the minimum number of common neighbours constraint. ECC has the lowest F1 scores in this experiment, because ECC tends to find large communities, which may result in much lower precision. Similarly, TCC cannot find communities that are relevant to the ground-truth communities. This is because the TCC model tends to find communities with high precision, which may lead to the small size
of the resulting communities. Small-sized communities often score lower on recall; hence, the TCC method has low F1 scores.
The reasons that our proposed CC search can find near ground truth communities are as follows. Firstly, in real datasets, ground truth communities are structurally dense but do not conform to strict constraints such as k-truss and k-core; our edge and triangle density based model can capture this characteristic. Secondly, ground truth communities in real datasets mostly have members sharing common interests, and our proposed contextual score can capture such common interest features if the query context correctly reflects them. The query contexts tested in this subsection are randomly selected from attributes contained in the ground truth communities. As such, the results are highly promising.
4.8.2.2 Dataset with the ground-truth communities only
Figure 4-4 (a) shows the experimental results when we run the queries on five datasets, i.e., Amazon, DBLP, Youtube, LiveJournal (LJ) and Orkut. Our proposed methods can find the communities closest to the ground-truth communities. The average F1 score of CC is over 90%. Although the average F1 score of ACC is about 80%, it still outperforms ATC on all datasets except DBLP. The main reasons for the superiority of our methods are similar to the explanation in Section 4.8.2.1.
4.8.2.3 Datasets with the spatial attributes
We also evaluate the edge density and average pairwise spatial distance of communities found by the different methods on five generated datasets, i.e., Gowalla, Brightkite, Foursquare, Weibo, and Twitter. The experimental results are shown in Figure 4-4 (b) and (c). A spatial community is considered desirable in many applications if it has high structural cohesiveness and low average pairwise spatial distance among the community members. Based on the edge density metric, TCC can find the communities with the highest edge density for all datasets. The edge
[Figure: (a) Edge density; (b) Aggregate Jaccard similarity — comparing CC, ECC, TCC, ATC, ACC on DBLP1, DBpedia, DBLP2]
Figure 4-5: Attributed networks with no ground-truth
[Figure: sensitivity to |Q| from 0 to 7 — (a) F1 score (Amazon, DBLP, Youtube, LJ, Orkut); (b) Edge density (DBLP1, DBpedia, DBLP2); (c) Agg. Jac. similarity (DBLP1, DBpedia, DBLP2); (d) Avg. spatial distance in km (Gowalla, Brightkite, Foursquare, Weibo, Twitter)]
Figure 4-6: Sensitivity w.r.t. query attribute size
[Figure: running time (s) vs. percentage of vertices for BinCC, MonoCC, ApxCC, ATC — (a) Facebook; (b) Amazon; (c) DBLP; (d) Youtube; (e) LiveJournal; (f) Orkut]
Figure 4-7: Scalability
density can be over 0.25 on average and reaches 0.33 for Gowalla. CC can also find communities with relatively high edge density; especially for Brightkite and Foursquare, the edge density of the found communities is as high as that of the communities found by TCC. Moreover, ACC performs almost the same as CC, which implies the approximate CCs found by ApxCC are almost as dense as the communities found by CC. In detail, TCC has the shortest average pairwise spatial distance for all datasets, i.e., less than 2 km on average. CC can find communities with short average pairwise spatial distance as well, especially for Weibo; the average pairwise spatial distance of the communities found by CC is similar to that of TCC. Figure 4-4 (b) and (c) also show that the communities found by CC and ACC have higher edge density and shorter average pairwise spatial distance than the other methods, i.e., they can find good spatial communities with high cohesiveness and short average pairwise distance.
4.8.2.4 Datasets with the real attributes
We also evaluate the proposed model using the datasets with real attributes, i.e., DBLP1, DBpedia and DBLP2. Figure 4-5 (a) and (b) illustrate the evaluation of the different methods using the edge density and aggregate Jaccard similarity metrics over different queries. Resulting communities with high edge density and high aggregate Jaccard similarity are treated as ideal results, because such communities are more cohesive in both structure and context, and larger in size.
From the results, we can see that TCC can find the communities with the highest edge density for all datasets. The edge density can reach 0.45 on average and 0.5 for DBLP1, but the aggregate Jaccard similarity is low due to the small size of the returned communities. Another observation is that ECC finds the communities with the highest aggregate Jaccard similarity but the lowest edge density among all methods. This indicates that the communities of ECC may be larger and more relevant w.r.t. the query but structurally very sparse. Furthermore, we can see that CC and
[Figure: running time (s) vs. percentage of vertices for BinCC, MonoCC, ApxCC, ATC — (a) DBpedia; (b) Gowalla; (c) Brightkite; (d) Foursquare; (e) Weibo; (f) Twitter]
Figure 4-8: Scalability cont.
[Figure: running time (s, log scale) vs. |Q| from 1 to 7 for BinCC, MonoCC, ApxCC — (a) Facebook; (b) Amazon; (c) DBLP; (d) Youtube; (e) LiveJournal; (f) Orkut]
Figure 4-9: Effect of |Q|
[Figure: running time (s, log scale) vs. |Q| from 1 to 7 for BinCC, MonoCC, ApxCC — (a) DBpedia; (b) Gowalla; (c) Brightkite; (d) Foursquare; (e) Weibo; (f) Twitter]
Figure 4-10: Effect of |Q| cont.
ACC achieve a balance of high edge density and high aggregate Jaccard similarity. Therefore, CC and ACC are ideal methods for retrieving cohesive and relevant communities with balanced benefit. We also find that the communities of ATC have lower edge density and lower aggregate Jaccard similarity compared with CC and ACC. Therefore, we conclude that the CC and ACC methods are superior to the other methods.
4.8.2.5 Varying Query Sizes
Figure 4-6 (a) reports the F1 score sensitivity w.r.t. the size of the query context for the CC method. For all datasets, the F1 score first grows as |Q| increases and then decreases after a certain threshold. The intuition is that we can find near ground-truth communities if the query context correctly describes the desired communities: a small Q cannot precisely describe the ground-truth communities, while a large Q induces noise disturbing the search. The experiment shows that the CC method finds results close to the ground-truth communities in all datasets when the query context includes 4 to 5 attribute keywords.
Figure 4-6 (b) reports the edge density sensitivity of the CC method when we vary the size of the query context. For all datasets, the edge density increases as |Q| becomes larger. When |Q| = 0, we consider the unweighted subgraph, which results in the highest edge density. Since the context constrained subgraphs are always sub-components of the unweighted subgraph, they become close to the unweighted subgraph as |Q| increases, which aligns with the observed trend. A similar trend can be observed for the CC method in terms of aggregate Jaccard similarity when the query size varies, as shown in Figure 4-6 (c). Figure 4-6 (d) illustrates the changes in the average pairwise spatial distance for the CC method when the size of Q varies. In this experiment, we are only interested in resulting communities whose members' average pairwise spatial distance is no more than 5 km; otherwise, the result is considered infinity. Since we randomly generate the query contexts and different contexts represent different locations, CC can only find low average
pairwise spatial distance for queries with an individual context. The reason is that, when the query context covers more than one location, CC includes users from different regions. As such, CC finds communities with an average pairwise spatial distance of around 2 km for all datasets if the application context describes a suburb.
4.8.3 Efficiency Evaluation
Scalability. Figures 4-7 (a) to (f) and Figures 4-8 (a) to (f) show the scalability evaluations on different datasets. For all datasets, our proposed methods MonoCC and ApxCC scale well as the data size increases. In particular, ApxCC can answer queries within several seconds for all datasets and is faster than ATC on all datasets. Note that ATC is a greedy algorithm with no effectiveness guarantee. MonoCC can find the exact CC within a few tens of seconds on most datasets. For all our proposed methods BinCC, MonoCC and ApxCC, the experiments match the discussed time complexities.
Varying |Q|. Figures 4-9 (a) to (f) and Figures 4-10 (a) to (f) show the running times as |Q| varies on different datasets. MonoCC outperforms BinCC by 4 to 7 times on average for all datasets, which demonstrates the power of the parametric algorithm. ApxCC performs much better, and its running time grows almost linearly as |Q| increases for all datasets. For MonoCC and BinCC, the running time increases significantly as |Q| becomes greater for all non-spatial attributed datasets, i.e., Figures 4-9 (a) to (f) and Figure 4-10 (a). However, for the spatial attributed datasets, their running time increases almost linearly with |Q|, as shown in Figures 4-10 (b) to (f). The reason is that in spatial attributed networks, vertices that are spatially close tend to be structurally close as well, and vice versa. As the query contexts for these networks describe suburbs, the algorithms tend to search communities in disjoint sub-networks as |Q| increases, which results in near-linear time w.r.t. |Q|.
4.9 Conclusion
In this chapter, we proposed a novel parameter-free community model, namely the contextual community (CC), which only requires a query to provide a set of keywords describing an application/user context. We proposed two network flow based exact algorithms that solve CC search in polynomial time and an approximation algorithm with an approximation ratio of 1/3. Our empirical studies on real social network datasets demonstrate the superior effectiveness of the CC search methods under different query contexts. Extensive performance evaluations also reveal the superb practical efficiency of the proposed CC search algorithms.
Chapter 5
Batch Keyword Query Processing
on Graph Data
Answering keyword queries on textual attributed graph data has drawn a great deal of attention from the database community. However, most graph keyword search solutions proposed so far primarily focus on a single query setting. We observe that for a popular keyword query system, the number of keyword queries received can be substantially large even in a short time interval, and the chance that these queries share common keywords is quite high. As such, answering keyword queries in batches can significantly enhance the performance of the system. Motivated by this, we study efficient batch processing of multiple keyword queries on graph data in this chapter. Realising that both finding the optimal query plan for multiple queries and finding the optimal query plan for a single keyword query on graph data are computationally hard, we first propose two heuristic approaches, which maximise keyword overlap and give preference to processing keywords with short lists. We then devise a cardinality based cost estimation model that takes both graph data statistics and search semantics into account. Based on the model, we design an A* based algorithm to find the globally optimal execution plan for multiple queries. We evaluate the proposed model and algorithms on two real datasets, and the experimental results demonstrate their efficacy.
Chapter map. In Section 5.1, we give an overall introduction to the problem of batch keyword query processing on graph data. Section 5.2 presents preliminaries and defines the problem formally. In Section 5.3, we present two heuristic rule based approaches: a shortest list eager one and a maximal overlap driven one. In Section 5.4, we propose a cost estimation model for estimating the cardinalities of r-join operations used for evaluating r-cliques. Based on this estimation model, we then discuss how to find the cost-based optimal query plan for multiple queries efficiently in Section 5.5. We show the experimental results in Section 5.6. Finally, we conclude this chapter in Section 5.7.
5.1 Introduction
We study the problem of batch processing of keyword queries on graph data. Our goal is to process a set of keyword queries as a batch while minimising the total time cost of answering the set of queries. Batch query processing (also known as multiple-query optimization) is a classical problem in the database community. In relational databases, Sellis et al. [91] studied multiple SQL query optimization. The key idea is to decompose SQL queries into subqueries and guarantee that each SQL query in the batch can be answered by combining a subset of subqueries. However, maintaining the intermediate results of all possible subqueries is challenging and leads to expensive space cost and extra I/O cost. To address this, Roy et al. [84] evaluated the tradeoff between reuse and recomputation of intermediate results for subqueries by comparing pipelining cost and reuse cost.
In addition, Jacob and Ives [55] addressed the problem of processing interactive keyword queries as a batch in relational databases. In their work, the keyword search semantics is defined by candidate networks [49], which requires knowing the relational data schema in advance. Batch query processing has also been studied in other contexts, e.g., spatial-textual queries [24], RDF/SPARQL [63], and XQuery [11].
After investigating batch query processing in different contexts and single keyword query processing in graph databases, we observe that none of the existing techniques can be applied to our problem, batch keyword query processing on graph data. The main reasons are the following. (1) Meaningful Result Semantics: the r-clique can well define the semantics of keyword search on graph data, as it can be used to discover the tightest relations among all the given keywords in a query [58], but no existing work studies batch query processing with this meaningful result semantics. (2) Complexity of the Optimal Batch Processing: it is an NP-complete problem to optimally process multiple keyword queries in a batch; each single query corresponds to several query plans, and we obviously cannot enumerate all possible combinations of single query plans to get the optimal query plan for multiple queries. (3) Unavailable Query Historic Information: unlike the batch query processing in [107], we do not assume that the result sizes of subqueries are known before the queries are actually run, because this kind of historic information is not always available.
Although we could simply evaluate the batch queries in a pre-defined order and re-use the intermediate results in later rounds as much as possible, there is no guarantee that the batch queries are run optimally. To address this, we first develop two heuristic approaches, which give preference to processing keywords with short lists and maximise keyword overlaps. We then devise a cardinality estimation cost model that considers graph connectivity and the result semantics of r-cliques. Based on the cost model, we develop an optimal batch query plan by extending the A* search algorithm. Since A* search in the worst case degenerates to exhaustive search, enumerating all possible global plans, we propose pruning methods that efficiently prune the search space to obtain the model-based optimal query plan.
5.2 Preliminaries and Problem Definitions
In this section, we introduce preliminaries and define the problem to be addressed
in this chapter.
5.2.1 Keyword Query on Graph Data
Native graph data. The native graph data G(V,E) consists of a vertex set V(G)
and an edge set E(G). A vertex v ∈ V(G) may contain some text, denoted as
v.KS = {v.key1, . . . , v.keyz}. We call a vertex that contains text a content vertex.
An edge e ∈ E(G) is a pair of vertices (vi, vj) (vi, vj ∈ V). The shortest distance
between any two vertices vi and vj, denoted dist(vi, vj), is the number of edges in the
shortest path between them.
Query processing for a single keyword query. Given a keyword query q =
{k1, . . . , km} on a graph G, the answer to q is a set of subgraphs of G, each generated
from an r-clique [58] of G, i.e., a set of vertices whose texts together match all
keywords in q and in which the distance between any two vertices is at most r. Given a
query q on G, we can obtain a set of r-cliques, denoted RC(q,G). For example,
Figure 5-1 shows a subgraph G′ of native graph data G. Given a query
q1 = {k1, k2, k3, k4} with r = 1, an answer to q1 is the thick vertex set in Figure 5-1,
in which the vertex set {v7, v8, v10} is a 1-clique.
Figure 5-2(a) shows a query plan for q1. A query plan is an operation tree containing
two types of operations: a selection operation σki(G), which selects the vertices of G
whose text matches keyword ki, and an r-join operation ⋈R, which joins two r-clique
sets of G. There can be many query plans that generate the final r-clique set,
depending on the processing order of the r-joins. To simplify the presentation, we use
Figure 5-2(b) to express the query plan shown in Figure 5-2(a).
To perform a selection σki(G) efficiently, we build an inverted list of
vertices for each keyword contained in graph G, so that the cost of a selection is O(1).
Therefore, the main cost of a query plan is determined by the costs of its r-join operations.
5.2.2 Batched Multiple-Keyword Queries
Consider a batch of keyword queries Q = {q1, . . . , qn} on a native graph G; the task
is to return the answers to every query qi ∈ Q.
[Figure 5-1 shows the native graph G with a highlighted subgraph G′ containing the
content vertices v1:k2, v2:k1,k4, v3:k2, v4:k1, v5:k1, v6:k1, v7:k1,k4, v8:k2, v9:k2,
and v10:k3.]
Figure 5-1: An example graph G and the answer subgraphs to q1 in the subgraph G′
[Figure 5-2 depicts four operation trees: (a) p1(q1), built from the selections σk1(G),
σk2(G), σk3(G), and σk4(G) combined by three r-joins ⋈R; (b) p1(q1) in simplified
keyword notation, joining k1, k2, k3, k4 via the intermediate results k1k2, k1k2k3,
and k1k2k3k4; (c) p2(q2), joining k5, k3, k1, k4 via k3k5, k1k3k5, and k1k3k4k5;
(d) p3({q1, q2}), a shared plan over k1, k2, k3, k4, k5 with intermediate results k1k3
and k1k3k4, from which both k1k3k4k5 and k1k2k3k4 are derived.]
Figure 5-2: Query plans for single queries q1, q2, and batch multiple queries {q1, q2}
A naive way to answer batched keyword queries is to run the queries one by
one. For example, we can run the query plans p1(q1) and p2(q2) shown in Figure 5-
2(b) and Figure 5-2(c) one after another. Obviously this is inefficient. Ideally, we hope to
share some (intermediate) results of processed queries to avoid duplicate computation.
For example, Figure 5-2(d) shows a query plan for the batch query {q1, q2}, where
q2 = {k1, k3, k4, k5}, in which the intermediate results of the r-joins σk1(G) ⋈R σk3(G) and
(σk1(G) ⋈R σk3(G)) ⋈R σk4(G) are shared by their upper-level r-join operations.
The cost of this query plan p3({q1, q2}) is the sum of the costs of all
r-join operations in the plan.
Problem 5.2.1. Given a batch of keyword queries Q = {q1, . . . , qn} on a native graph
G, our aim is to construct a query plan p(Q)opt for all queries in Q such that p(Q)opt
requires minimum cost. This is a typical NP-complete problem [90].
Finding the optimal query plan is non-trivial due to the following reasons.
• A single query corresponds to several query plans, and obviously we do not want
to enumerate combinations of query plans for multiple queries to get the optimal
one; and
• Let RC(K1 ∪ K2, G) = RC(K1, G) ⋈R RC(K2, G). The size of RC(K1 ∪ K2, G) is
not proportional to the sizes of RC(K1, G) and RC(K2, G); therefore, it is not
easy to predict the size of RC(K1 ∪ K2, G).
5.3 Heuristic-based Approaches
We propose two heuristic approaches that target a "good" query plan for answering
the queries in the batch Q.
5.3.1 A Shortest List Eager Approach
We first propose an approach, Basic, whose main idea is to process every query in
the batch Q = {q1, . . . , qn} in turn; for each query qi ∈ Q it starts from the shortest
inverted list and eagerly joins with existing intermediate results if they exist.

Rule 5.3.1. Given the inverted lists of two keywords ki and kj, RC({ki}, G) takes
precedence in r-joining with the existing intermediate results if the list of ki is
shorter than that of kj.
Algorithm 10 shows the details of Basic, which avoids processing
keywords that have already been processed. In each iteration, it checks whether the
keywords of the current query qi have been processed. For the processed keywords,
it reuses the intermediate results of the maximal set of processed keywords; for the
unprocessed keywords, it performs an r-join ⋈R between the processed intermediate
results and the RC({k}, G) with the smallest size.
Algorithm 10: Basic
Data: A graph G, queries Q = {q1, . . . , qn}
Result: R = {RC(q1, G), . . . , RC(qn, G)}
 1  Load index H of inverted lists of vertices for keywords;
 2  R ← ∅;
 3  for i from 1 to n do
 4      RC(qi, G) ← ∅;
 5      Processed keywords Kp ← ∅;
 6      foreach keyword k in qi do
 7          if k is processed in previous queries then
 8              Kp ← Kp ∪ {k};
 9          else
10              RC({k}, G) ← Hash vertices in the inverted list of keyword k;
11      Key ← Kp;
        // compute processed keywords
12      repeat
13          Find the maximal set of processed keywords Kmax;
14          RC(qi, G) ← RC(qi, G) ⋈R RC(Kmax, G);
15          Kp ← Kp − Kmax;
16      until Kp is empty;
        // compute unprocessed keywords
17      Rank all remaining RC({k}, G) by size in ascending order (k′1, . . . , k′m);
18      foreach remaining keyword k in qi do
19          RC(Key, G) ← RC(Key, G) ⋈R RC({k}, G);
20          Key ← Key ∪ {k};
21      RC(qi, G) ← RC(Key, G);
22  return R;
Clearly, the algorithm Basic is better than the naive approach, which simply
processes the queries one after another without reusing processed intermediate
results.
5.3.2 A Maximal Overlapping Driven Approach
The algorithm Basic does not make full use of the shared (overlapping) keywords
among the queries in the batch Q. Therefore, in this section we propose a new
approach based on the observation that more keywords often imply more processing
Algorithm 11: Overlap
Data: Q = {q1, . . . , qn}
Result: R = {RC(q1, G), . . . , RC(qn, G)}
 1  Algorithm Overlap()
 2      Calculate sharing factors in Q;
 3      repeat
 4          Calculate frequencies of unprocessed sharing factors;
 5          Choose the precedent sharing factor sf in Q with maximal |sf| · freq(sf) according to Rule 5.3.2;
 6          RC(sf, G) ← Cal(sf);
 7          Remove the subtree rooted at sf;
 8          Insert sf into a heap H;
 9          while H is not empty do
10              Pop the first factor from H into sf;
11              foreach factor s ⊃ sf with |s| − |sf| = 1 do
12                  RC(s, G) ← RC(sf, G) ⋈R RC(s\sf, G);
13                  if s is a query q in Q then
14                      Q ← Q\{q};
15                  else
16                      Insert s into H;
17              Remove sf;
18      until Q is empty;
19      return R;

20  Procedure Cal(sf)
21      RC(sf, G) ← ∅;
22      if sf contains sub-sharing factors SFc then
23          Choose the precedent sharing factor sfc among SFc with maximal |sfc| · freq(sfc);
24          RC(sf, G) ← Cal(sfc);
25      Let k′1, . . . , k′v be the keywords in sf − sfc, with inverted lists ranked in ascending order;
26      foreach keyword k′i do
27          RC(sf, G) ← RC(sf, G) ⋈R RC(k′i, G);
28      return RC(sf, G);
[Figure 5-3(a) depicts a query plan tree whose top row lists the seven queries
k1k2k3k4, k1k3k4k5, k2k3k4k6, k1k3k4k7, k3k4k6k7, k3k4k6k8, and k3k4k7k9, whose
middle row lists the sharing factors k2k3k4, k1k3k4, k3k4k6, and k3k4k7, and whose
bottom node is k3k4.]
(a) A query plan

Iteration 1: freq({k2, k3, k4}) = 2, freq({k1, k3, k4}) = 3, freq({k3, k4, k6}) = 3, freq({k3, k4, k7}) = 3, freq({k3, k4}) = 4
Iteration 2: freq({k3, k4, k6}) = 3, freq({k3, k4, k7}) = 2, freq({k3, k4}) = 2
Iteration 3: ∅
(b) Sharing factors and their frequencies

Figure 5-3: An example of processes in the algorithm Overlap
cost, and as a result, processing more frequently shared keywords first will benefit
more queries. Before continuing, we first define the notion of a sharing factor.
Definition 5.3.1. Sharing factor. Given a batch query Q = {q1, . . . , qn}, for any
two queries qi, qj ∈ Q (i ≠ j), the intersection of qi and qj expresses their
overlapping keywords and is called the sharing factor of qi and qj, denoted SF(qi, qj) =
qi ∩ qj.
Rule 5.3.2. Given a batch Q, let S be the set of sharing factors w.r.t. Q. For any
two sharing factors SFi and SFj in S, RC(SFi, G) takes precedence over RC(SFj, G)
if |SFi| · freq(SFi) > |SFj| · freq(SFj), where freq(SF) is the frequency of SF in
Q.
Algorithm 11 shows the algorithm Overlap, which is based on Rule 5.3.2. Given a batch of
queries Q, Overlap first calculates all sharing factors among the queries in
Q. It chooses a sharing factor sf with maximal |sf| · freq(sf) according to Rule 5.3.2
(line 5). Then, in line 6, it calculates the intermediate result RC(sf, G) by invoking
Cal(sf), which recursively processes sharing factors whose keywords are subsets of
the keywords in sf (lines 20–28). This intermediate result RC(sf, G) can benefit
all factors whose keywords are supersets of the keywords of sf, and this benefit
propagates to the queries in Q. Therefore, Overlap pushes all sharing factors that can
benefit from computing sf into a heap (line 8). Finally, it calculates the r-cliques of
the benefiting queries (lines 9–17) and removes these processed queries (line 14). The
algorithm repeats the above process until all queries have been processed.
Figure 5-3 shows an example illustrating the algorithm Overlap. Given a keyword
query batch Q containing the queries q1 = {k1, k2, k3, k4}, q2 = {k1, k3, k4, k5},
q3 = {k2, k3, k4, k6}, q4 = {k1, k3, k4, k7}, q5 = {k3, k4, k6, k7}, q6 = {k3, k4, k6, k8}, and
q7 = {k3, k4, k7, k9}, the algorithm Overlap first calculates the sharing factors shown
in Figure 5-3(b).

In the first iteration, the sharing factors {k1, k3, k4}, {k3, k4, k6}, and {k3, k4, k7}
all have the largest size-frequency product, 9 (see the first row in Figure 5-3(b)).
Without loss of generality, suppose {k1, k3, k4} is selected to be evaluated first. The
algorithm invokes Cal to calculate RC({k1, k3, k4}, G) as follows: since this calculation
may require intermediate results of other sharing factors whose keywords are subsets
of {k1, k3, k4}, Cal recursively finds the most promising sub-sharing factor of
{k1, k3, k4}, i.e., {k3, k4} in this case (it is also the only sub-sharing factor). After
{k3, k4} is processed, control returns to the previous recursion level, where {k1, k3, k4}
is processed using the result RC({k3, k4}, G). Finally, Cal returns RC({k1, k3, k4}, G).
The algorithm then processes all queries that can benefit from RC({k1, k3, k4}, G)
by pushing the sharing factors that are supersets of {k1, k3, k4} into a heap, i.e.,
{k1, k2, k3, k4}, {k1, k3, k4, k5}, and {k1, k3, k4, k7}. It then processes each sharing
factor s in the heap, calculating RC(s, G) and pushing its supersets into the heap
until a superset is an original query of the batch. After the first iteration, queries
q1, q2, and q4 have been processed. In the second iteration, only four queries are
left: q3, q5, q6, and q7. Their sharing factors and corresponding frequencies are shown
in the second row of Figure 5-3(b). The algorithm chooses {k3, k4, k6}, based on which
q3, q5, and q6 can be answered. Finally, in the last iteration, since only q7 is left, the
algorithm chooses {k3, k4, k7} to support query q7. The blue lines in Figure 5-3(a)
show the final query plan produced by the algorithm Overlap.
5.4 Cost Estimation for Query Plans
The maximal overlapping driven approach tries to maximise the sharing of
intermediate results. However, this does not mean that the overall cost of the query plan is
optimal. Therefore, we provide a cost-based solution to support multiple keyword
queries on graphs, which mainly contains two parts: (i) estimating the cost of a
query plan, and (ii) generating a globally optimal plan based on the estimated cost.

In this section, we propose a cost model to estimate the cost of a query plan. The
cost of a query plan is determined by the costs of the involved r-join operations.
Therefore, in Section 5.4.1 we analyse the cost of an r-join, and in Section 5.4.2 we
estimate the cardinality of the intermediate result of an r-join between two r-clique
sets. Finally, we present our cost estimation model for a query plan.
Algorithm 12: rJoin
Data: Two r-clique sets RC(Ki, G) and RC(Kj, G)
Result: RC(Ki ∪ Kj, G) ← RC(Ki, G) ⋈R RC(Kj, G)
 1  RC(Ki ∪ Kj, G) ← ∅;
 2  foreach rc′ ∈ RC(Ki, G) do
 3      foreach rc′′ ∈ RC(Kj, G) do
 4          FLAG ← true;
 5          foreach v ∈ rc′ do
 6              if dist(v, v′) > r for any v′ ∈ rc′′ then
 7                  FLAG ← false; break;
 8          if FLAG = true then
 9              RC(Ki ∪ Kj, G) ← RC(Ki ∪ Kj, G) ∪ {rc′ ∪ rc′′};
10  return RC(Ki ∪ Kj, G);
5.4.1 Cost of an r-Join
To estimate the size of an r-join between two r-clique sets, we first illustrate our
implementation of the r-join operation ⋈R.

Given two keyword sets Ki and Kj, let RC(Ki, G) and RC(Kj, G) be their
respective r-clique sets. Algorithm 12 shows the implementation of an r-join operation
between RC(Ki, G) and RC(Kj, G). For any r-clique pair ⟨rc′, rc′′⟩ (rc′ ∈ RC(Ki, G)
and rc′′ ∈ RC(Kj, G)), the r-join operation checks the vertex pairs ⟨v, v′⟩ from
⟨rc′, rc′′⟩ and requires dist(v, v′) ≤ r. To calculate dist(v, v′) efficiently, we pre-store
all shortest paths between every two vertices in G in a shortest path set SP(G).

The cost of the r-join operation o = RC(Ki, G) ⋈R RC(Kj, G) is then

    cost(o) = O(ni × nj × |Ki| × |Kj|),    (5.1)

where ni and nj are the numbers of r-cliques in RC(Ki, G) and RC(Kj, G),
respectively. This shows that the cost of an r-join operation is determined by its inputs
RC(Ki, G) and RC(Kj, G).
5.4.2 Estimating Cardinality of an r-Join Result
We observe that, given a query q, RC(q, G) can be derived by the following recursive
pipeline of r-join operations:

    RC(q, G) = { RC({k}, G)                        if q = {k},
               { RC(q\{k}, G) ⋈R RC({k}, G)        if |q\{k}| ≥ 1.    (5.2)

If q contains only one keyword k, the size |RC(q, G)| equals the length of the
inverted list L(k). So we only need to estimate the size of RC(q, G) = RC(q\{k}, G)
⋈R RC({k}, G). RC(q, G) merges vertices from RC(q\{k}, G) and RC({k}, G) such
that for any vertices v ∈ RC(q\{k}, G) and v′ ∈ RC({k}, G), dist(v, v′) ≤ r.

According to the r-join operation, a vertex v ∈ L(k) cannot contribute to the
result RC(q, G) if dist(v, v′) > r for every v′ ∈ RC(q\{k}, G). We call such a v an
invalid vertex w.r.t. the parameter r, and we can construct a valid inverted list
L^r_v(k) in which every vertex is valid w.r.t. r. Given a graph G and the parameter r,
the valid inverted lists of keywords can easily be constructed as follows. For each
vertex v ∈ V(G) being processed, we check every unprocessed vertex v′ whose
keywords v′.KS do not overlap with the keywords of v. If dist(v, v′) > r for all such
unprocessed vertices v′, we say v is invalid, and v will not appear in any keyword list.
A valid shortest path set SP^r_v(G) for all valid vertices in G can then be loaded
before any queries are received. For each v ∈ L^r_v(k), let pr(v) be the probability
that there appears a v′ ∈ RC(q\{k}, G) such that dist(v, v′) ≤ r. Then |RC(q, G)|
can be estimated as pr(v) × |L^r_v(k)| × |RC(q\{k}, G)|.
Estimating the cardinality of an r-join between two keyword inverted
lists. We first consider the simple case where q = {k1, k2}. Given a graph G, let
L^r_v(k1) and L^r_v(k2) be the valid inverted lists of keywords k1 and k2, respectively.
Then |RC(q, G)| can be estimated as

    |RC(q, G)| = pr(v) × |L^r_v(k1)| × |L^r_v(k2)|,    (5.3)

where pr(v) = |SP^r_v(G)| / |V(G)|², |V(G)|² is the number of shortest paths in G, and
|SP^r_v(G)| is the number of shortest paths in SP^r_v(G). As explained above, the
statistic |SP^r_v(G)| can be collected offline for a given graph G.
Estimating the cardinality of an r-join between an r-clique set and the
inverted list of a keyword. We use Equation 5.4 to iteratively estimate the number
of r-cliques in RC(q, G) when q has more than two keywords:

    |RC(q, G)| = (|SP^r_v(G)| / |V(G)|²)^(|q|−1) × |RC(q\{k}, G)| × |L^r_v(k)|,    (5.4)

where |q| > 2 and (|SP^r_v(G)| / |V(G)|²)^(|q|−1) is the probability pr(v).

Let a query plan p(q) w.r.t. a query q contain a list of r-join operations. The final
cost of p(q) is cost(p(q)) = Σ_{o ∈ p(q)} cost(o), where cost(o) is the cost of the r-join
operation o ∈ p(q) (see Equation 5.1).
5.5 Estimation-based Query Plans
Based on the estimated cost of query plans, we can find a globally optimal query
plan by utilising the A* algorithm. Using A*, we assess the generated partial plans
and expand only the most promising partial plan, finding the globally optimal plan
according to the estimated cost. In Section 5.5.1 we show how to construct a search
space for the A* algorithm, and in Section 5.5.2 we propose pruning approaches that
reduce the search space and make the search more efficient.
5.5.1 Finding Optimal Solution based on Estimated Cost
In this section, we adopt the solution in [92], which is based on the A* algorithm, to
model our problem as a state space search problem.

Search space. The search space S(Q) for a query batch Q = {q1, . . . , qn} can be
expressed as S(Q) = P(q1) × . . . × P(qn), where P(qi) (1 ≤ i ≤ n) is the set
of query plans for the single keyword query qi, each of which contains a pipeline of
r-join operations. A global query plan for the batch query Q has the form
⟨p1, . . . , pn⟩, where pi ∈ P(qi).

Therefore, each state si in the search space is an n-tuple ⟨pi1, . . . , pin⟩, where pij is
either NULL or a query plan for the j-th query qj ∈ Q. The search space contains
an initial state s0 = ⟨NULL, . . . , NULL⟩ and several final states SF, where each pij in
a final state sf ∈ SF corresponds to a query plan for qj ∈ Q. The value of a state
si = ⟨p1, . . . , pn⟩ equals the sum of the costs of all query plans in si, i.e.,

    v(si) = Σ_{p ∈ si, p ≠ NULL} cost(p).

The A* algorithm starts from the initial state s0 and finds a final state sf ∈ SF such
that v(sf) is minimal among all paths leading from s0 to any final state. Obviously,
v(sf) is the total cost required for processing all n queries.
For the A* algorithm to converge quickly, a lower bound function lb(si) is
introduced on each state si. This function is used to prune the portion of the search
space that will be explored. When A* starts from si−1 and determines whether it is
worth traversing a state si, it computes the lower bound lb(si) as

    lb(si) = v(si−1) + pre_cost(si),    (5.5)

where pre_cost(si) is the minimal optimistic approximation of the cost of traversing
from si−1 to the next state si. That is, starting from si−1, the A* algorithm needs at
least pre_cost(si) cost to arrive at si, where a new query plan p′ for query qi is to be
traversed. Let p′ contain a set of r-joins; then pre_cost(si) = Σ_{o ∈ p′} cost′(o),
where cost′(o) is the minimal optimistic cost of the r-join o. For each such r-join, if
it is shared with previous query plans in si−1, no extra cost is needed to compute it,
so cost′(o) = 0; otherwise, supposing this r-join operation can be reused at most nl
times by the remaining queries from qi to qn, cost′(o) = cost(o)/nl. That is,

    cost′(o) = { 0                 if o is shared in si,
               { cost(o)/nl        otherwise,    (5.6)

where cost(o) is the estimated cost of o defined in Equation 5.1.
Based on the above analysis, we propose our algorithm EstPlan, built on the A*
algorithm, as follows. If lb(si) < v(sj) for every other state sj (j ≠ i), EstPlan
continues to traverse the states pointed to from si; otherwise it jumps to a state sj with
smaller value and continues traversing the states pointed to from sj, since the best
global plan derivable from si cannot beat that of sj. Since the search space is a tree
structure and the lower bound we use is always less than or equal to the actual cost
(assuming maximal reuse), the first global plan generated by EstPlan that has the
lowest lower bound among all expanded states is the globally optimal plan under the
cost estimation model.
5.5.2 Reducing Search Space
In this section, we analyse how to reduce the search space of query plans. Recall that,
for a particular keyword query q in the batch, we eventually choose only one query
plan to evaluate q. During the evaluation process, some plans in P(q) may be found
not promising enough to be the chosen plan for this query, and such plans can be
safely pruned. We introduce two theorems serving as the pruning conditions.
Theorem 5.5.1. Let pi, pj ∈ P(q) be any two query plans of the single query q. The
plan pi can be pruned if cost(pi) > cost(pj) and pi does not contain a sharing factor
that is not contained in pj, i.e., SF(pi) ⊆ SF(pj), where SF(p) denotes the set of
sharing factors of plan p.
Proof. Theorem 5.5.1 requires both conditions to be met in order to prune pi, i.e.,
(a) cost(pi) > cost(pj) and (b) SF(pi) ⊆ SF(pj). We prove by contradiction.

Case 1: if cost(pi) ≤ cost(pj), then pi is no more expensive than pj. In the case
that pj has no factors shared with other queries in Q, plan pi is always at least as
good as plan pj. As a result, pi cannot be pruned.

Case 2: if SF(pi) ⊄ SF(pj), there exists a sharing factor SF (SF ∈ SF(pi) and
SF ∉ SF(pj)) that is shared with another query q′ ∈ Q, q ≠ q′. Since the computation
of SF is shared between q and q′, the actual cost of pi must be less than expected
and may even be smaller than cost(pj). Consequently, pi cannot be pruned.
Continuing from the proof of Case 2 in Theorem 5.5.1, let SF be a sharing factor
with SF ∈ SF(pi) and SF ∉ SF(pj). When pi is chosen as the query plan of q and
is evaluated, the best case is that the intermediate result RC(SF, G) has already
been computed in a query plan of q′, and plan pi simply reuses the result. In that
case, the actual cost of pi is cost(pi) − cost(SF). Accordingly, we have Theorem 5.5.2
as follows:
Theorem 5.5.2. Let pi, pj ∈ P(q) be any two query plans of a single query q, and let
SF(pi), SF(pj) denote the sharing factors of pi and pj, respectively. The plan pi can
be pruned if cost(pi) − Σ_{SF ∈ SF(pi)\SF(pj)} cost(SF) > cost(pj).
Proof. If cost(pi) − Σ_{SF ∈ SF(pi)\SF(pj)} cost(SF) > cost(pj), then even in the best
case, where pi reuses as much shared computation of sharing factors as possible, pi
is still more expensive than its counterpart pj. The largest possible reusable cost is
Σ_{SF ∈ SF(pi)\SF(pj)} cost(SF), which is the maximum cost that pi can save, on the
condition that the computation of the sharing factors in SF(pi)\SF(pj) has been
done in other queries of the batch Q. The intuition is that if pi's minimal possible
cost is already larger than pj's cost, pi can be safely pruned.
Discussion. All our proposed frameworks can be extended to other search semantics,
except for the semantics-specific estimation technique discussed in Section 5.4.
5.6 Experimental Results
In this section, we evaluate the Shortest List Eager Approach of Algorithm 10,
the Maximal Overlapping Driven Approach of Algorithm 11, and the A*-based
algorithm proposed in this chapter, denoted Basic, Overlap, and EstPlan,
respectively. Their performance is evaluated and compared by running them on different
batches of multiple queries over two real datasets. All tests were conducted on a PC
with a 2.5 GHz CPU and 16 GB of memory running Ubuntu 14.04.3 LTS. All
algorithms were implemented in GNU C++.
5.6.1 Datasets and Tested Queries
Datasets. We evaluated our algorithms on two real datasets.
1. DBLP dataset¹. We generated a graph from the DBLP dataset. The generated
graph contains 37,375,895 vertices and 132,563,689 edges, where each vertex
represents a publication and each edge represents a citation relationship
between papers.

2. IMDB dataset². We generated a graph from a processed IMDB dataset [46]. The
vertices in the generated graph represent users or movies; there are 247,753
users and 34,208 movies. The edges represent relations between users and
movies: users may rate or comment on movies. In the generated graph, the
edges consist of 22,884,377 rating relations and 586,994 commenting relations.

¹http://dblp.uni-trier.de/xml/
²http://grouplens.org/datasets/movielens/

Table 5.1: Keyword sets for DBLP and IMDB

          Number of distinct keywords    Keyword frequency range
    DBLP  100                            0.015-0.075
    IMDB  100                            0.011-0.045
Tested Queries. For each dataset, we randomly selected 100 keywords as the keyword
set used for producing the tested batch queries. Table 5.1 shows the frequency range of
the keywords in each keyword set. We created batches with different ratios of shared
keywords as follows. We randomly produced 5 subsets of keywords picked from the
100 keywords for the DBLP dataset, containing 10, 15, 20, 25, and 30 distinct
keywords, respectively. For each subset of keywords, we randomly picked 3 to 7 keywords
to form an individual keyword query, repeating this on each subset until we had
generated 50 keyword queries as a query batch. We iterated the above process to
generate the tested batch queries for the DBLP and IMDB datasets. We use the
generated query batches as experimental input and report average results. In the
experimental studies, we fixed the number of keyword queries in a batch at 50.
Therefore, varying the number of distinct keywords in the subset from which a batch is
generated varies the amount of shared computation that the batch contains: if the
queries in a batch are generated from a small subset of keywords (e.g., 10 distinct
keywords), the shared computation in the batch is high; otherwise, it is low.
5.6.2 Evaluation of the Efficiency
We report the computational cost under various configurations in terms of the total
running time of the query batches.

Parameters. Parameters that may affect the batch processing efficiency include:
[Figure 5-4 plots running time (in seconds) and speedup against data size. Panels (a)
and (b) compare SERIAL, BASIC, OVERLAP, and EstPlan on DBLP (600 to 1800 MB)
and IMDB (128 to 640 MB), respectively; panels (c) and (d) show the speedups of
BASIC, OVERLAP, and EstPlan on the same datasets. All panels use 20 distinct
keywords and r = 3.]
Figure 5-4: Scalability and speedup studies
[Figure 5-5 plots the running time (in seconds) of BASIC, OVERLAP, and EstPlan:
(a) DBLP, varying r from 2 to 5 with 20 distinct keywords; (b) DBLP, varying the
number of distinct keywords from 10 to 30 with r = 3; (c) IMDB, varying r from 2
to 5 with 20 distinct keywords; (d) IMDB, varying the number of distinct keywords
from 10 to 30 with r = 3.]
Figure 5-5: Efficiency of multiple queries
the size of the dataset, r, and the number of distinct keywords. In the following
experiments, the default dataset size is the full size of DBLP and IMDB, the default
value of r is 3, and the default number of distinct keywords in the subsets from which
the query batches are generated is 20.
Scalability. We report the computational cost of the batch processing algorithms
Basic, Overlap, and EstPlan as we vary the dataset size of DBLP and IMDB.
To demonstrate the benefit of batch processing, we also implemented a serial query
processing algorithm, denoted SERIAL. The DBLP data size ranges from 600 MB
to 1800 MB in intervals of 300 MB, and the IMDB data size from 128 MB to 640 MB
in intervals of 128 MB. The other parameters are kept at their default values (r = 3
and 20 distinct keywords). Figures 5-4(a) and 5-4(b) show that the batch processing
algorithms are one order of magnitude faster than SERIAL on both the DBLP and
IMDB datasets.
Speedup. Figures 5-4(c) and 5-4(d) report the speedups of the batch processing
algorithms w.r.t. SERIAL. Overall, EstPlan outperforms all the other algorithms
on both tested datasets, having the highest average speedup. When the data size is
small, the speedup of EstPlan is relatively close to those of Basic and Overlap,
because evaluating the batch queries on small pieces of data is fast and the
optimisation overhead of EstPlan is non-trivial in this case. As the data size increases, the
speedups of Basic, Overlap, and EstPlan grow, which demonstrates the advantage
of batch processing. Due to the significant speedup of the batch processing
algorithms, in the rest of the experiments we focus on reporting and discussing the results
of the batch processing algorithms only.
Varying r. We show how the running time changes as we vary the value of r while
keeping the default settings for the data sizes and the number of distinct keywords.
Figures 5-5(a) and 5-5(c) show that when r is no less than 3, EstPlan is much faster
than all the other algorithms. When r is small, the r-clique computation cost is small
and the optimisation overhead of EstPlan dominates the overall running time;
because of this, EstPlan is close to the other batch processing algorithms when r = 2
in Figure 5-5(a). As r increases, the running times of all algorithms increase sharply,
because larger values of r yield more results for the keyword queries in a batch. Notice
that the average running time on IMDB is almost 10 times higher than on DBLP;
this is because the average connectivity of the graph generated from IMDB is much
higher than that of the graph from DBLP.
Varying the number of distinct keywords. We show how the running time varies
with the number of distinct keywords contained in the batch queries while r and
the data sizes are set to their default values. As discussed above, varying the number
of distinct keywords approximately changes the ratio of shared computation.
Figures 5-5(b) and 5-5(d) show that the time consumption of all algorithms grows as
the number of distinct keywords increases, because a larger number of distinct
keywords leads to less shared computation that Basic, Overlap, and EstPlan can
take advantage of.
135
The experiments demonstrate that reusing shared computations in a batch improves the efficiency of computation. EstPlan outperforms all the other algorithms in most configurations.
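To make the reuse principle concrete, the following toy Python sketch memoizes keyword-pair sub-query results across a batch. It only illustrates the "reuse if possible" idea, not the thesis' actual algorithms; `eval_pair` and the pair-level decomposition are hypothetical stand-ins for the real sub-query evaluation.

```python
# Toy illustration of batch-level reuse: each query is a set of keywords,
# and a "sub-query" here is a keyword pair whose intermediate result can
# be shared by every query in the batch that contains both keywords.
def evaluate_batch(queries, eval_pair):
    cache = {}                                  # shared sub-query results
    results = {}
    for q in queries:
        kws = sorted(q)
        partial = []
        for i in range(len(kws)):
            for j in range(i + 1, len(kws)):
                pair = (kws[i], kws[j])
                if pair not in cache:           # "reuse if possible"
                    cache[pair] = eval_pair(pair)
                partial.append(cache[pair])
        results[frozenset(q)] = partial         # final r-join step omitted
    return results, len(cache)
```

For two queries {a, b, c} and {b, c, d}, only five distinct pairs are evaluated instead of six, because the pair (b, c) is computed once and reused.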
5.6.3 Evaluation of Effectiveness
In this section, we assess the effectiveness of our proposed cardinality-based computational cost estimation model and the pruning effectiveness of Theorem 5.5.1 and Theorem 5.5.2.
For a given batch of queries, EstPlan first generates a plan for the batch query based on the proposed cost estimation model and then executes the queries or sub-queries according to the generated plan. The total time is taken as the exact computational cost of the batch of queries. To measure the effectiveness of the cost model, we also need a ground truth plan. We run a large number of alternative plans for the batch and select the one consuming the minimal time cost. The selected plan is treated as the ground truth plan in our experiment. The effectiveness of the cost estimation can then be computed as the ratio of the time cost of running EstPlan to the time cost of running the ground truth plan. This ratio is always no less than one, and a smaller ratio means higher effectiveness, since it indicates that the plan generated by our cost model closely approaches the ground truth plan. Note that the running times reported in this section exclude the plan generation time.
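The ratio described above can be written down directly; the following small sketch is an illustration of the measurement, with illustrative function and parameter names (the ground truth is simply the fastest plan observed among all plans tried).

```python
def cost_model_effectiveness(estplan_time, alternative_plan_times):
    """Ratio of EstPlan's execution time to the ground truth plan's time.

    The ground truth is the fastest plan among all plans tried (including
    EstPlan's own), so the ratio is always >= 1, and values close to 1
    mean the cost-model-generated plan is close to optimal. Plan
    generation time is assumed to be excluded from both measurements.
    """
    ground_truth_time = min(alternative_plan_times + [estplan_time])
    return estplan_time / ground_truth_time
```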
For an individual query, two factors affect the effectiveness of its cost estimation: the number of keywords in the query and the value of r. This is because the cost estimation relies on cardinality estimation, which in turn depends on r and the number of keywords (Section 5.4). In the following experimental configuration, the default number of keywords is 4 and the default value of r is 3.
Varying r. We show how the effectiveness of our cost estimation model varies with different values of r when the number of keywords in batch queries is set to the default value. Figure 5-6(a) and Figure 5-6(c) show that, for both DBLP and IMDB, the ratio of the estimated time cost decreases when we vary the value of r from 2
[Figure 5-6 panels: running time ratio (y-axis) for (a) DBLP, average number of query keywords = 4, r varied from 2 to 5; (b) DBLP, r = 3, number of query keywords varied from 3 to 7; (c) IMDB, average number of query keywords = 4; (d) IMDB, r = 3; (e) DBLP, r = 3, batch size varied from 50 to 90; (f) IMDB, r = 3.]
Figure 5-6: Accuracy of cardinality estimation
to 5. As a lower ratio means higher effectiveness, the experimental study indicates that the query plan based on our cost estimation model can approach the ground truth plan for the query evaluation. This is because, when r is high, the results of r-cliques on the graph data generated from DBLP and IMDB tend to be Cartesian products of content vertices, and the proposed cardinality estimation equation follows the same tendency, which leads to more effective cost estimation.
Varying the average number of query keywords in query batches. We show how the effectiveness of the cost estimation model varies when the individual queries in the batch contain more keywords, with r set to the default value. Figure 5-6(b) and Figure 5-6(d) show that, on both the DBLP and IMDB datasets, the ratio increases with the individual query length. As a larger ratio means lower effectiveness of the cost estimation, the experimental study implies that the effectiveness drops as the length of individual queries increases.
Varying the batch size. We also study how the effectiveness of the cost estimation model varies when we change the batch size, with r set to the default value. We vary the batch size from 50 to 90 with an interval of 10. As shown in Figure 5-6(e) and Figure 5-6(f), the ratio does not change much. The effectiveness drops slightly as the size of query batches increases for both DBLP and IMDB. This indicates that our proposed cost estimation model is stable for both large and small query batches.
Pruning effectiveness of theorems. Here we show the effect of the proposed pruning methods discussed in Theorem 5.5.1 and Theorem 5.5.2. We measure the plan generation time of EstPlan for obtaining the optimal plan with two inputs: (1) the pruned search space (denoted as P) and (2) the non-pruned search space (denoted as NP). In Figure 5-7(a) and Figure 5-7(c), we compare their plan generation times for batches with different amounts of shared computation. Figure 5-7(a) and Figure 5-7(c) show that, as the number of distinct keywords in batches increases (which decreases the amount of shared computation contained in a batch), the plan generation time of EstPlan with the pruned search space decreases, while that with the non-pruned search space is independent of the amount of shared computation contained in batches. This is because the pruning
[Figure 5-7 panels: (a) running time comparison on DBLP and (c) on IMDB, plotting plan generation time in seconds for NP and P against the number of distinct keywords (10 to 30); (b) and (d) pruning effectiveness on DBLP and IMDB over the same range.]
Figure 5-7: Pruning effectiveness
effect is associated with shared r-join operations, and less sharing results in higher pruning effectiveness. On the other hand, for the non-pruned global optimal plan search space, plan generation is independent of the keyword set size but depends on the keyword batch size and the keyword query length. It is noticeable that the plan generation time of EstPlan with the pruned search space is on average five times less than that with the non-pruned search space. Figure 5-7(b) and Figure 5-7(d) show the pruning effectiveness in terms of the ratio between the average number of global plans in the pruned space and the average number of global plans in the non-pruned search space. A higher ratio indicates better pruning effectiveness. On both datasets, the ratio increases with more distinct keywords in a batch (which decreases the amount of shared computation contained in a batch), representing better pruning effectiveness. It is noticeable that the pruning effectiveness is over 0.75 on average for both the DBLP and IMDB datasets.
5.7 Conclusion
In this chapter, we have studied a new problem: batch keyword query processing on graph data, with the r-clique used as the keyword query result semantics. We developed two heuristic algorithms to find good query plans, both based on reusing shared computations between multiple queries: the shortest list eager approach (reuse if possible) and the maximal overlapping driven approach (reuse as much as possible). To run the batched queries optimally, we devised an estimation-based cost model to assess the computational cost of possible sub-queries, which is then used to identify the optimal plan for the batch query evaluation. We have conducted extensive experiments to test the performance of the three algorithms on the DBLP and IMDB datasets. The cost estimation based approach has been identified as the best solution.
Chapter 6
Conclusion and Future Work
In this chapter, we summarise the principal contributions made in this thesis and
discuss some interesting future research directions that can be further explored.
6.1 Conclusion
In this thesis, we studied how to search cohesive subgraphs, i.e. the k-truss, the densest subgraph and the clique, in spatially or textually attributed graph data, and proposed efficient algorithms for finding the studied cohesive subgraphs. In particular, three cohesive subgraph models with different objectives were explored: (1) searching spatially attributed k-trusses that are structurally and spatially cohesive, for the purpose of discovering co-located communities; (2) searching textually attributed densest subgraphs that are structurally cohesive and textually correlated, for discovering contextual communities; and (3) efficiently processing multiple textually attributed r-cliques, aiming to answer multiple keyword queries on graph data.
Towards objective (1), we studied the maximum co-located community search problem in large-scale social networks. We proposed a novel community model, the co-located community, considering both social and spatial cohesiveness. We developed efficient exact algorithms to find all maximum co-located communities, and designed an approximation algorithm with guaranteed spatial error ratios. We further improved the performance using the proposed TQ-tree index. We conducted extensive experiments
on large real-world networks, and the results demonstrate the effectiveness and effi-
ciency of the proposed algorithms.
Towards objective (2), we proposed a novel parameter-free community model, namely the contextual community (CC), which only requires a query to provide a set of keywords describing an application/user context. We proposed two network flow based exact algorithms to solve CC search in polynomial time, and an approximation algorithm with an approximation ratio of 1/3. Our empirical studies on real social network datasets demonstrate the superior effectiveness of the CC search methods under different query contexts. Extensive performance evaluations also reveal the superb practical efficiency of the proposed CC search algorithms.
Towards objective (3), we studied a new problem: batch keyword query processing on graph data, with the r-clique used as the keyword query result semantics. We developed two heuristic algorithms to find good query plans, both based on reusing shared computations between multiple queries: the shortest list eager approach (reuse if possible) and the maximal overlapping driven approach (reuse as much as possible). To run the batched queries optimally, we devised an estimation-based cost model to assess the computational cost of possible sub-queries, which is then used to identify the optimal plan for the batch query evaluation. We have conducted extensive experiments to test the performance of the three algorithms on the DBLP and IMDB datasets. The cost estimation based approach has been identified as the best solution.
6.2 Future Work
New models. In a number of real networks, attributed graph data can be represented as multi-dimensional graphs. To fit a certain application scenario, the cohesiveness requirement for each dimension of a multi-dimensional graph could be very different. As such, various combinations of different cohesive subgraph models can be expected, motivating the proposal of new cohesive subgraph models and the design of efficient algorithms for them.
More theoretical studies. The hardness of enumerating maximal cliques in a unit disk graph (UDG) is still an open problem. Although there is a study on the maximal clique enumeration problem in some special graphs [98], the proof ideas and gadgets they proposed are difficult to adapt to prove the hardness of maximal clique enumeration in a UDG. In addition, most existing hardness proofs for UDG problems are reductions from the same problems on planar graphs. However, since UDGs are clearly more complicated than planar graphs, problems that are easy on planar graphs are not necessarily easy on UDGs. Therefore, it is possible that enumerating maximal cliques in a UDG is intractable.
Better algorithms. Improving the time complexity of the maximum density subgraph problem is still a challenging problem. All existing maximum density subgraph algorithms rely on solving a series of min s-t cut problems, and the most efficient one relies in particular on the push-relabel min s-t cut algorithm. However, the push-relabel algorithm has been outperformed by the algorithm proposed in [79]. We observe that, when using the algorithm in [79] to solve one min s-t cut problem in the series required for finding a maximum density subgraph, some computations of the previously solved min s-t cut can be reused, which may lead to a faster algorithm for the maximum density subgraph problem than all existing algorithms.
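To make the "series of min s-t cuts" concrete, here is a self-contained Python sketch of Goldberg's classical binary search reduction [42] for the maximum density subgraph. It uses a plain Edmonds-Karp max-flow rather than push-relabel or the faster algorithm of [79], and does not attempt the computation reuse suggested above; the function names are illustrative, and graph vertices are assumed not to be named 's' or 't'.

```python
from collections import defaultdict, deque

def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp max flow; returns the source side of a min s-t cut.
    `cap` maps directed pairs (u, v) to capacities."""
    flow = defaultdict(float)
    adj = defaultdict(set)
    for (u, v) in cap:
        adj[u].add(v)
        adj[v].add(u)

    def residual(u, v):
        return cap.get((u, v), 0.0) - flow[(u, v)] + flow[(v, u)]

    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:        # BFS for a shortest augmenting path
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and residual(u, v) > 1e-9:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return set(parent)              # nodes still reachable from s
        path, v = [], t
        while parent[v] is not None:        # reconstruct the augmenting path
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual(u, v) for u, v in path)
        for u, v in path:
            flow[(u, v)] += bottleneck

def densest_subgraph(nodes, edges):
    """Goldberg-style exact maximum density subgraph via a series of
    min s-t cuts driven by binary search on the density guess g."""
    m, n = len(edges), len(nodes)
    if m == 0:
        return set(nodes), 0.0
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    lo, hi, best = 0.0, float(m), set(nodes)
    while hi - lo >= 1.0 / (n * (n - 1)):   # distinct densities differ by >= 1/(n(n-1))
        g = (lo + hi) / 2.0
        cap = {}
        for u, v in edges:                  # each undirected edge, both directions
            cap[(u, v)] = cap[(v, u)] = 1.0
        for v in nodes:
            cap[('s', v)] = float(m)
            cap[(v, 't')] = m + 2.0 * g - deg[v]
        S = max_flow_min_cut(cap, 's', 't') - {'s'}
        if S:                               # some subgraph has density > g
            lo, best = g, S
        else:
            hi = g
    e_best = sum(1 for u, v in edges if u in best and v in best)
    return best, e_best / len(best)
```

On a K4 with one pendant vertex attached, this returns the K4 with density 6/4 = 1.5, rather than the whole graph (density 7/5 = 1.4).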
Bibliography
[1] Alok Aggarwal, Hiroshi Imai, Naoki Katoh, and Subhash Suri. Finding k points with minimum diameter and related problems. Journal of Algorithms, 12(1):38–56, 1991.
[2] Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das. DBXplorer: a system for keyword-based search over relational databases. In ICDE, pages 5–16, 2002.
[3] Esra Akbas and Peixiang Zhao. Truss-based community search: a truss-equivalence based indexing approach. PVLDB, 10(11):1298–1309, 2017.
[4] E. A. Akkoyunlu. The enumeration of maximal cliques of large graphs. SIAM Journal on Computing, 2(1):1–6, 1973.
[5] H. Aksu, M. Canim, Y. Chang, I. Korpeoglu, and O. Ulusoy. Distributed k-core view materialization and maintenance for large dynamic graphs. IEEE Transactions on Knowledge and Data Engineering, 26(10):2439–2452, 2014.
[6] R. Andersen. Finding large and small dense subgraphs. arXiv:cs/0702032, 2007.
[7] Reid Andersen and Kumar Chellapilla. Finding dense subgraphs with size bounds. In Algorithms and Models for the Web-Graph, pages 25–37. Springer Berlin Heidelberg, 2009.
[8] Vladimir Batagelj and Matjaz Zaversnik. An O(m) algorithm for cores decomposition of networks. CoRR, cs.DS/0310049, 2002.
[9] Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and Shashank Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, pages 431–440. IEEE, 2002.
[10] Coen Bron and Joep Kerbosch. Algorithm 457: Finding all cliques of an undirected graph. Commun. ACM, pages 575–577, 1973.
[11] Nicolas Bruno, Luis Gravano, Nick Koudas, and Divesh Srivastava. Navigation- vs. index-based XML multi-query processing. In ICDE, pages 139–150. IEEE, 2003.
[12] Guo-Ray Cai and Yu-Geng Sun. The minimum augmentation of any graph to a k-edge-connected graph. Networks, pages 151–172, 1989.
[13] Xin Cao, Gao Cong, Christian S. Jensen, and Beng Chin Ooi. Collective spatial keyword querying. In SIGMOD, pages 373–384. ACM, 2011.
[14] Lijun Chang, Jeffrey Xu Yu, and Lu Qin. Fast maximal cliques enumeration in sparse graphs. Algorithmica, 66(1):173–186, 2013.
[15] Moses Charikar. Greedy approximation algorithms for finding dense components in a graph. In Approximation Algorithms for Combinatorial Optimization, pages 84–95. Springer Berlin Heidelberg, 2000.
[16] P. Chen, C. Chou, and M. Chen. Distributed algorithms for k-truss decomposition. In 2014 IEEE International Conference on Big Data (Big Data), pages 471–480, 2014.
[17] Yu Chen, Jun Xu, and Minzheng Xu. Finding community structure in spatially constrained complex networks. International Journal of Geographical Information Science, 29(6):889–911, 2015.
[18] Chun-Hung Cheng, Ada Waichee Fu, and Yi Zhang. Entropy-based subspace clustering for mining numerical data. In SIGKDD, pages 84–93. ACM, 1999.
[19] Hong Cheng, Yang Zhou, Xin Huang, and Jeffrey Xu Yu. Clustering large attributed information networks: an efficient incremental computing approach. Data Mining and Knowledge Discovery, 25(3):450–477, 2012.
[20] J. Cheng, Y. Ke, S. Chu, and M. T. Ozsu. Efficient core decomposition in massive networks. In 2011 IEEE 27th International Conference on Data Engineering, pages 51–62, 2011.
[21] Boris V. Cherkassky and Andrew V. Goldberg. On implementing the push-relabel method for the maximum flow problem. Algorithmica, 19(4):390–410, 1997.
[22] Vladislav Chesnokov. Overlapping community detection in social networks with node attributes by neighborhood influence. In Models, Algorithms, and Technologies for Network Analysis, pages 187–203. Springer International Publishing, 2017.
[23] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. Friendship and mobility: User movement in location-based social networks. In SIGKDD, pages 1082–1090. ACM, 2011.
[24] Farhana M. Choudhury, J. Shane Culpepper, and Timos Sellis. Batch processing of top-k spatial-textual queries. In 2nd Intl. ACM Workshop on Managing and Mining Enriched Geo-Spatial Data, pages 7–12, 2015.
[25] Brent N. Clark, Charles J. Colbourn, and David S. Johnson. Unit disk graphs. Discrete Mathematics, 86(1):165–177, 1990.
[26] Jonathan Cohen. Trusses: Cohesive subgraphs for social network analysis. National Security Agency Technical Report, 16, 2008.
[27] Wanyun Cui, Yanghua Xiao, Haixun Wang, Yiqi Lu, and Wei Wang. Online search of overlapping communities. In SIGMOD, pages 277–288, 2013.
[28] Wanyun Cui, Yanghua Xiao, Haixun Wang, and Wei Wang. Local search of communities in large graphs. In SIGMOD, pages 991–1002, 2014.
[29] Ian De Felipe, Vagelis Hristidis, and Naphtali Rishe. Keyword search on spatial databases. In ICDE, pages 656–665. IEEE, 2008.
[30] Werner Dinkelbach. On nonlinear fractional programming. Management Science, 13(7):492–498, 1967.
[31] D. Eppstein, M. Loffler, and D. Strash. Listing all maximal cliques in sparse graphs in near-optimal time. ArXiv e-prints, 2010.
[32] P. Erdos and G. Szekeres. A combinatorial problem in geometry, pages 49–56. Birkhäuser Boston, 1987.
[33] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[34] Yixiang Fang, Reynold Cheng, Xiaodong Li, Siqiang Luo, and Jiafeng Hu. Effective community search over large spatial graphs. PVLDB, 10(6):709–720, 2017.
[35] Yixiang Fang, Reynold Cheng, Siqiang Luo, and Jiafeng Hu. Effective community search for large attributed graphs. PVLDB, 9(12):1233–1244, 2016.
[36] U. Feige, D. Peleg, and G. Kortsarz. The dense k-subgraph problem. Algorithmica, 29(3):410–421, 2001.
[37] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.
[38] Giorgio Gallo, Michael D. Grigoriadis, and Robert E. Tarjan. A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing, 18(1):30–55, 1989.
[39] Michelle Girvan and Mark E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 99(cond-mat/0112110):8271–8276, 2001.
[40] Andrew V. Goldberg and Robert E. Tarjan. A new approach to the maximum-flow problem. J. ACM, 35(4):921–940, 1988.
[41] Andrew V. Goldberg and Robert E. Tarjan. Efficient maximum flow algorithms. Commun. ACM, 57(8):82–89, 2014.
[42] A. V. Goldberg. Finding a maximum density subgraph. Technical report, 1984.
[43] Goetz Graefe and William J. McKenna. The Volcano optimizer generator: Extensibility and efficient search. In ICDE, pages 209–218. IEEE Computer Society, 1993.
[44] Diansheng Guo. Regionalization with dynamically constrained agglomerative clustering and partitioning (REDCAP). International Journal of Geographical Information Science, 22(7):801–823, 2008.
[45] Rajarshi Gupta, Jean Walrand, and Olivier Goldschmidt. Maximal cliques in unit disk graphs: Polynomial approximation. In INOC, 2005.
[46] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):19:1–19:19, 2015.
[47] Hao He, Haixun Wang, Jun Yang, and Philip S. Yu. BLINKS: ranked keyword searches on graphs. In SIGMOD, pages 305–316. ACM, 2007.
[48] J. E. Hopcroft and R. M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. In 12th Annual Symposium on Switching and Automata Theory (SWAT 1971), pages 122–125, 1971.
[49] Vagelis Hristidis and Yannis Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB, pages 670–681. VLDB Endowment, 2002.
[50] Jiafeng Hu, Xiaowei Wu, Reynold Cheng, Siqiang Luo, and Yixiang Fang. Querying minimal Steiner maximum-connected subgraphs in large graphs. In CIKM, pages 1241–1250. ACM, 2016.
[51] Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, and Jeffrey Xu Yu. Querying k-truss community in large and dynamic graphs. In SIGMOD, pages 1311–1322, 2014.
[52] Xin Huang, Hong Cheng, and Jeffrey Xu Yu. Dense community detection in multi-valued attributed networks. Inf. Sci., 314(C):77–99, 2015.
[53] Xin Huang and Laks V. S. Lakshmanan. Attribute-driven community search. PVLDB, 10(9):949–960, 2017.
[54] Xin Huang, Laks V. S. Lakshmanan, Jeffrey Xu Yu, and Hong Cheng. Approximate closest community search in networks. PVLDB, 9(4):276–287, 2015.
[55] Marie Jacob and Zachary Ives. Sharing work in keyword search over databases. In SIGMOD, pages 577–588. ACM, 2011.
[56] Jean-Claude Picard and Maurice Queyranne. A network flow solution to some nonlinear 0-1 programming problems, with applications to graph theory. Networks, 12(2):141–159, 1982.
[57] Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, and Hrishikesh Karambelkar. Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505–516. VLDB Endowment, 2005.
[58] Mehdi Kargar and Aijun An. Keyword search in graphs: Finding r-cliques. PVLDB, 4(10):681–692, 2011.
[59] Tarun Kathuria and S. Sudarshan. Efficient and provable multi-query optimization. In PODS, pages 53–67. ACM, 2017.
[60] S. Khot. Ruling out PTAS for graph min-bisection, dense k-subgraph, and bipartite clique. SIAM Journal on Computing, 36(4):1025–1071, 2006.
[61] Samir Khuller and Barna Saha. On finding dense subgraphs. In Proceedings of the 36th International Colloquium on Automata, Languages and Programming: Part I, ICALP '09, pages 597–608. Springer-Verlag, 2009.
[62] Andrea Lancichinetti, Santo Fortunato, and Janos Kertesz. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3):033015, 2009.
[63] Wangchao Le, Anastasios Kementsietsidis, Songyun Duan, and Feifei Li. Scalable multi-query optimization for SPARQL. In ICDE, pages 666–677. IEEE, 2012.
[64] J. J. Levandoski, M. Sarwat, A. Eldawy, and M. F. Mokbel. LARS: A location-aware recommender system. In ICDE, pages 450–461, 2012.
[65] Guoliang Li, Shuo Chen, Jianhua Feng, Kian-Lee Tan, and Wen-Syan Li. Efficient location-aware influence maximization. In SIGMOD, pages 87–98. ACM, 2014.
[66] Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, and Lizhu Zhou. EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD, pages 903–914. ACM, 2008.
[67] Jianxin Li, Chengfei Liu, and Md Saiful Islam. Keyword-based correlated network computation over large social media. In ICDE, pages 268–279, 2014.
[68] Jianxin Li, Xinjue Wang, Ke Deng, Xiaochun Yang, Timos Sellis, and Jeffrey Xu Yu. Most influential community search over large social networks. In ICDE, pages 871–882, 2017.
[69] Rong-Hua Li, Lu Qin, Fanghua Ye, Jeffrey Xu Yu, Xiaokui Xiao, Nong Xiao, and Zibin Zheng. Skyline community search in multi-valued networks. In SIGMOD, pages 457–472. ACM, 2018.
[70] Rong-Hua Li, Lu Qin, Jeffrey Xu Yu, and Rui Mao. Influential community search in large networks. PVLDB, 8(5):509–520, 2015.
[71] Rong-Hua Li, Jiao Su, Lu Qin, Jeffrey Xu Yu, and Qiangqiang Dai. Persistent community search in temporal networks. In ICDE, pages 797–808, 2018.
[72] Rong-Hua Li, Jeffrey Xu Yu, and Rui Mao. Efficient core maintenance in large dynamic graphs. IEEE Transactions on Knowledge and Data Engineering, 26:2453–2465, 2014.
[73] Yan Liu, Alexandru Niculescu-Mizil, and Wojciech Gryc. Topic-link LDA: Joint models of topic and author community. In ICML, pages 665–672. ACM, 2009.
[74] Ziyang Liu and Yi Chen. Identifying meaningful return information for XML keyword search. In SIGMOD, pages 329–340, 2007.
[75] Can Lu, Jeffrey Xu Yu, Hao Wei, and Yikai Zhang. Finding the maximum clique in massive graphs. PVLDB, 10(11):1538–1549, 2017.
[76] R. Duncan Luce and Albert D. Perry. A method of matrix analysis of group structure. Psychometrika, 14(2):95–116, 1949.
[77] Ramesh M. Nallapati, Amr Ahmed, Eric P. Xing, and William W. Cohen. Joint latent topic models for text and citations. In SIGKDD, pages 542–550. ACM, 2008.
[78] Mark E. J. Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.
[79] James B. Orlin. Max flows in O(nm) time, or better. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, pages 765–774. ACM, 2013.
[80] Lu Qin, Rong-Hua Li, Lijun Chang, and Chengqi Zhang. Locally densest subgraph discovery. In SIGKDD, pages 965–974, 2015.
[81] Lu Qin, J. X. Yu, Lijun Chang, and Yufei Tao. Querying communities in relational databases. In ICDE, pages 724–735, 2009.
[82] Lester R. Ford Jr. and Delbert R. Fulkerson. Maximal flow through a network, pages 243–248. Birkhäuser Boston, 1987.
[83] Mojtaba Rezvani, Weifa Liang, Chengfei Liu, and Jeffrey Xu Yu. Efficient detection of overlapping communities using asymmetric triangle cuts. TKDE, 2018.
[84] Prasan Roy, Srinivasan Seshadri, S. Sudarshan, and Siddhesh Bhobe. Efficient and extensible algorithms for multi query optimization. SIGMOD, 29(2):249–260, 2000.
[85] Yiye Ruan, David Fuhry, and Srinivasan Parthasarathy. Efficient community detection in large networks using content and links. In WWW, pages 1089–1098. ACM, 2013.
[86] Raman Samusevich, Maximilien Danisch, and Mauro Sozio. Local triangle-densest subgraphs. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 33–40. IEEE Press, 2016.
[87] Ahmet Erdem Sarıyuce, Bugra Gedik, Gabriela Jacques-Silva, Kun-Lung Wu, and Umit V. Catalyurek. Streaming algorithms for k-core decomposition. PVLDB, 6(6):433–444, 2013.
[88] M. Sarwat, J. J. Levandoski, A. Eldawy, and M. F. Mokbel. An efficient and scalable location-aware recommender system. TKDE, 26(6):1384–1399, 2014.
[89] Stephen B. Seidman. Network structure and minimum degree. Social Networks, 5(3):269–287, 1983.
[90] Timos Sellis and Subrata Ghosh. On the multiple-query optimization problem. TKDE, 2(2):262–266, 1990.
[91] Timos K. Sellis. Multiple-query optimization. ACM Trans. Database Syst., 13(1):23–52, 1988.
[92] Kyuseok Shim, Timos Sellis, and Dana Nau. Improvements on a heuristic algorithm for multiple-query optimization. Data & Knowledge Engineering, 12(2):197–222, 1994.
[93] Mauro Sozio and Aristides Gionis. The community-search problem and how to plan a successful cocktail party. In SIGKDD, pages 939–948, 2010.
[94] Taisuke Izumi and Daisuke Suzuki. Faster enumeration of all maximal cliques in unit disk graphs using geometric structure. IEICE Transactions on Information and Systems, E98.D(3):490–496, 2015.
[95] Nikolaj Tatti and Aristides Gionis. Density-friendly graph decomposition. In WWW, pages 1089–1099. International World Wide Web Conferences Steering Committee, 2015.
[96] Etsuji Tomita, Akira Tanaka, and Haruhisa Takahashi. The worst-case time complexity for generating all maximal cliques and computational experiments. Theoretical Computer Science, 363(1):28–42, 2006.
[97] Charalampos Tsourakakis. The k-clique densest subgraph problem. In WWW, pages 1122–1132. International World Wide Web Conferences Steering Committee, 2015.
[98] S. Vadhan. The complexity of counting in sparse, regular, and planar graphs. SIAM Journal on Computing, 31(2):398–427, 2001.
[99] Jia Wang and James Cheng. Truss decomposition in massive networks. PVLDB, 5(9):812–823, 2012.
[100] N. Wang, D. Yu, H. Jin, C. Qian, X. Xie, and Q. Hua. Parallel algorithm for core maintenance in dynamic graphs. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 2366–2371, 2017.
[101] Dong Wen, Lu Qin, Ying Zhang, Xuemin Lin, and Jeffrey Xu Yu. I/O efficient core graph decomposition at web scale. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 133–144, 2016.
[102] Peng Wu and Li Pan. Mining application-aware community organization with expanded feature subspaces from concerned attributes in social networks. Knowledge-Based Systems, 139:1–12, 2018.
[103] Yubao Wu, Ruoming Jin, Jing Li, and Xiang Zhang. Robust local community detection: On free rider effect and its elimination. PVLDB, 8(7):798–809, 2015.
[104] Yanghua Xiao, Wentao Wu, Jian Pei, Wei Wang, and Zhenying He. Efficiently indexing shortest paths by exploiting symmetry in graphs. In EDBT, pages 493–504, 2009.
[105] Yu Xu and Yannis Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In SIGMOD, pages 527–538. ACM, 2005.
[106] Zhiqiang Xu, Yiping Ke, Yi Wang, Hong Cheng, and James Cheng. A model-based approach to attributed graph clustering. In SIGMOD, pages 505–516. ACM, 2012.
[107] Liang Yao, Chengfei Liu, Jianxin Li, and Rui Zhou. Efficient computation of multiple XML keyword queries. In WISE, pages 368–381. Springer, 2013.
[108] Long Yuan, Lu Qin, Wenjie Zhang, Lijun Chang, and Jianye Yang. Index-based densest clique percolation community search in networks. TKDE, 2018.
[109] Fan Zhang, Ying Zhang, Lu Qin, Wenjie Zhang, and Xuemin Lin. When engagement meets similarity: efficient (k, r)-core computation on social networks. PVLDB, 10(10):998–1009, 2017.
[110] Yikai Zhang, Jeffrey Xu Yu, Ying Zhang, and Lu Qin. A fast order-based approach for core maintenance. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 337–348, 2017.
[111] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. Graph clustering based on structural/attribute similarities. PVLDB, 2(1):718–729, 2009.