Efficient Cohesive Subgraph Search in Big
Attributed Graph Data
by
Lu Chen
Faculty of Science, Engineering and Technology
in fulfilment of the requirements for the degree of
Doctor of Philosophy
at the
SWINBURNE UNIVERSITY OF TECHNOLOGY
December 2018
Acknowledgments
This thesis would not have been possible without the help, support and guidance of
important people in my life.
I would like to show my deepest appreciation to my coordinating supervisor Prof.
Chengfei Liu for his kindness, patience, and guidance. He encouraged and advised me
to pursue a Ph.D., and I will forever be grateful for his great advice. His guidance
greatly inspired and helped me throughout the research and writing of this thesis. I
greatly appreciate having had the chance to work with him. I would like to thank my
associate supervisor Dr. Jianxin Li for his inspiration and assistance. His determina-
tion and willpower affected me greatly, encouraging me to move forward when I got
stuck in research.
I would like to thank Dr. Rui Zhou, Prof. Xiaochun Yang, Assoc. Prof. Bin
Wang, and Assoc. Prof. Zhenying He. They helped me tremendously in learning how
to conduct in-depth research and how to present profound research ideas in clear and
simple language. In particular, Dr. Rui Zhou gave me invaluable comments and help
throughout my Ph.D. study.
I would like to thank and appreciate all members of our research group at Swin-
burne, who are Dr. Saiful Islam, Dr. Tarique Anwar, Dr. Musfique Anwar, Dr. Mehdi
Naseriparsa, Ahmed Alshammari, Afzal Azeem Chowdhary, Aman Abidi, Limeng
Zhang and Xiaofan Li.
I would like to thank my parents and parents-in-law for their unconditional support
and love, and for being with me on important steps of my life. They always stood
by me at any cost whenever I was in tough situations and encouraged me with their
loving spirit.
More importantly, I would like to thank my wife, Ms. Yingxian Zhang. She has
always been encouraging and always had the right words to keep me going when I was
discouraged. I could not have done it without her.
Last but not least, I would like to acknowledge Swinburne University of Technology
for providing funding, financial support, various facilities, and training that enabled
me to finish my Ph.D. research successfully.
Declaration
I, Lu Chen, declare that this thesis titled, “Efficient Cohesive Subgraph Search in Big
Attributed Graph Data” and the work presented in it are my own. I confirm that:
• This work was done wholly or mainly while in candidature for a research degree
at this University.
• Where any part of this thesis has previously been submitted for a degree or
any other qualification at this University or any other institution, this has been
clearly stated.
• Where I have consulted the published work of others, this is always clearly
attributed.
• Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have
made clear exactly what was done by others and what I have contributed myself.
Publications
• Lu Chen, Chengfei Liu, Rui Zhou, Jianxin Li, Xiaochun Yang, Bin Wang.
Maximum Co-located Community Search in Large Scale Social Networks. PVLDB
2018, CORE rank A*. (Chapter 3)
• Lu Chen, Chengfei Liu, Kewen Liao, Jianxin Li, Rui Zhou. Contextual Com-
munity Search over Large Social Networks. ICDE 2019, CORE rank A*. (Chap-
ter 4)
• Lu Chen, Chengfei Liu, Jianxin Li, Xiaochun Yang, Bin Wang, Rui Zhou.
Efficient Batch Processing for Multiple Keyword Queries on Graph Data. CIKM
2016, CORE rank A. (Chapter 5)
• Jianxin Li, Chengfei Liu, Lu Chen, Zhenying He, Amitava Datta, Feng Xia.
iTopic: Influential Topic Discovery from Information Networks via Keyword
Query. WWW 2017, Best Demo Award, CORE rank A*.
• Qiao Tian, Jianxin Li, Lu Chen, Ke Deng, Rong-hua Li, Mark Reynolds,
Chengfei Liu. Evidence-driven dubious decision making in online shopping.
WWWJ 2018, CORE rank A.
Abstract
Previous models for finding cohesive subgraphs have focused on graphs with no
attributes. However, these graphs provide only a partial representation of real graph
data and miss important attributes describing a variety of features of each vertex.
As such, real graph data are better modelled as attributed graphs. Investigations of
cohesive subgraph search in attributed graphs are still at a preliminary stage.
Searching cohesive subgraphs in an attributed graph can discover interesting
communities and find useful information for answering keyword queries. In this
thesis, several cohesive subgraph models considering spatial and textual attributes
are studied, which fit well into various real applications.
Firstly, the problem of k-truss search has been well defined and investigated for
finding highly correlated user groups in social networks. However, no previous study
has considered the constraint of users’ spatial information in k-truss search, which is
denoted as co-located community search in this thesis. Co-located communities can
serve many real applications. To search for the maximum co-located communities efficiently, we
first develop an efficient exact algorithm with several pruning techniques. We further
develop an approximation algorithm with adjustable accuracy guarantees and explore
more effective pruning rules, which can reduce the computational cost significantly.
To further improve real-time efficiency, we also devise a novel quadtree-based index
to support the efficient retrieval of users in a region and optimise the search regions
with regards to the given query region. We verify the performance of our proposed
algorithms and index using five real datasets.
Secondly, we propose a novel parameter-free community model, called the contextual
community, for searching a community in an attributed graph. The proposed model
only requires a query context providing a set of keywords describing the desired com-
munity context, where the returned community is designed to be both structure and
attribute cohesive w.r.t. the query context. We show that both exact and approximate
contextual communities can be searched in polynomial time. The best proposed
exact algorithm bounds the running time by a cubic factor or better using an elegant
parametric maximum flow technique. The proposed 1/3-approximation algorithm
significantly improves the search efficiency. In the experiments, we use six real networks
with ground-truth communities to evaluate the effectiveness of our contextual com-
munity model. Experimental results demonstrate that the proposed model can find
near ground-truth communities. We test both our exact and approximate algorithms
using twelve large real networks to demonstrate the high efficiency of the proposed
algorithms.
Thirdly, answering keyword queries on textual attributed graph data has drawn
a great deal of attention from the database community. However, most graph keyword
search solutions proposed so far primarily focus on a single query setting. We observe
that for a popular keyword query system, the number of keyword queries received
could be substantially large even in a short time interval, and the chance that these
queries share common keywords is quite high. Therefore, answering keyword queries
in batches would significantly enhance the performance of the system. Motivated
by this, this thesis studies efficient batch processing for multiple keyword queries on
graph data. Realising that finding both the optimal query plan for multiple queries
and the optimal query plan for a single keyword query on graph data is computationally
hard, we first propose two heuristic approaches, one maximising keyword overlap
and the other giving preference to keywords with short lists. Then
we devise a cardinality-based cost estimation model that takes both graph data
statistics and search semantics into account. Based on this model, we design an
A*-based algorithm to find the globally optimal execution plan for multiple queries. We evaluate
the proposed model and algorithms on two real datasets and the experimental results
demonstrate their efficacy.
Contents
List of Figures xiii
List of Tables xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Maximum Co-located Communities Search . . . . . . . . . . . 2
1.1.2 Contextual Community Search . . . . . . . . . . . . . . . . . 3
1.1.3 Batch Keyword Query Processing on Graph Data . . . . . . . 6
1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Maximum Co-located Communities Search . . . . . . . . . . . 7
1.2.2 Contextual Community Search . . . . . . . . . . . . . . . . . 8
1.2.3 Batch Keyword Query Processing on Graph Data . . . . . . . 10
1.3 Organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Literature Review 13
2.1 Cohesive Subgraph Models . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 k-truss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.3 Clique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.4 k-core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Community Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Community Discovery in Spatial Attributed Networks . . . . . 21
2.2.2 Community Discovery in Textual Attributed Networks . . . . 22
2.2.3 Community Discovery in Networks without Attributes . . . . 23
2.3 Keyword Search on Graph Data . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Keyword Search Semantics . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Batch Query Processing . . . . . . . . . . . . . . . . . . . . . . 25
3 Maximum Co-located Communities Search 27
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Finding Exact Results . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Baseline Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.2 Efficient (k,d)-MCC Search . . . . . . . . . . . . . . . . . . . 35
3.3.3 Prunings before (k,d)-MCCs Enumeration . . . . . . . . . . . 40
3.4 Finding Spatial Approximate Result . . . . . . . . . . . . . . . . . . 41
3.4.1 How to Approximate (k,d)-MCCs . . . . . . . . . . . . . . . . 41
3.4.2 Spatial Index and Search Bounds . . . . . . . . . . . . . . . . 43
3.4.3 Prunings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.4 Error-bounded Approximation Algorithm . . . . . . . . . . . . 47
3.4.5 Truss Attributed Quadtree Index . . . . . . . . . . . . . . . . 48
3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5.1 Efficiency Evaluation . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.2 Effectiveness Evaluation . . . . . . . . . . . . . . . . . . . . . 58
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4 Contextual Community Search 65
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.2 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.3 Why contextual community search . . . . . . . . . . . . . . . 72
4.2.4 Pre-prunings . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 A Flow Network Based Approach . . . . . . . . . . . . . . . . . . . . 73
4.3.1 Flow Network Preliminaries . . . . . . . . . . . . . . . . . . . 73
4.3.2 Algorithm Framework . . . . . . . . . . . . . . . . . . . . . . 75
4.3.3 Warm-up for Flow Network Construction . . . . . . . . . . . . 76
4.3.4 CC Auxiliary Flow Network . . . . . . . . . . . . . . . . . . . 78
4.3.5 Correctness and Time Complexity . . . . . . . . . . . . . . . . 82
4.4 An Improved Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.4.1 Optimization Framework . . . . . . . . . . . . . . . . . . . . . 85
4.4.2 Algorithm Correctness . . . . . . . . . . . . . . . . . . . . . . 85
4.4.3 Solving the Subproblem . . . . . . . . . . . . . . . . . . . . . 86
4.4.4 Analysing the Number of Iterations . . . . . . . . . . . . . . . 89
4.5 The Incremental Parametric Maximum Flow . . . . . . . . . . . . . . 90
4.5.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.5.2 Parametric Flow Framework . . . . . . . . . . . . . . . . . . . 92
4.5.3 CC Parametric Flow Network . . . . . . . . . . . . . . . . . . 93
4.5.4 Time Complexity . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.6 Approximation Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 95
4.7 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.7.1 Finding Large and Connected CC . . . . . . . . . . . . . . . . 98
4.7.2 State-of-the-art Maximum Flow Algorithms . . . . . . . . . . 98
4.8 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.8.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 99
4.8.2 Effectiveness Evaluation . . . . . . . . . . . . . . . . . . . . . 102
4.8.3 Efficiency Evaluation . . . . . . . . . . . . . . . . . . . . . . . 112
4.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5 Batch Keyword Query Processing on Graph Data 115
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.2 Preliminaries and Problem Definitions . . . . . . . . . . . . . . . . . 117
5.2.1 Keyword Query on Graph Data . . . . . . . . . . . . . . . . . 118
5.2.2 Batched Multiple-Keyword Queries . . . . . . . . . . . . . . . 118
5.3 Heuristic-based Approaches . . . . . . . . . . . . . . . . . . . . . . . 120
5.3.1 A Shortest List Eager Approach . . . . . . . . . . . . . . . . . 120
5.3.2 A Maximal Overlapping Driven Approach . . . . . . . . . . . 121
5.4 Cost Estimation for Query Plans . . . . . . . . . . . . . . . . . . . . 125
5.4.1 Cost of an r-Join . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.4.2 Estimating Cardinality of an r-Join Result . . . . . . . . . . . 126
5.5 Estimation-based Query Plans . . . . . . . . . . . . . . . . . . . . . . 127
5.5.1 Finding Optimal Solution based on Estimated Cost . . . . . . 128
5.5.2 Reducing Search Space . . . . . . . . . . . . . . . . . . . . . . 129
5.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.6.1 Datasets and Tested Queries . . . . . . . . . . . . . . . . . . . 131
5.6.2 Evaluation of the Efficiency . . . . . . . . . . . . . . . . . . . 132
5.6.3 Evaluation of Effectiveness . . . . . . . . . . . . . . . . . . . . 136
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6 Conclusion and Future Work 141
6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Bibliography 145
List of Figures
3-1 Spatial attributed graph . . . . . . . . . . . . . . . . . . . . . . . . . 29
3-2 Rectangular regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3-3 TQ-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3-4 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3-5 Effect of k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3-6 Effect of d . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3-7 Effect of search granularity . . . . . . . . . . . . . . . . . . . . . . . . 57
3-8 Exact algorithm pruning effectiveness . . . . . . . . . . . . . . . . . . 58
3-9 Effectiveness of pruning rules . . . . . . . . . . . . . . . . . . . . . . 59
3-10 Region pruning ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3-11 Approximation ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3-12 Density study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3-13 Case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4-1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4-2 Warm-up flow network illustrations . . . . . . . . . . . . . . . . . . . 76
4-3 F1 scores for facebook . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4-4 Effectiveness evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4-5 Attributed networks with no ground-truth . . . . . . . . . . . . . . . 105
4-6 Sensitivity w.r.t. query attribute size . . . . . . . . . . . . . . . . . . 105
4-7 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4-8 Scalability cont. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4-9 Effect of |Q| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4-10 Effect of |Q| cont. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5-1 An example graph G and the answer subgraphs to q1 in the subgraph
G′ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5-2 Query plans for single queries q1, q2, and batch multiple queries {q1, q2} 119
5-3 An example of processes in the algorithm Overlap . . . . . . . . . . 123
5-4 Scalability and speedup studies . . . . . . . . . . . . . . . . . . . . . 133
5-5 Efficiency of multiple queries . . . . . . . . . . . . . . . . . . . . . . . 134
5-6 Accuracy of cardinality estimation . . . . . . . . . . . . . . . . . . . . 137
5-7 Pruning effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
List of Tables
3.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2 Maximal cliques contained in Figure 3-1(c) . . . . . . . . . . . . . . . 34
3.3 Enumeration trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4 Truss ids and union of truss-to-vertex descriptions . . . . . . . . . . . 48
3.5 Description files for a branch . . . . . . . . . . . . . . . . . . . . . . 49
3.6 Implemented algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.7 Statistic information in datasets . . . . . . . . . . . . . . . . . . . . . 51
3.8 Parameter settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.9 TQ-tree construction . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Statistic information of datasets . . . . . . . . . . . . . . . . . . . . . 100
4.2 Implemented algorithm for different community models . . . . . . . . 101
5.1 Keyword sets for DBLP and IMDB . . . . . . . . . . . . . . . . . . . 132
Chapter 1
Introduction
A cohesive subgraph, one of the fundamental components of a graph, is a group of
densely connected vertices that can be used to discover meaningful information from
graph data. The best-known applications of finding cohesive subgraphs are community
search or detection, and keyword search. Models for finding cohesive subgraphs
have previously focused on graphs with no attributes. However, these graphs provide
only a partial representation of real graph data and miss important attributes
describing a variety of features of each vertex, which means that real graph data are
better modelled as attributed graphs. Although plenty of cohesive subgraph models
have been proposed and extensively studied, effective cohesive subgraph models
considering various graph attributes remain to be studied in earnest.
In this thesis, we explore three cohesive subgraph models: (1) k-truss; (2) density;
and (3) r-cliques, each combined with one of two types of attributes: (1) textual
attributes; and (2) spatial attributes. This leads to three interesting yet challenging
problems on attributed graph data: (1) discovering communities with cohesiveness
in both structure and geo-location; (2) searching communities with cohesiveness in
both structure and keyword attributes; and (3) efficiently finding answers to keyword
queries on attributed graph data.
Section 1.1 presents the background, motivations, research gaps to be filled, and the
main ideas of our approaches to the problems introduced above. Section 1.2 presents
the principal contributions for each of the problems studied in this thesis. The
organisation of the thesis is presented in Section 1.3.
1.1 Motivation
1.1.1 Maximum Co-located Communities Search
With the increasing popularity of online social networks, one of the most important
tasks in social network data analytics is to find communities of users with close
structural connections to each other. The extensive studies on finding communities can
be categorised into global community detection GCD (e.g., [37, 39, 44, 78, 33, 17]),
local community detection LCD (e.g., [28, 103]), global community search GCS (e.g.,
[70, 68, 80]), and local community search LCS (e.g., [35, 93, 28, 27, 51, 34, 108,
50]). Community detection methods are often used to discover communities in social
networks based on predefined implicit criteria, e.g., modularity [37]. The main
difference between GCD and LCD is that every user is equally important in GCD,
while the importance of a user depends on its relevance to the given query vertex in
LCD. Different from community detection, community search methods concentrate
on finding communities in social networks based on user-specified explicit criteria,
e.g., the parameter k in the k-core based model [89], the k-truss based model [26],
and the k-edge-connected component based model [12]. Analogously, the major
difference between GCS and LCS is that LCS requires the communities to contain
the given query vertex, whereas GCS has no such additional requirement. However,
most of the works above did not consider the
effect of users’ spatial information in their community detection or search methods.
Searching communities with social and spatial cohesiveness is of great impor-
tance in many applications, e.g., event scheduling, product recommendation, targeted
advertisement, local activism and advocacy, as well as more effective content spread-
ing like shop promotions, local news, and job openings. Although spatial features are
highly desirable in applications, in practice, existing studies on spatial social
communities are still limited. In [34], Fang et al. require all the vertices of a returned
k-core community to lie in a minimum covering circle with the smallest radius, and
the resultant community must contain the given query vertex. So it is a type of LCS with a
spatial constraint. In [33, 17], Expert et al. and Chen et al. take into account spatial
information in the process of community detection by weighting links based on the
spatial distance between two linked users. This is a type of GCD with a spatial constraint.
However, the two types of work cannot guarantee the spatial closeness of the commu-
nity members, which will be further discussed in our experiments. In [109], Zhang et
al. require all the vertices of a returned k-core based community to meet similarity
constraints, where the similarity could be distance similarity. However, finding the
exact result for this community model is expensive in large-scale social networks due
to its NP-hardness.
Therefore, in this thesis, we investigate the co-located community search problem
that reveals the maximum communities with high social and spatial cohesiveness,
denoted as (k,d)-MCCs search. The social cohesiveness is defined using the minimum
truss value k [26] and the spatial cohesiveness is parameterised by a user-specified
distance value d. As such, our proposed (k,d)-MCCs search problem allows users
to easily control the quality of the resultant communities, which also fills the research
gap of GCS with a spatial constraint.
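To make the two cohesiveness requirements concrete, the following sketch verifies whether a candidate subgraph satisfies both the truss constraint k and the pairwise distance threshold d. It is illustrative only: the vertex coordinates, the Euclidean distance measure, and the function names are assumptions, and Chapter 3 develops far more efficient algorithms than this brute-force test.

```python
import itertools
import math

def is_k_truss(vertices, edges, k):
    """Social cohesiveness: every edge of the subgraph must be
    contained in at least k-2 triangles (the k-truss condition)."""
    vs = set(vertices)
    es = {frozenset(e) for e in edges if set(e) <= vs}
    adj = {v: set() for v in vs}
    for u, w in map(tuple, es):
        adj[u].add(w)
        adj[w].add(u)
    return all(len(adj[u] & adj[w]) >= k - 2 for u, w in map(tuple, es))

def is_d_close(coords, d):
    """Spatial cohesiveness: every pair of members must be within
    distance d of each other."""
    return all(math.dist(coords[u], coords[v]) <= d
               for u, v in itertools.combinations(coords, 2))
```

A candidate subgraph qualifies as a (k,d) co-located community only when both checks pass; the maximum such subgraphs are the (k,d)-MCCs.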
1.1.2 Contextual Community Search
As real social network data are complicated, i.e., users are generally profiled
with attributes, a social network is better modelled as an attributed graph in which
vertices are attached with descriptive attributes, such as keywords describing various
user properties. For this reason, searching communities in attributed graphs has
become popular with the invention of new community models, queries and search
algorithms. Since an attributed community search method does not need to explore
all vertices [35, 53], reducing the search space by orders of magnitude, it makes online
community search practicable, which is ideal for many applications. However, most
existing attributed community search methods do not support community search
given only the context information.
Often, an application or user would simply like to search for the community that is
most relevant to its provided context information, without knowing what the community
looks like or who the community members actually are. In this thesis, we propose a
novel parameter-free community model, namely the contextual community that only
requires a query to provide a set of keywords describing an application/user context
(e.g. location and preference). As such, users can search for desired communities without
detailed information about them. This is in contrast to existing community search models
where additionally a set of known query vertices as well as community cohesiveness
parameters (e.g. k as the minimum vertex degree) are required. But still, the returned
contextual community is designed to be both structure and context cohesive.
Structure cohesiveness. Given the query context, the most popular cohesive mea-
surements like k-core and k-truss are often unsuitable. On one hand, there exists an
inherent contradiction that a larger k value may imply a smaller community to be
found that is potentially insufficient to match the provided context. On the other
hand, while considering more about the context match, vertices (edges) may very
likely fail to meet the minimum number requirement of neighbours (common neigh-
bours) of k-core (k-truss). Moreover, imposing the k bound on the community can
be deemed to be inflexible and deciding the best k that satisfies both context and
structure requirements is challenging. Therefore, for seeking a proper contextual com-
munity we instead adopt the notion of relative subgraph density that is parameter-free
and relative in the number of considered motifs/units and the number of their induced
vertices. The search goal is then reduced to finding a subgraph having the highest
density. However, as shown in [97] if the considered motifs or community signatures
are simply edges, the found community might be large and not absolutely cohesive.
On the other hand, [97] shows that triangle motifs would be better signatures to
produce a truly cohesive subgraph, but in this case the size of the returned community
might drop dramatically, thereby reducing the desirable vertex coverage. To alleviate
such shortcomings, we instead account for both involved triangle and edge motifs as a
unified density measure to suitably explain the structure cohesiveness of a contextual
community.
Context cohesiveness. As discussed previously, in real applications, it would be
desirable that, by simply accepting a set of keywords about the desired community
context, a community search system is able to find a community that is highly relevant
to the provided context. Intuitively, this translates to the idea that vertices with
attributes close to the context shall be considered as members of the desired community
in an attributed graph. However, overemphasising the correlation between found vertex attributes and
the query context may cause the search to result in a small and loosely connected
subgraph. This is actually against the structural requirement of being a community,
and instead corresponds to another popular research topic, keyword search [9, 47, 57, 66, 81, 58].
To avoid such a pessimistic situation, we can relax context cohesiveness by tolerating
community vertices that are themselves less relevant to the query context but in-
stead their surrounding vertices are more relevant. As shown in Section 4.2.2, this
relaxation is naturally achieved with triangle and edge (the aforementioned subgraph
density motif) contextual scores/weights aggregated from nodal contextual relevance.
Notice that our weighted motif (triangle and edge) measure ensures relaxed but strong
internal context cohesiveness in a community since all the involved units are matched
against the query context.
Contextual community. Based on the desiderata of contextual community search,
we propose a weighted density based contextual community model. First, a contex-
tual weight is assigned to each small motif of a subgraph. It measures the context
prevalence of a motif. Then, the context-weighted density of a subgraph is calculated
as the aggregated weight over all motifs divided by the total number of vertices
in the subgraph. Finally, the subgraph with the highest contextual weighted density
w.r.t. the query context is returned as the best community. Notice that the intuition
behind our contextual community model is: every designated community member
should be involved in many structurally overlapped edge and triangle motifs which
are themselves prevalent in the specified query context. In real life, these units are
analogous to mutual friendships and family circles.
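The computation above can be sketched as follows, where the weight of a motif is taken to be the sum of its members' contextual relevance scores. This weighting is an assumption made for illustration; the precise motif weighting used by the model is defined in Section 4.2.2.

```python
import itertools

def contextual_density(vertices, edges, relevance):
    """Context-weighted density of a subgraph: aggregated edge- and
    triangle-motif weights divided by the number of vertices."""
    vs = set(vertices)
    es = {frozenset(e) for e in edges if set(e) <= vs}
    adj = {v: set() for v in vs}
    for u, w in map(tuple, es):
        adj[u].add(w)
        adj[w].add(u)
    # each motif's weight aggregates the contextual relevance of its members
    edge_w = sum(relevance[u] + relevance[w] for u, w in map(tuple, es))
    tri_w = sum(relevance[a] + relevance[b] + relevance[c]
                for a, b, c in itertools.combinations(sorted(vs), 3)
                if b in adj[a] and c in adj[a] and c in adj[b])
    return (edge_w + tri_w) / len(vs)
```

The contextual community is then the subgraph maximising this quantity: vertices irrelevant to the query context contribute little motif weight while still inflating the denominator, so they are naturally excluded.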
Advantages and applications. There are a number of advantages which contex-
tual community search can offer. First, contextual community search significantly
simplifies the search parameter space. As such, users do not need to commit to any
cohesiveness parameter in their search. Second, contextual community search only
explores parts of the social network that are relevant to the query context. This
significantly reduces the solution search space and makes online community search
possible. Also note that communities found by contextual community search can be
multifarious. For instance, if a query context covers most attributes of a ground truth
community, then the contextual search finds a near-ground-truth result. On the other
hand, in the case when a query context covers attributes across multiple ground truth
communities, the search returns a community that cannot be found by most exist-
ing methods. These situations are also evident from our experimental findings. For
this kind of flexibility, contextual community search is suitable for broader applica-
tions including the existing ones such as event scheduling, product recommendation,
targeted advertisement, activism and advocacy that leverage the query context.
1.1.3 Batch Keyword Query Processing on Graph Data
Keyword search has been extensively studied in the field of databases, e.g., relational
databases [2, 49], XML databases [74, 105], and graph databases [9, 47, 49, 57, 58, 66, 67, 81],
as well as spatial databases [13, 29]. However, all the above existing work focuses on
single query processing in keyword search. They design their algorithms and indices
based on the performance of answering single keyword queries. Such single-query-based
techniques are insufficient for supporting real query processing systems, for
several reasons. Normally, a query processing system should support multiple types
of users. For example, beyond general users, a third-party company as an important
data consumer may perform significant analysis and mining of the underlying data in
order to optimise their business by issuing a group of queries as a batch query. Here,
the third-party company may be an industry sector collecting data of interest
from online databases, or a researcher comparing scientific results from scientific
databases. In all these cases, the batch of queries issued by the third-party company
is used to mine information from the databases and to optimise their business
or targeted benefits. It is also important that such a query processing system is
designed with the goal of returning results in fractions of a second for a large number of
queries to be received in a very short time. Recently, domain-specific search engines
have been widely used as they provide specific and profound information that well
satisfies users’ search intentions. Usually, the underlying data is highly structured,
and in most cases, is represented as attributed graphs. We observe that for a popular
domain-specific search engine, the number of keyword queries received could be sub-
stantially large even in a short time interval, and the chance that these queries share
common keywords is quite high. Therefore, answering keyword queries in batches
would significantly enhance the performance of domain-specific search engines. How-
ever, most graph keyword search solutions proposed so far primarily focus on a single
query setting.
In this thesis, we study the problem of batch processing of keyword queries on
graph data. Our goal is to process a set of keyword queries as a batch while minimising
the total time cost of answering the set of queries.
1.2 Contribution
1.2.1 Maximum Co-located Communities Search
Given a social network G and two parameters k and d, a straightforward approach
is to enumerate all possible subgraphs in G meeting the minimum truss value k, where
the number of subgraphs could be as large as O(2^n). It then filters out candidates
containing a node pair whose distance exceeds the spatial closeness threshold d. The
time complexity of this approach is therefore at least O(2^n), where n is the number of
vertices in G. Obviously, it is infeasible to use this approach to support online
(k,d)-MCCs search, particularly for large-scale social networks. Thus, we propose
efficient algorithms that achieve real-time response with theoretical guarantees.
To address the challenge of efficiency, we first develop an exact (k,d)-MCCs search
algorithm by proposing novel pruning techniques. During the search, we explore
techniques that significantly prune the search space by considering upper-bound-based
early termination, a heuristic search order, and conditions for reusing pruning
computation. Before searching, we also propose pre-pruning techniques for reducing the
magnitude of the input data. To design polynomial algorithms, we develop a novel
approximation scheme with spatial accuracy guarantees. Note that our proposed
approximation scheme can provide adjustable spatial error ratios based on the user's
requirement on spatial accuracy. To further improve the performance of the
approximation algorithm, we propose more pruning techniques and also design a novel
index, the TQ-tree. The main contributions of our work are summarised as follows.
• We propose a novel co-located community model and formally define the (k,d)-
MCCs search problem.
• We develop an efficient exact algorithm for finding (k,d)-MCCs by proposing
effective techniques for pruning before and during the search.
• We also develop a spatial approximation algorithm that offers a variable spatial
error ratio ranging from 2√2 + ε to √2 + ε′. The efficiency of the approximation
algorithm is further improved by proposing more effective pruning techniques
and a novel TQ-tree index.
• We conduct extensive experimental studies on five real datasets to demonstrate
the efficiency and effectiveness of the proposed algorithms.
1.2.2 Contextual Community Search
Whether there exists an efficient algorithm for searching contextual communities was
previously unknown. Although our proposed community model is based on the weighted
densest subgraph, existing exact and approximation algorithms running in polynomial
time only work on separate density functions, i.e., the weighted/unweighted degree
density function or the unweighted triangle density function. For our more complicated
contextual community search, building on the theoretical frameworks of flow networks
and approximation algorithms, we confirm that efficient algorithms do exist, both in
theory and in practice.
More precisely, given a graph G and a set of query attributes Q, our first approach
carefully constructs a flow network N that guides the community search. Together
with binary search probing, this approach runs in total time O(|V(N)|³ log(|V(G)|)),
where V(·) and E(·) denote vertex and edge sets respectively and N is the constructed
flow network. By formulating contextual community search as an optimisation problem,
we then construct a different flow network N with parameters that enable a monotonic
search for the contextual community. Along this second approach, we manage to avoid a
pitfall implementation running in O(|V(G)||V(N)|³) by using an elegant parametric
maximum flow technique. This technique eventually reduces the runtime to O(|V(N)|³)
or better. Note that the aforementioned runtime complexities are worst-case guarantees,
while in practice the runtimes are greatly reduced by taking the query context into
consideration. To achieve further runtime scalability, we also propose a fast
1/3-approximation algorithm. The algorithm runs in time
O(|V(G)| log(|V(G)|) + |E(G)| log(|V(G)|) + |Tri(G)|) with simple degree and triangle
indices. Overall, the main contributions of our work are summarised as follows:
• We propose and study a novel and useful contextual community (CC) search
problem.
• Two network flow based exact algorithms are designed to solve CC search in
polynomial time.
• An approximate solution is proposed and analysed (with an approximation ratio
of 1/3), which significantly improves over the runtimes of the exact algorithms.
• Our empirical studies on real social network datasets demonstrate the superior
effectiveness of CC search methods under different query contexts. Extensive
performance evaluations also reveal the superb practical efficiency of the pro-
posed CC search algorithms.
1.2.3 Batch Keyword Query Processing on Graph Data
Although batch query processing has been studied extensively [91, 84, 55, 49, 24, 63,
11], we observe that none of the existing techniques can be applied to our problem of
batch keyword query processing on graph data. The main reasons are as follows.
(1) Meaningful Result Semantics: r-clique can well define the semantics of keyword
search on graph data, as an r-clique discovers the tightest relations among all the
given keywords in a query [58], but no existing work studies batch query processing
under this meaningful result semantics. (2) Complexity of Optimal Batch Processing:
optimally processing multiple keyword queries in a batch is NP-complete. This is
because each single query corresponds to several query plans, and obviously we cannot
enumerate all possible combinations of single query plans to get the optimal query
plan for multiple queries. (3) Unavailable Historic Query Information: unlike the batch
query processing in [107], we do not assume that the result sizes of subqueries are
known before the queries are actually run, because this kind of historic information
is not always available.
Although we could simply evaluate the batch queries in a pre-defined order and
re-use the intermediate results in subsequent rounds as much as possible, there is
no guarantee that the batch queries would be run optimally. To address this, we first
develop two heuristic approaches that give preference to processing keywords with
small result sizes and that maximise keyword overlaps. Then we devise a
cardinality-estimation cost model that considers graph connectivity and the result
semantics of r-clique. Based on the cost model, we develop an optimal batch query plan
by extending the A* search algorithm. Since A* search in the worst case degenerates to
exhaustive search, which enumerates all possible global plans, we propose pruning
methods that efficiently prune the search space to obtain the model-based optimal
query plan.
We make the following contributions in this work:
• We propose and study a new problem of batch keyword query processing on
native graph data, which is widely used in modern data analytics and
management systems.
• We formalise the proposed problem, which is NP-complete. To address it, we
develop two heuristic solutions by considering the features of batch keyword
query processing.
• To optimally run the batched queries, we devise an estimation-based cost model
to assess the computational cost of possible sub-queries, which is then used to
identify the optimal plan of the batch query evaluation.
• We conduct extensive experiments on the DBLP and IMDB datasets to evaluate the
efficiency of the proposed algorithms and verify the precision of the cost model.
1.3 Organisation
The remainder of this thesis is organised as follows:
• Chapter 2 introduces the related work on cohesive subgraph models, commu-
nity search, spatial cohesiveness models, keyword search and multiple query
processing.
• Chapter 3 presents the co-located community search problem, the corresponding
algorithms and the experimental results.
• Chapter 4 presents the contextual community search problem, the correspond-
ing algorithms and the experimental results.
• Chapter 5 presents the batch keyword query processing problem on attributed
graph data, the corresponding algorithms and the experimental results.
• Chapter 6 concludes our research and provides the possible extension of this
thesis and other unexplored areas as future research direction.
Chapter 2
Literature Review
The cohesive subgraph search problem has been studied extensively. In the context
of cohesive subgraph search in attributed graph data, studies are recent and limited.
Finding cohesive subgraphs in attributed graphs is closely related to research
problems including community finding and keyword search. In this chapter, we
conduct a detailed literature review of works related to the problems studied in
this thesis. We first review cohesive subgraph models in Section 2.1, covering the
popular models and the algorithm development path for each of them. Next, works on
two important applications of cohesive subgraph search, i.e., community discovery and
keyword search, are discussed in detail. Specifically, Section 2.2 presents works on
community discovery, focusing mainly on works that find communities considering
different attributes, while Section 2.3 presents works related to keyword search on
graph data, including works proposing different search semantics as well as works
dealing with batch queries.
2.1 Cohesive Subgraph Models
There are a number of cohesive subgraph models in the literature dedicated to
various scenarios. A cohesive subgraph is defined through predefined cohesiveness
measurements: a subgraph satisfying certain cohesiveness measurements is considered
cohesive. Some popular cohesive subgraph models closely related to this thesis are
introduced as follows.
2.1.1 k-truss
The concept of the k-truss subgraph was proposed by Cohen in [26]. A k-truss is defined
as a non-trivial connected subgraph such that every edge in the subgraph has no fewer
than k-2 common neighbours, where the non-trivial constraint excludes isolated
vertices. A k-truss is maximal if it is not a subgraph of another k-truss.
A k-truss also guarantees that the subgraph remains connected if fewer than
k-2 edges are removed from it. The truss number of an edge is the
largest k such that the edge is in a k-truss.
Truss decomposition computes the truss number for each edge in a graph.
The first truss decomposition algorithm was proposed in [26]. The major idea is that,
for every possible k starting from k = 2, the algorithm iteratively deletes any edge
with no more than k-2 common neighbours in the residual graph until no such edge
remains (at which point all maximal k-trusses have been computed), and then increases
k by 1 and repeats the process until the remaining graph becomes empty. The algorithm
uses a queue to store edges having no more than k-2 common neighbours for the current
k. As k increases, edges with a high number of common neighbours are revisited
repeatedly to check whether they should be moved to the queue, which lowers the
efficiency and makes the time complexity of the algorithm O(∑_{v∈V(G)} deg(v)²).
Observing this shortcoming, Wang et al. [99] first sort the edges in non-descending
order of their number of common neighbours, and then iteratively remove the edge that
has the minimum number of common neighbours and has fewer than k-2 common neighbours
in the remaining graph. After each removal, since each of the edges induced by the
common neighbours of the removed edge loses only one common neighbour, Wang et al.
adopt a bucket-sort-based technique in which each affected edge is moved, in constant
time, to an appropriate position in the sorted order, so that the remaining edges stay
in non-descending order of their number of common neighbours in the remaining graph.
As a result, the time complexity of truss decomposition is improved to O(|E(G)|^{3/2}).
In addition, Wang et al. [99] also study an I/O-efficient truss decomposition
algorithm for the case where the input graph cannot fit in main memory.
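The peeling strategy described above can be made concrete with a short sketch. The Python implementation below is our own illustrative version of support-based edge peeling (the function name and edge-tuple representation are assumptions; it does not reproduce the exact bucket-sort bookkeeping of Wang et al.):

```python
from collections import defaultdict

def truss_decomposition(edges):
    """Support-peeling truss decomposition: returns the truss number of each edge."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # support(e) = number of triangles containing edge e
    support = {}
    for u, v in edges:
        e = (min(u, v), max(u, v))
        support[e] = len(adj[u] & adj[v])
    truss = {}
    k = 2
    while support:
        # peel every edge whose support is at most k - 2
        removable = [e for e, s in support.items() if s <= k - 2]
        while removable:
            u, v = removable.pop()
            e = (min(u, v), max(u, v))
            if e not in support:        # already peeled earlier in this round
                continue
            truss[e] = k
            # each common neighbour w loses the triangle {u, v, w}
            for w in adj[u] & adj[v]:
                for f in ((min(u, w), max(u, w)), (min(v, w), max(v, w))):
                    if f in support:
                        support[f] -= 1
                        if support[f] <= k - 2:
                            removable.append(f)
            del support[e]
            adj[u].discard(v)
            adj[v].discard(u)
        k += 1
    return truss
```

For example, on a triangle with one pendant edge, the three triangle edges obtain truss number 3 while the pendant edge obtains 2.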
Distributed truss decomposition algorithms are studied in [16]. The authors first
design algorithms based on an existing MapReduce triangle listing algorithm. They then
propose an algorithm based on a graph-parallel abstraction, which significantly reduces
I/O overhead and improves performance. In particular, Shao et al. study a distributed
algorithm for maximal k-truss detection, which, unlike truss decomposition, focuses on
finding a maximal k-truss for a specific k. They construct a triangle-complete
subgraph for each computation node in the distributed system, show that each
triangle-complete subgraph can be used to find a local k-truss in parallel, and prove
that the union of the local k-trusses is exactly the global k-truss.
2.1.2 Density
In graph theory, there are several ways to define the density of a subgraph.
Among the various density formulations, the most famous is degree density, which
measures the average degree of the vertices in a subgraph. Recently, triangle density
[97], which measures the average number of triangles the vertices of a subgraph are
involved in, was proposed to find denser subgraphs compared to degree density.
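Concretely, the two density functions can be written in a few lines. The helpers below are our own illustrative sketch (here degree density is defined as |E(S)|/|V(S)|, so the average degree equals twice this value):

```python
from itertools import combinations

def degree_density(vertices, edges):
    """Degree density |E(S)| / |V(S)| of the subgraph induced by `vertices`."""
    vs = set(vertices)
    es = [(u, v) for u, v in edges if u in vs and v in vs]
    return len(es) / len(vs)

def triangle_density(vertices, edges):
    """Triangle density |T(S)| / |V(S)| of the induced subgraph."""
    vs = set(vertices)
    adj = {v: set() for v in vs}
    for u, v in edges:
        if u in vs and v in vs:
            adj[u].add(v)
            adj[v].add(u)
    # count induced triangles by checking every vertex triple
    triangles = sum(1 for a, b, c in combinations(sorted(vs), 3)
                    if b in adj[a] and c in adj[a] and c in adj[b])
    return triangles / len(vs)
```

For a single triangle, the degree density is 1 (average degree 2) and the triangle density is 1/3.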
One of the fundamental problems for graph density is to find a subgraph maximising a
given density function, known as finding a maximum density subgraph. The first
algorithm solving the maximum degree density problem was proposed in [56]. The authors
reduce the maximum degree density problem to a 0-1 non-linear fractional programming
problem. They then adopt the fractional programming solver proposed in [30], in which
a fractional programming problem is solved by finding the optimal results of a set of
problems related to the original fractional programming problem. The algorithm
proposed in [56] is bounded by |V(G)| min s-t cut computations, because the set of
problems related to the fractional programming problem answering the maximum degree
density problem can be reduced to the min s-t cut problem. Later on, Goldberg
discovered that the maximum degree density problem can be reduced to a series of
minimum capacity cut computations by applying flow network techniques. Different from
the algorithm in [56], Goldberg proposes a carefully designed directed flow network
based on the original graph. His algorithm iteratively guesses the optimum density
following a binary search convention. The min s-t cut of the carefully designed
directed flow network guides the direction of the binary search and makes the guess
converge to the optimum density. Once the optimum density is determined, the maximum
degree density subgraph can be derived from the source side S of the last min s-t cut
satisfying S \ {s} ≠ ∅, where s is the source vertex of the flow network. This
algorithm is bounded by log(|V(G)|) min s-t cut computations. The algorithms discussed
above all treat the min s-t cut algorithm as a black box. As the min s-t cut problem
has been studied extensively, the best algorithm, to the best of our knowledge, runs
in O(|V(N)||E(N)|), where N denotes the flow network. Observing that the series of min
s-t cut problems for finding the maximum degree density subgraph are highly
correlated, Gallo et al. [38] name such problems parametric flow network problems and
propose an algorithm that can be used to solve the maximum degree density subgraph
problem. The algorithm has a time complexity of O(|V(N)||E(N)| log(|V(N)|²/|E(N)|)).
Their algorithm for the maximum degree density subgraph problem is based on the
framework used in [30]. The trick is their observation that if the series of min s-t
cut problems is solved using the push-relabel algorithm [41], the cuts can be computed
incrementally. They prove that, by taking advantage of this reuse, the overall time
cost is equivalent to solving a single min s-t cut problem with the push-relabel
algorithm. Solving the maximum degree density problem approximately has also drawn a
great deal of attention. In [15], a 2-approximation scheme is proposed; the algorithm
was later improved to a time complexity of O(|V(G)| + |E(G)|) by Khuller et al. [61].
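The greedy 2-approximation just mentioned admits a compact sketch: repeatedly peel a minimum-degree vertex and remember the densest intermediate subgraph. The version below is our own illustration (the function name and the lazy-deletion heap are assumptions, not the cited authors' code):

```python
import heapq

def densest_subgraph_2approx(vertices, edges):
    """Greedy peeling 2-approximation for maximum degree density."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(vertices)
    m = len(edges)
    heap = [(len(adj[v]), v) for v in alive]
    heapq.heapify(heap)
    best_density, best_size = m / len(alive), len(alive)
    removal_order = []
    while len(alive) > 1:
        # pop a vertex of current minimum degree (stale entries are skipped)
        while True:
            d, v = heapq.heappop(heap)
            if v in alive and d == len(adj[v]):
                break
        alive.discard(v)
        removal_order.append(v)
        m -= len(adj[v])
        for w in adj[v]:
            adj[w].discard(v)
            heapq.heappush(heap, (len(adj[w]), w))
        adj[v] = set()
        density = m / len(alive)
        if density > best_density:
            best_density, best_size = density, len(alive)
    # the best subgraph consists of the vertices not yet removed at that point
    removed = len(vertices) - best_size
    best_vertices = set(vertices) - set(removal_order[:removed])
    return best_vertices, best_density
```

For instance, on a 4-clique with one pendant vertex, the peeling removes the pendant first and reports the 4-clique with density 6/4.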
The discussed techniques were adapted to solve the maximum density subgraph
problem in directed graphs in [61]. Recently, Tsourakakis [97] also showed that the
maximum triangle density subgraph problem can be solved using the discussed frameworks
for the maximum density subgraph problem. Although these frameworks can be applied to
problems with different density functions in [61] and [97], the adaptations are
challenging and non-trivial.
If we add a size constraint to the maximum degree density subgraph problem, the
new problem becomes intractable. The densest at-least-size-l [7], densest
at-most-size-l [7], and densest size-l subgraph problems [36, 60] are all NP-complete,
which implies that the problems of finding maximal and minimal maximum degree density
subgraphs are NP-hard. As such, approximation algorithms running in polynomial time
have been proposed. The densest at-least-size-l subgraph problem has a 3-approximation
algorithm proposed in [7] and a 2-approximation algorithm proposed by Andersen [6].
On the other hand, the densest at-most-size-l and densest size-l subgraph problems are
unlikely to admit error-bounded approximation algorithms running in polynomial time [7].
Most recently, the concept of the locally dense subgraph has drawn attention
[80, 86, 95]. A locally dense subgraph in general combines the degree (triangle)
density with a constraint on the minimum number of edges (triangles) each vertex is
involved in. A subgraph is called locally dense if its degree (triangle) density and
its minimum degree (triangle) constraint both exceed a given threshold.
2.1.3 Clique
2.1.3.1 Cliques in graphs
The clique, as the most cohesive subgraph, was first introduced by Erdős et al. [32],
and the term clique was coined by Luce et al. [76]. Given a graph, a clique is a
complete subgraph in which every pair of vertices is joined by an edge. A clique is
maximal if no super-graph of it is also a complete graph. A clique is maximum if no
other complete subgraph has a larger size. The definition of a clique is strict. Luce
introduced the concept of the r-clique to relax the pairwise structural distance,
defined by the length of the shortest path between two vertices, from 1 to a given
integer r; i.e., an r-clique is a subgraph in which the shortest path between any two
vertices is no greater than r.
The maximal clique enumeration problem has been studied extensively. The most popular
algorithms are based on a backtracking framework proposed in [10]. A major
optimisation used for clique enumeration is proposed in [4]. The optimisation is based
on the observation that the vertices of the maximal cliques containing a vertex must
be neighbours of that vertex. As such, during the enumeration, clique enumeration on a
whole graph can be divided into a set of small, result-disjoint problems, which can be
solved separately. The question of how to divide the problem so that clique
enumeration achieves optimal running time has been answered in [96]. The authors
propose a greedy strategy in which, at each recursion state, they always divide the
current problem into the fewest number of result-disjoint subproblems. They also prove
that this strategy makes the algorithm proposed in [10] run in O(3^{|V(G)|/3}), which
is worst-case optimal, since a graph with |V(G)| vertices can have up to 3^{|V(G)|/3}
maximal cliques.
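The backtracking framework of [10], combined with the greedy pivoting strategy of [96], can be sketched as follows (an illustrative Python version with an adjacency-dictionary interface of our own choosing):

```python
def bron_kerbosch_pivot(adj):
    """Enumerate all maximal cliques via backtracking with pivoting.

    R is the growing clique, P the candidates, X the already-explored vertices.
    """
    cliques = []

    def expand(R, P, X):
        if not P and not X:
            cliques.append(R)          # R cannot be extended: maximal clique
            return
        # pivot u: vertex of P ∪ X covering the most candidates,
        # so that the fewest branches remain
        u = max(P | X, key=lambda w: len(adj[w] & P))
        for v in list(P - adj[u]):     # only branch on non-neighbours of u
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(adj), set())
    return cliques
```

On a triangle {0, 1, 2} with an extra edge (2, 3), the algorithm reports exactly the two maximal cliques {0, 1, 2} and {2, 3}.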
Recently, Wang et al. proposed a new clique enumeration algorithm based on computing
a summary consisting of a set of redundancy-aware maximal cliques. Since computing
redundancy-aware maximal cliques is efficient and these cliques can be extended to the
maximal cliques of the graph, the enumeration algorithm based on them is shown to be
more effective than backtracking-based clique enumeration.
In addition, maximal clique enumeration in sparse graphs is studied in [31], where
Eppstein et al. propose an optimisation that greedily selects vertices of lowest
degeneracy, making clique enumeration in a sparse graph run in O(d(|V(G)| − d)3^{d/3}),
where d is the degeneracy of the graph. Besides, a truly polynomial-delay maximal
clique enumeration algorithm for sparse graphs is proposed by Chang et al. [14].
2.1.3.2 Cliques in unit disk graph
A unit disk graph (UDG) is a set of vertices embedded in 2D space, where any two
vertices are joined by an edge if their Euclidean distance is no greater than a given
distance threshold. A clique in a UDG therefore consists of vertices that all have
pairwise spatial distance no greater than the given threshold.
Interestingly, finding a maximum clique in a UDG is polynomially solvable [1, 25],
for the following major reasons. Firstly, given two vertices joined by an edge in a
UDG, the vertices of a maximum clique containing this edge must be located in the
intersection of the two circles centred at the two vertices with radius equal to the
given distance threshold. Secondly, the subgraph induced by the two vertices and the
vertices in the intersection is the complement of a bipartite graph. Thirdly, finding
a maximum independent set in this bipartite graph is equivalent to finding a maximum
clique in the induced subgraph, and a maximum independent set in a bipartite graph can
be found in polynomial time, by Kőnig's theorem, using the Hopcroft–Karp algorithm
[48]. As such, by trying all edges, a maximum clique in a UDG can be found in
O(|E(G)|) times the complexity of finding a maximum independent set in a bipartite
graph.
Although a maximum clique in a UDG is polynomially solvable, the hardness of finding
all maximal cliques in a UDG is unknown and remains an open problem. Gupta et al. [45]
report that the total number of maximal cliques could grow exponentially with the size
of a UDG and propose polynomial algorithms that generate near-maximal cliques. Exact
maximal clique enumeration for the neighbourhood graph is also studied in [94], where
the algorithm is still based on [10] but uses a new problem division strategy
exploiting geometric properties.
2.1.4 k-core
A maximal connected subgraph in which every vertex has degree at least k is called a
k-core [89]. Regarding the k-core, there are two classic problems known as core
decomposition and core maintenance. Finding the k-cores of a graph for all possible k
is known as core decomposition. The well-known O(|E(G)| + |V(G)|) in-memory core
decomposition algorithm progressively removes vertices of minimum degree while
efficiently maintaining the vertices in non-increasing order of their most recent
degrees [8]. I/O-efficient core decomposition has been studied in [101, 20] for
massive graphs that cannot be held in main memory. When a graph is dynamically
updated, incrementally computing the new core numbers of the affected vertices is
known as core maintenance, which has been studied extensively in [5, 72, 87, 110, 100].
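The peeling idea behind core decomposition can be sketched as below. For clarity, this version rescans the buckets for the minimum degree; the implementation of [8] instead maintains a sorted vertex array to achieve the stated O(|E(G)| + |V(G)|) bound. Names are our own:

```python
def core_decomposition(adj):
    """Peeling core decomposition: returns the core number of each vertex."""
    degree = {v: len(ns) for v, ns in adj.items()}
    # buckets[d] holds the unprocessed vertices whose current degree is d
    max_deg = max(degree.values(), default=0)
    buckets = [set() for _ in range(max_deg + 1)]
    for v, d in degree.items():
        buckets[d].add(v)
    core = {}
    k = 0
    for _ in range(len(adj)):
        # take a vertex of minimum current degree; core numbers never decrease
        d = min(i for i, b in enumerate(buckets) if b)
        k = max(k, d)
        v = buckets[d].pop()
        core[v] = k
        for w in adj[v]:
            if w not in core:          # w is still unprocessed: degree drops by 1
                buckets[degree[w]].discard(w)
                degree[w] -= 1
                buckets[degree[w]].add(w)
    return core
```

On a 4-clique with one pendant vertex, the pendant receives core number 1 and the clique vertices receive 3.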
By removing the maximality constraint of the k-core, we obtain the concept of the
minimum-k-degree-constrained subgraph, denoted the δ(k)-subgraph. The δ(k)-subgraph
was first introduced for community search in [93] and has since been widely used for
modelling communities, e.g., in [70, 71, 69, 109]. Its major advantages are as
follows. Firstly, when modelling communities, the meaning of δ(k) is intuitive to
interpret: every member of the community has at least k friends. Secondly,
δ(k)-subgraphs are easy to compute, which can be done in linear time w.r.t. the size
of the input graph.
In addition to community modelling, the δ(k)-subgraph (or k-core) has nice properties
for other applications as well. Firstly, a clique of size k, if one exists, must be
contained in a δ(k−1)-subgraph. This property can be used to compute an upper bound on
the clique size of a graph, or to prune the graph when finding size-k cliques
[109, 75]. Secondly, the k-core can be used to approximate the densest subgraph by
degree density with an approximation factor of 2 [95].
2.2 Community Discovery
Community discovery, which has been extensively studied, is an important application
of cohesive subgraph search. Most existing community discovery works can be
categorised into community detection and community search. Community detection in
general has no explicit search criteria from the user, whereas community search finds
communities based on certain given criteria. In the following, we first introduce
community discovery works in spatial attributed and textual attributed networks, and
then discuss other community discovery works according to this categorisation.
2.2.1 Community Discovery in Spatial Attributed Networks
Community discovery in spatial attributed networks aims to find communities whose
members are densely connected and spatially close.
Community detection. In [17], Chen et al. modify the objective function of the fast
modularity maximisation algorithm (known as CNM) to make the detected communities
sensitive to spatial distance. More precisely, each edge (u, v) in a graph G is
assigned a distance decay, generating a weighted matrix. Applying the CNM algorithm to
the weighted matrix, they maximise geo-modularity (previously, modularity was
maximised on an adjacency matrix). Such a spatial community model only captures
spatial distance information over socially adjacent vertices. In reality, non-adjacent
vertices in communities detected by this model could be spatially very far from each
other; thus, the spatial feature captured by the model is limited and the quality of
the communities is compromised. In [33], a similar model to [17] is used; however,
rather than maximising modularity, the authors predefine a score for a size-l
community and evaluate the quality of a detected size-l community by the difference
between its real score and the predefined score. Such ranking somewhat mitigates the
shortcoming of the model proposed in [17]. In [44], a two-step method is devised to
cluster vertices, considering both the spatial and structural features of a large
network. First, vertices are clustered under contiguity constraints, resulting in a
spatially contiguous tree. Next, the tree is further partitioned to generate
structurally dense subgraphs.
Community search. Fang et al. [34] propose a community model with three constraints:
1) the community contains the query vertex; 2) all vertices in the community lie
within an enclosing spatial circle; 3) the community is structurally a connected
k-core. Their algorithms are designed to find a community that not only meets the
model constraints but also minimises the diameter of the enclosing circle. The
shortcoming of the model is that the smallest connected k-core containing the query
vertex may be inherently spatially sparse, and the model is not applicable to querying
communities with a set of input vertices.
In this thesis, we consider users' spatial information in k-truss search and propose
a novel community model, named the co-located community. It can unveil maximum-size
communities given certain spatial and social cohesiveness thresholds.
2.2.2 Community Discovery in Textual Attributed Networks
Community discovery in textual attributed networks aims to find communities that
are structurally cohesive and textually correlated.
Community detection. Several works [77, 73] consider the LDA topic model together
with graph structure to detect attributed communities. In [77], the proposed
Pairwise-Link-LDA can be adapted to detect communities in attributed graphs by
replacing directed edges with undirected edges. In [73], Liu et al. propose a refined
LDA model, merging a graphical model into Topic-LDA, which can be used to detect
attributed communities. Unified distance is also considered for detecting attributed
communities. In [111], attributed communities are detected using proposed
structural/attribute clustering methods, in which structural distance is unified via
attribute-weighted edges. Cheng et al. [19] propose a better algorithm for detecting
communities using the method in [111]. Other attributed community detection methods
are as follows. Xu et al. [106] propose a Bayesian-based model. Ruan et al. [85]
propose an attributed community detection method that links edges and content, filters
out edges that are loosely connected from a content perspective, and partitions the
remaining graph into attributed communities. Huang et al. [52] propose a method based
on an entropy-based model [18]. In [22], attributed communities are detected by
finding structurally cohesive subgraphs whose common attributes exceed a threshold and
merging the detected cohesive subgraphs if necessary. Recently, Wu et al. propose an
attributed community model [102] based on an attribute-refined fitness model [62].
Community search. Li et al. [67] propose keyword-based correlated network computation
over large social media. They first find small r-cliques containing the query
keywords, and merge small r-cliques if their mutual similarities are greater than a
threshold. In [53], a community model sensitive to textual information is proposed.
Given a set of keywords, a set of query vertices, an integer k that measures
structural cohesiveness, and an integer d that measures communication cost (with the
same definition as [53]), the model is defined as follows: 1) the community contains
all query vertices; 2) the community is a connected k-truss with query distance no
greater than d; 3) the community is textually most related to the set of keywords.
In [109], the authors find (k, r)-core communities such that, socially, the vertices
form a k-core and, from the similarity perspective, every pairwise vertex similarity
is greater than a threshold r. Recently, Li et al. propose a skyline community model
for searching communities in attributed graphs [69]. Their model intends to find
communities with diversified attributes.
In this thesis, we will propose a contextual community (CC) model. Compared to
existing models, CC is designed to be a general framework. Firstly, CC employs
subgraph density as a parameter-free cohesiveness measure, so that novice users need
not specify k as in k-core, k-truss, etc. Secondly, CC finds multifarious communities.
If the user-provided query context is close to the attributes contained in a
ground-truth community, CC search indeed finds near-ground-truth communities. Further,
if the query covers attributes in multiple ground-truth communities, unlike other
models that prioritise structural cohesiveness first and context match second, which
might return nothing (for a somewhat large k) or community members mostly irrelevant
to the query context (after lowering k), CC simultaneously models structural
cohesiveness with subgraph density and context match with scores/weights when
computing weighted density, so that CC search flexibly finds the community that is
most cohesive relative to the query context.
2.2.3 Community Discovery in Networks without Attributes
Community detection. A number of community detection methods have been
surveyed in [37]. Mancoridis et al. propose a community detection method based
on graph partitioning, whose objective is to maximise the difference between the
internal edge ratio and the inter-cluster edge ratio. In [39, 78], Girvan et al. use
betweenness to detect community structure: they find the edges that lie most between
communities and progressively remove such edges from the remaining graph until no
such edge remains. Rezvani et al. [83] propose a fitness-metric-based objective function
and find communities maximising the fitness metric.
Community search. Li et al. [70] consider the k-clique as the structural cohesiveness
metric and rank the cliques by an outer influence score. In [68], k-core is used to model
the social cohesiveness and an internal influence score is used to rank the communities.
In general, the goal of local community search is to find a community that contains
vertices near a set of query vertices. In [51, 3], maximal triangle-connected k-trusses
containing a query vertex are considered as communities. In [54], a community model
on a query vertex set is defined as follows: (1) the community contains all vertices
in the query vertex set; (2) the community is structurally a connected k-truss;
(3) the longest shortest path from a vertex not in the query vertex set to the query
vertex set is minimised. Cui et al. [28] search for local optimal communities modelled
as connected subgraphs, containing the query vertices and maximising the minimum
degree of their vertices. A k-clique based model is proposed in [108], in which a
community is defined as a maximal k-clique adjacency-connected subgraph, named a
k-clique percolation community. They study the problem of searching for the densest
clique percolation community with the maximum k containing all query vertices.
Adapting the minimal Steiner tree subgraph, Hu et al. [50] propose algorithms for
searching a community that is a minimal connected Steiner tree containing all query
vertices while maximising the cardinality.
2.3 Keyword Search on Graph Data
Keyword search has been extensively studied in the field of databases, especially on
graph data. Existing works can be categorised into: (1) proposing search semantics
while finding results for individual queries, and (2) efficiently answering multiple
queries in batch.
2.3.1 Keyword Search Semantics
The existing approaches aim at finding either Steiner tree based answers or subgraph
based answers. Steiner tree based answers [9, 47, 57] generate trees that cover all
the search keywords, and the weight of a result tree is defined as the total weight of
the edges in the tree. Under this semantics, finding the result tree with the smallest
weight is a well-known NP-complete problem. The graph-based methods generate
subgraphs such as r-radius graph [66], r-community [81] and r-cliques [58]. In an r-
radius graph, there exists a central node that can reach all the nodes containing search
keywords whose distance is less than r. In an r-community, there are some centre
nodes. There exists at least one path between each centre node and each content
node such that the distance is less than r. Different from r-radius and r-community,
the r-clique semantics studied in this thesis is more compact and does not require
the existence of a central node. It refers to a set of graph nodes which contain the
search keywords such that, between any two nodes that contain keywords, we can
find a path with a distance less than r.
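As a minimal illustration (a sketch, not code from the cited works), the r-clique condition reduces to a pairwise distance check; `dist` is an assumed shortest-path distance function supplied by the caller:

```python
# Illustrative r-clique check: a set of keyword-containing nodes is an
# r-clique if every pair of them is within distance r (no central node
# is required). `is_r_clique` and `dist` are hypothetical names.
from itertools import combinations

def is_r_clique(nodes, dist, r):
    """True iff every pair of content nodes is connected within distance < r."""
    return all(dist(u, v) < r for u, v in combinations(nodes, 2))

# Toy pairwise distances between three keyword nodes:
d = {frozenset({'a', 'b'}): 2, frozenset({'b', 'c'}): 3, frozenset({'a', 'c'}): 4}
dist = lambda u, v: d[frozenset({u, v})]
print(is_r_clique(['a', 'b', 'c'], dist, r=5))  # -> True
print(is_r_clique(['a', 'b', 'c'], dist, r=4))  # -> False, dist(a, c) = 4 is not < 4
```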
Nevertheless, all of the existing works focus on single query processing rather than
multiple query processing. Although the indices proposed in [47, 58, 66, 81] can be
used for more than one query, the evaluation techniques are exclusively designed for
one query at a time. In other words, none of the existing works has studied utilising
possibly reusable computations over multiple keyword queries and leveraging shared
computations to improve the performance of multiple keyword query evaluation at
run time. In this thesis, we study these problems.
2.3.2 Batch Query Processing
On XML data, Yao et al. [107] propose a log-based query evaluation algorithm to find
the optimal plan to compute multiple keyword queries under SLCA semantics [105].
Recently, multiple keyword query optimisation over relational databases (rather than
native graphs) has also been studied [55]. This work assumes all the keyword queries
have been transformed into candidate networks (which are similar to SQL query
plans), and multiple SQL query optimisation techniques are then used, i.e., common
SQL query operations (or subqueries) in the candidate networks are considered.
Multiple query optimisation in databases. On relational data, multiple SQL
query optimisation has been studied in the early works [84, 91, 92], whose main focus
is to smartly handle shared operations among SQL queries. These works decompose
complex SQL queries into subqueries and consider reusing common subqueries based
on cost analysis. Recently, Kathuria et al. [59] propose an approximation algorithm
within the Volcano optimisation framework [43], solving the multiple SQL query
optimisation problem with theoretical guarantees.
Different from the above works, our problem focuses on native graph data, where
data does not have to be stored in relational tables and query results are modelled
using the r-clique semantics. The solutions of the previous works cannot be applied
to our problem, because both the data model and the query semantics are different.
Chapter 3
Maximum Co-located
Communities Search
The problem of k-truss search has been well defined and investigated for finding
highly correlated user groups in social networks. However, no previous study has
considered the constraint of users' spatial information in k-truss search, denoted as
co-located community search in this chapter. Co-located communities can serve many
real applications. To search for the maximum co-located communities efficiently, we
first develop an efficient exact algorithm with several pruning techniques. After that,
we further develop an approximation algorithm with adjustable accuracy guarantees
and explore more effective pruning rules, which can reduce the computational cost
significantly. To accelerate real-time efficiency, we also devise a novel quadtree-based
index to support the efficient retrieval of users in a region and to optimise the search
regions with regard to the given query region. Finally, we verify the performance of
our proposed algorithms and index using five real datasets.
Chapter map. In Section 3.1, we give an overall introduction to the problem of
maximum co-located community search. Section 3.2 presents the proposed co-located
community model and formally defines the maximum co-located community search
problem. Section 3.3 presents the baseline, efficient exact algorithms and effective
techniques for pruning before and during the search. Section 3.4 presents a spa-
tial approximation algorithm that offers a variable spatial error ratio ranging from
27
2√
2 + ε to√
2 + ε′(ε and +ε′ are constant error factors), the proposed effective prun-
ing techniques, and index further speeding up the spatial approximation algorithm.
Experimental results are shown in Section 3.5, followed by the chapter summary in
Section 3.6.
3.1 Introduction
We study the co-located community search problem that reveals the maximum
communities with high social and spatial cohesiveness, denoted as (k,d)-MCCs search.
The social cohesiveness is defined using the minimum truss value k [26] and the spatial
cohesiveness is parameterised by a user-specified distance value d. As such, our
proposed (k,d)-MCCs search problem allows users to easily affirm the quality of
the resultant communities, which also fills the research gap on the type of GCS
with spatial constraints.
Given a social network G and two parameters k and d, a straightforward approach
is to enumerate all possible subgraphs of G meeting the minimum truss value k, where
the number of subgraphs could be as large as O(2^n). It then filters out the candidates
having a node pair whose distance exceeds the spatial closeness threshold d. The
time complexity of this approach is therefore at least O(2^n), where n is the number
of vertices in G. Obviously, it is infeasible to use this approach to support online
(k,d)-MCCs search, particularly for large-scale social networks. Thus, this chapter
focuses on devising efficient algorithms that achieve real-time response with
theoretical guarantees.
To address the efficiency challenge, we first develop an exact (k,d)-MCCs search
algorithm by proposing novel pruning techniques. During the search, we explore
techniques that prune the search space significantly by considering upper-bound-based
early termination, a heuristic search order, and conditions for reusing pruning
computation. Before searching, we also propose pre-pruning techniques that reduce
the magnitude of the input data. To design polynomial algorithms, we develop a novel
approximation scheme with spatial accuracy guarantees. Note that our proposed
approximation scheme can provide adjustable spatial error ratios based on the user's
requirement on the spatial
[Figure 3-1 (image): three panels over vertices a–u — (a) Graph data; (b) Spatial DIST; (c) Spatial network.]
Figure 3-1: Spatial attributed graph
accuracy. To further improve the performance of the approximation algorithm, we
propose more pruning techniques and also design the novel index TQ-tree.
3.2 Problem Definition
We consider a social network graph G = (V,E), an undirected graph with vertex
set V(G) and edge set E(G), where vertices represent social users and edges denote
their friendships. Each vertex v ∈ V(G) has a spatial attribute (v.x, v.y), where v.x
and v.y denote its spatial positions along the x- and y-axes in a two-dimensional
space.
Co-located community. A co-located community is a subgraph J ⊆ G satisfying:
(1) connectivity: J is connected, (2) structural cohesiveness: all vertices in J are
connected intensively, and (3) spatial cohesiveness: all vertices in J are spatially
close to each other.
Structural cohesiveness. We consider truss as the metric to measure the structural
cohesiveness of a co-located community. Truss is based on the number of triangles
in which each edge of a graph is involved. Given J, we denote a triangle involving
vertices u, v, w ∈ V(J) as △uvw. The support of an edge e(u, v) ∈ E(J), denoted by
sup(e, J), is the number of triangles containing e, i.e., sup(e, J) = |{△uvw : w ∈
N(v, J) ∩ N(u, J)}|, where N(v, J) and N(u, J) are the neighbours of v and u in J,
respectively. Next, we define the truss of a co-located community J as follows:
Definition 3.2.1. Subgraph truss. The truss of J ⊆ G, where |V (J)| ≥ 2, is the
minimum support of an edge in J plus 2, i.e., τ(J) = 2 + min_{e∈E(J)} {sup(e, J)}.
J is a connected k-truss if it is both connected and τ(J) ≥ k. Intuitively, a k-truss
is a subgraph in which the endpoints of each edge (u, v) have at least k − 2 common
neighbours. A k-truss with a large value of k indicates strong internal connections
among members. In a k-truss, each node has degree at least k − 1, implying that a
k-truss must be a (k − 1)-core. A connected k-truss is also (k − 1)-edge-connected.
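As a small illustration of these definitions (a plain-Python sketch, not code from this thesis; `support` and `truss` are illustrative names), the edge support and the truss number τ(J) of Definition 3.2.1 can be computed directly from an adjacency map:

```python
# Sketch: computing sup(e, J) and tau(J) on a small undirected graph
# represented as a dict of neighbour sets.

def support(adj, u, v):
    """Number of triangles containing edge (u, v): common neighbours of u and v."""
    return len(adj[u] & adj[v])

def truss(adj):
    """tau(J) = 2 + minimum edge support over all edges of J (|V(J)| >= 2)."""
    edges = {(u, v) for u in adj for v in adj[u] if u < v}
    return 2 + min(support(adj, u, v) for u, v in edges)

# A 4-clique: every edge lies in 2 triangles, so tau = 2 + 2 = 4.
adj = {'a': {'b', 'c', 'd'}, 'b': {'a', 'c', 'd'},
       'c': {'a', 'b', 'd'}, 'd': {'a', 'b', 'c'}}
print(truss(adj))  # -> 4
```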
Spatial cohesiveness. Let ed(u, v) denote the spatial distance between vertices u
and v. We first introduce the concept of spatial co-location to measure the spatial
cohesiveness. Then we define co-located community formally.
Definition 3.2.2. Spatial co-location. Given a distance
threshold d, a subgraph J ⊆ G is a spatial co-location graph if for every pair u, v ∈
V (J), ed(u, v) ≤ d holds.
Definition 3.2.3. Co-located community. Given a graph G, a positive integer
k, and a spatial distance d, J is a co-located community, if J satisfies the following
constraints:
• Structural cohesiveness. J is connected, τ(J) ≥ k.
• Spatial cohesiveness. J is a spatial co-location graph w.r.t. a spatial distance
d.
In general, when searching a community, users may want to maximise the mem-
bers contained in the community once they fix the spatial and social cohesiveness
parameters. Therefore, in this chapter, given a graph G, we study finding the max-
imum co-located communities, denoted as (k,d)-MCCs where k stands for k-truss, d
for spatial distance, M for maximum and CC for co-located community. Now we
formally define the problem of (k,d)-MCCs search.
Problem 3.2.1. (k,d)-MCCs search. Given a graph G, positive integer k and
number d, return any of those maximum co-located communities J ⊆ G, satisfying
constraints:
• J is a co-located community.
• There is no other co-located community J′ such that |V(J′)| > |V(J)|.
For example, in Figure 3-1(b), vertices in dark blue coloured areas are co-located.
Similarly, in Figure 3-1(a), three possible co-located communities are in blue coloured
areas with k = 4. The (4,d)-MCC here is the subgraph containing vertices {d, e, f, g, h, i}
with cardinality 6, as it is the maximum.
We may find (k,d)-MCCs from G by inspecting the whole graph. However, to
improve the search performance, we only want to search the parts of G that may
contain (k,d)-MCCs. To achieve that, we introduce the theorem below:
Theorem 3.2.1. (k,d)-MCCs of a graph G can be found from one of the maximal
connected k-trusses of G if they exist.
The proof is trivial since vertices that are not part of a maximal connected k-truss
clearly cannot meet the structural cohesiveness requirement in Definition 3.2.3.
By Theorem 3.2.1, the intuitive steps to find (k,d)-MCCs in G are: (1) compute
the maximal connected k-trusses (note: these k-trusses are non-overlapping), (2)
search for the local (k,d)-MCCs in each of these k-trusses, and (3) find the global
(k,d)-MCCs from the local ones by comparing cardinalities.
Analysis. A k-truss index can be built in O(|E(G)|^{3/2}) time for a graph G. The
k-truss index for G is essentially a list of edges associated with their edge trusses,
defined by τ(e, G) = max_{H⊆G ∧ e∈E(H)} {τ(H)} [51]. With the k-truss index, given
a k, we can retrieve all maximal connected k-trusses in G in polynomial time. However,
it is still challenging to find local (k,d)-MCCs within a maximal connected k-truss
because: (1) the total number of spatial co-location subgraphs in the k-truss could be
exponential [45], and (2) there is no guaranteed monotonic relationship between the
size of a co-location subgraph and the size of its co-located communities.

From now on, we focus on finding (k,d)-MCCs in a maximal connected k-truss T ⊆
G. The notations used frequently in the following sections are summarised in Table 3.1.
Table 3.1: Notations

Notation     | Definition
T            | initially a maximal connected k-truss graph
T′           | spatial neighbourhood network for T
T′0          | a connected component of T′, T′0 ⊆ T′
u, v, w      | individual vertices
ed(u, v)     | spatial distance between u and v
gd(u, v, T)  | distance between u and v in T
deg(u, G)    | degree of u in G
N(u, G)      | neighbours of u in G
τ(G)         | the minimum truss of G
A, A         | a maximal clique, a set of maximal cliques
R, P, X      | vertex sets
T(R), T′(R)  | subgraphs of T and T′ induced by the vertices in R
c            | a square spatial cell with width w
m, M         | a landmark cell and a set of landmark cells
r            | a square spatial region consisting of cells
ζ            | an integer, denoting a number of cells
Vr           | the set of vertices located in a region r
p            | an error-bounded search bound region
K(r)         | the k-truss in a region r
3.3 Finding Exact Results
We first introduce a definition as follows:
Definition 3.3.1. Spatial neighbourhood network. Given a T and a distance
d, a spatial neighbourhood network for T is a graph T ′, which is an undirected graph
with V (T ′)=V (T ) and E(T ′)={(u, v)|ed(u, v) < d ∧ u, v ∈ V (T )}.
Finding a (k,d)-MCC is equivalent to finding an unextendable vertex set R such
that the R-induced subgraph T ′(R) of T ′ is a clique while the R-induced subgraph
T (R) of T contains a connected k-truss GR = (R,ER) where ER ⊆ E(T (R)).
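Definition 3.3.1 can be sketched directly (an illustrative fragment assuming 2-D Euclidean coordinates; `spatial_network` is a hypothetical name, not from this thesis):

```python
# Sketch of building the spatial neighbourhood network T' from vertex
# coordinates and a distance threshold d, following Definition 3.3.1:
# V(T') = V(T) and (u, v) is an edge iff ed(u, v) < d.
from itertools import combinations
from math import hypot

def spatial_network(coords, d):
    adj = {v: set() for v in coords}
    for u, v in combinations(coords, 2):
        (ux, uy), (vx, vy) = coords[u], coords[v]
        if hypot(ux - vx, uy - vy) < d:  # ed(u, v) < d, as in Definition 3.3.1
            adj[u].add(v)
            adj[v].add(u)
    return adj

coords = {'a': (0, 0), 'b': (1, 0), 'c': (5, 5)}
tp = spatial_network(coords, d=2.0)
print(tp)  # 'a' and 'b' are linked; 'c' is spatially isolated
```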
Next, we show the baseline algorithm to find (k,d)-MCCs.
3.3.1 Baseline Algorithm
Given T and T′, the baseline algorithm finds all the maximal cliques contained
in T′ and checks the sorted maximal cliques one by one. For each maximal clique
A (the set of vertices in a maximal clique), we compute the local (k,d)-MCCs in
T(A). After all maximal cliques have been checked, we compare the cardinalities of
the local (k,d)-MCCs and obtain the global (k,d)-MCCs.
Algorithm 1: baseline(T, T′, b = 0)
1:  R ← mccSearch(T, T′, b);
2:  return R;
3:  Procedure mccSearch(T, T′, b)
4:    A ← bkp(∅, V(T′), ∅);
5:    sort A in descending order by clique cardinality;
6:    for each A ∈ A do
7:      if |A| > b then
8:        R′ ← maximum k-trusses in T(A);
9:        b ← kdmccCollect(R, R′, b);
10: Procedure kdmccCollect(R, R′, b)    /* R′[0]: the first element of R′ */
11:   if |V(R′[0])| == b then
12:     collect R′ into R;
13:   if |V(R′[0])| > b then
14:     b ← |V(R′[0])|;
15:     replace R by R′;
Baseline algorithm. The baseline algorithm is presented in Algorithm 1. It ensures
correctness by giving every maximal clique A ∈ A a chance. To improve search
efficiency, Algorithm 1 uses a heuristic rule and a bound to prune small maximal
spatial cliques. The heuristic rule assumes that the larger the size of a spatial clique,
the larger the size of the contained (k,d)-MCCs may be; it is implemented by sorting
the generated maximal cliques (line 5). The bound b is initialised to 0 and is
continuously updated to the maximum size of the (k,d)-MCCs found so far. A
maximal clique is pruned if its size is no larger than b.
Collect candidate results. In Algorithm 1, the procedure kdmccCollect collects
candidate results. It checks the maximality of the currently found (k,d)-MCCs in R′
and determines whether they should be added to the previously found results in R,
replace R, or be discarded (lines 11 to 15 in Algorithm 1). During the process, the
bound b is updated if necessary.
Avoid duplication. Since (k,d)-MCCs are contained in spatial-clique-induced
subgraphs of T, multiple spatial cliques may contain the same (k,d)-MCC. To avoid
duplication, we assign a unique key to each (k,d)-MCC based on the vertices it
contains. Before a new (k,d)-MCC is collected into the result R, duplication is
checked by verifying whether its key already exists.
Example. We show an example of using Algorithm 1 to find (4,d)-MCCs. The input
social graph is the 4-truss in Figure 3-1(a) and its spatial network is in Figure 3-1(c).
Firstly, the maximal cliques are obtained and sorted by size (see Table 3.2). The
bound history and the corresponding (k,d)-MCCs after each iteration are displayed in
Table 3.3. The iteration stops when the bound b = 6 exceeds the sizes of all the
remaining cliques.
Table 3.2: Maximal cliques contained in Figure 3-1(c)

Card. | Cliques
8     | {a, b, c, d, e, h, i, j}
6     | {d, e, f, g, h, i}
4     | {r, s, p, q}, {m, n, k, l}
2     | {p, o}, {o, l}, {t, u}

Table 3.3: Enumeration trace

Iter. | Clique                   | Bound | R
0     | NULL                     | b = 0 | ∅
1     | {a, b, c, d, e, h, i, j} | b = 4 | {{a, c, b, j}, {d, e, h, i}}
2     | {d, e, f, g, h, i}       | b = 6 | {{d, e, h, g, f, i}}
3     | {r, s, p, q}             | b = 6 | {{d, e, h, g, f, i}}
Time complexity. The dominating part of Algorithm 1 is listing all maximal cliques,
which is O(3^{|V(T)|/3}) using the state-of-the-art algorithm bkp in [96]. The other
part is finding the k-trusses with maximum cardinality in T(A), where A is the vertex
set contained by a maximal clique of T′. To compute the maximum k-trusses in T(A),
we use the method in [99], whose time complexity is bounded by O(|E(T)|^{3/2}).
Therefore, the complexity of Algorithm 1 is O(3^{|V(T)|/3} + Σ_{A∈A} |E(T(A))|^{3/2}),
where A is the set of maximal spatial cliques contained in T′.
3.3.2 Efficient (k,d)-MCC Search
The baseline method finds (k,d)-MCCs in two steps. Firstly, it generates all vertex
sets meeting the requirement of spatial cohesiveness, i.e., spatial cliques. Secondly,
it verifies social cohesiveness for each generated spatial clique, finds the (k,d)-MCCs
in each clique, and selects the maximum ones.

However, a valid observation is that if we check social cohesiveness right after
a clique is generated, i.e., find the candidate (k,d)-MCC(s) in a found maximal
clique before enumerating all the remaining cliques, we can use the size of the largest
candidate (k,d)-MCC as a bound to stop generating unpromising cliques, i.e., cliques
that cannot contain larger (k,d)-MCCs. Moreover, as the size of the candidate
(k,d)-MCC(s) becomes larger, the pruning also becomes more effective.
As a result, in this section, we develop an efficient (k,d)-MCC search algorithm. It
differs from the baseline in two respects. Firstly, after a spatial clique is generated,
we search for the (k,d)-MCCs in the clique-induced social graph immediately, and
the bound is updated to the largest size of the (k,d)-MCCs found so far. Secondly,
before generating a clique, we check whether the current clique search branch is able
to generate candidate (k,d)-MCCs with sizes greater than the current bound; if not,
we terminate the branch.

The (k,d)-MCC search algorithm is shown in Algorithm 2. It is based on the
maximal clique enumeration algorithm [96], which is briefly reviewed in
Section 3.3.2.1, with four non-trivial modifications: (1) finding candidate maximum
(k,d)-MCCs immediately after generating a maximal clique (line 7); (2) terminating
a search branch if no larger (k,d)-MCCs can exist, based on four pruning conditions
(line 5), Section 3.3.2.2; (3) a heuristic rule to find larger (k,d)-MCCs at early stages
by carefully selecting promising vertices to expand the candidates (line 10),
Section 3.3.2.3; (4) reducing the cost of computing pruning conditions by reusing
previous results where possible (related to line 5), Section 3.3.2.4.
Algorithm 2: effiMCCSearch(T, T′)
1:  b ← 0, R ← ∅;
2:  mccbkp(∅, V(T′), ∅);
3:  return R;
4:  Procedure mccbkp(R, P, X)
5:    terminate this branch based on the termination conditions;
6:    if P ∪ X == ∅ then
7:      R′ ← find the maximum connected k-truss in T(R);
8:      b ← kdmccCollect(R, R′, b);
9:    u ← select a pivot from P;
10:   for each v ∈ P \ N(u, T′) do
11:     mccbkp(R ∪ {v}, P ∩ N(v, T′), X ∩ N(v, T′));
12:     P ← P \ {v};
13:     X ← X ∪ {v};
3.3.2.1 Revisit of Maximal Clique Enumeration
Maximal clique enumeration. bkp [96] works on three vertex sets R, P and X and
finds all the maximal cliques in T′. In each recursion state, R records the clique found
so far, P contains the vertices that may still be added to R, and X contains the
vertices that were previously added to R and are now explicitly excluded. P and X
are disjoint, and together they contain all the vertices that are adjacent to every
vertex in R. Initially, R and X are empty and P is V(T′). From P, bkp picks a
v ∈ P, adds v to R and removes v's non-neighbours from P and X, i.e., P ← P ∩
N(v, T′) and X ← X ∩ N(v, T′). Then bkp recursively calls itself and performs the
same operation on the newly generated R, P and X until P becomes empty. It then
reports a maximal clique if the current X is empty. The reason is that if X ≠ ∅, R is
not maximal, because a vertex in X could be added to R to form a larger clique.
After finishing the recursive search branch of adding v to R, bkp restores R, removes
v from P, adds v to X, and then expands R with the next vertex in P.
Pruning search branches with pivots. Given a search state R, P and X, let
u ∈ P (in fact, u can be chosen from P ∪ X). The intuition is that cliques generated
by expanding R with a vertex in P ∩ N(u, T′) can always be further expanded by
adding u subsequently. Therefore, it
is safe to expand R with P \N(u, T ′) only. To pursue the maximum pruning power,
a vertex u maximising |P ∩N(u, T ′)| shall be chosen, called a pivot.
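The pivoted enumeration described above can be sketched as follows (a minimal illustration over plain adjacency sets, not the thesis's implementation):

```python
# Bron-Kerbosch with pivoting (bkp) on sets R, P, X over an adjacency map.
# The pivot u maximises |P ∩ N(u)| (chosen from P ∪ X, as the footnote
# allows), and R is only expanded with vertices in P \ N(u).

def bkp(adj, R, P, X, out):
    if not P and not X:
        out.append(set(R))  # R is maximal: no vertex can extend it
        return
    u = max(P | X, key=lambda w: len(P & adj[w]))
    for v in list(P - adj[u]):
        bkp(adj, R | {v}, P & adj[v], X & adj[v], out)
        P.remove(v)         # v's cliques are fully explored; exclude it
        X.add(v)

adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
cliques = []
bkp(adj, set(), set(adj), set(), cliques)
print(cliques)  # the maximal cliques {a, b, c} and {c, d}, in some order
```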
Clearly, once a maximal clique is generated, we can search for (k,d)-MCCs
immediately (line 7, Algorithm 2). The largest size of the candidate (k,d)-MCCs
found so far is used as a bound. In the following, we focus on how the prunings, the
order heuristic, and computation reuse are implemented, respectively.
3.3.2.2 Terminating Unpromising Branches Earlier
The idea is to estimate an upper bound on the size of the (k,d)-MCCs in the current
search branch. If the upper bound is smaller than the current bound b, we terminate
the branch. There are four upper bounds. (1) If |R ∪ P| < b, we can terminate the
branch: if the largest possible clique is already smaller than b, any contained
(k,d)-MCC is also smaller than b. (2) Let K(R ∪ P) be the maximum connected
k-truss in the induced graph T(R ∪ P); then |V(K(R ∪ P))| is an upper bound. This
ignores the spatial constraints in T′. (3) The largest possible truss number within the
induced subgraph T′(R ∪ P) is an upper bound on the maximum clique size in T′.
This ignores the social constraints in T. (4) Combining (2) and (3), we can obtain a
tighter bound, defined based on the (k, k′)-truss below:
Definition 3.3.2. (k, k′)-truss. Given T , T ′ and a vertex set S such that S ⊆
V (T ) ∧ S ⊆ V (T ′), if T (S) is a connected k-truss in T and T ′(S) is a connected
k′-truss in T ′, we say (T (S), T ′(S)) is a (k, k′)-truss. For ease of discussion, we also
call S a (k, k′)-truss.
Let k′max be the largest possible truss number such that a (k, k′max)-truss is
contained in T(R ∪ P) and T′(R ∪ P); then k′max is a tight upper bound on the size
of the (k,d)-MCCs in the current recursion branch.
The above bounds are applied one after another in the order discussed. This is
because their computation costs increase accordingly, and we want to terminate an
unpromising branch as early as possible. If a pruning with a loose bound suffices,
we can avoid computing a tighter bound expensively.
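The cheapest-first ordering can be sketched as a short-circuiting check (illustrative only; `ktruss_bound`, `clique_bound` and `kk_bound` are hypothetical stand-ins for the routines computing bounds (2), (3) and (4)):

```python
# Apply the four termination bounds from cheapest to most expensive, so a
# loose bound can short-circuit the costlier ones.

def should_terminate(R, P, b, ktruss_bound, clique_bound, kk_bound):
    cand = R | P
    if len(cand) < b:            # (1) even |R ∪ P| itself is too small
        return True
    if ktruss_bound(cand) < b:   # (2) max connected k-truss in T(R ∪ P)
        return True
    if clique_bound(cand) < b:   # (3) max truss number of T'(R ∪ P)
        return True
    return kk_bound(cand) < b    # (4) tight (k, k'_max)-truss bound

# With b = 5 and a candidate set of size 3, check (1) fires and the
# (deliberately exploding, never-evaluated) expensive bounds are skipped:
boom = lambda s: 1 / 0  # would raise ZeroDivisionError if ever called
print(should_terminate({'a'}, {'b', 'c'}, 5, boom, boom, boom))  # -> True
```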
3.3.2.3 Search Order
Given that a larger (k,d)-MCC size helps the algorithm terminate earlier, we design
a heuristic rule that aims to obtain large (k,d)-MCCs first. The rule is as follows:
given a search state with R and P, when we need to select which vertex in
P \ N(u, T′) should be added to R first (line 10, Algorithm 2), we preferentially
choose from the vertices in the (k, k′max)-truss contained in T(R ∪ P) and
T′(R ∪ P), because adding such vertices is likely to generate larger (k,d)-MCCs.
Among the vertices in the (k, k′max)-truss, we add the vertex v with the largest
deg(v, T′), breaking ties arbitrarily.
3.3.2.4 Computation Reuse for Pruning
Computing the upper bounds in cases (2), (3) and (4) of Section 3.3.2.2 may not be
cheap, even though truss decomposition takes polynomial time [51]. However, a nice
observation is that a search state (R, P, X) and its child state (Rc, Pc, Xc) are likely
to have similar truss results. Suppose Rc = R ∪ {v} and Pc = P ∩ N(v, T′); it is
easy to see that Rc ∪ Pc ⊆ R ∪ P. As a result, the maximum k-trusses in
T(Rc ∪ Pc) are subgraphs of the maximum k-trusses in T(R ∪ P), so the
computation can be done incrementally using truss maintenance techniques [51], by
passing the existing T(R ∪ P), T′(R ∪ P) and truss indices to the child recursions.
Similarly, the (k, k′max)-truss can be computed incrementally as well.
On the other hand, there are some special cases where we can cheaply determine
that a child state cannot be pruned: (1) if |R ∪ P| = |Rc ∪ Pc|, the child state's
upper bounds are the same as the parent's; (2) let K be the maximum k-truss in
T(R ∪ P); if V(K) ⊆ Rc ∪ Pc, the child state cannot be pruned; (3) let S be the
(k, k′max)-truss in T(R ∪ P) and T′(R ∪ P); if S ⊆ Rc ∪ Pc, the child state cannot
be pruned. Proofs are omitted as the correctness is obvious.
3.3.2.5 Example and Discussion
Example. We show an example of using Algorithm 2 to search for (k,d)-MCCs,
given the T and T′ in Figures 3-1(a) and (c). Initially, R = ∅ and P = {a, . . . , u}.
Algorithm 2 tries to terminate the recursions by computing all four upper bounds,
firstly producing a (4,6)-truss with vertices {d, e, f, g, h, i}. Then pivot h is selected,
so Algorithm 2 only needs to expand R from P \ N(h, T′) = {h, r, s, q, p, o, m, n, k,
l, t, u} rather than from P. Next, based on the order heuristic, h is added to R,
reducing P to {a, b, c, d, e, f, g, i, j}. Such recursions continue until the first
(k,d)-MCC, {d, e, h, g, f, i}, is discovered, while the bound computation can be
reused from the step where d is added to R = {h}. After the first result is produced,
b is updated to 6. Using this bound, when Algorithm 2 backtracks to the recursion
state in which r is added to R, with R = {r}, P = {s, p, q} and X = ∅, the first
upper bound pruning condition terminates this branch because |R ∪ P| < 6. Other
search branches that would find the cliques in Table 3.2 are also pruned by the
proposed termination conditions.
Discussion. In [109], Zhang et al. proposed the (k, r)-core, which uses k-core instead
of k-truss to represent social cohesiveness. For comparison, we adapt the AdvMax
algorithm proposed in [109] to finding (k,d)-MCCs and denote it as KRM. KRM may
have a smaller search space, because the social constraint on T is also checked during
clique enumeration. However, after incorporating the social constraint check along
the way, the powerful pivot-based pruning for clique enumeration cannot be used,
because the classic pivot pruning works only for the structural part (which might be
why [109] used binary search rather than bkp). On the other hand, we have studied
adapting the pivot idea to consider both structural and social constraints.
Unfortunately, determining such pivots is very complicated and their pruning power
cannot be guaranteed. An experimental performance comparison between our
algorithms and KRM can be found in Section 3.5.
3.3.3 Prunings before (k,d)-MCCs Enumeration
In practice, a maximal connected k-truss T and its corresponding spatial neighbour-
hood network T ′ can be pruned before (k,d)-MCC enumeration. The aim is to reduce
the size of the input as much as possible so that (k,d)-MCC enumeration can be more
efficient.
Pruning vertices in T ′ (I). We introduce a k-truss property first, followed by the
explanation and the pruning rule.
Property 3.3.1. Every vertex v in a k-truss graph T has deg(v, T) ≥ k − 1 [26].

Intuitively, if T(V(T′)) contains a k-truss, then each v ∈ V(T′) should have at
least k − 1 neighbours in T. Accordingly, v should also have at least k − 1 neighbours
in T′.

Pruning Rule 3.3.1. For each v ∈ V(T′), if deg(v, T′) < k − 1, v can be pruned
from T′.
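Pruning Rule 3.3.1, applied repeatedly until it stabilises, can be sketched as a degree-peeling loop (illustrative, not the thesis's code; `prune_low_degree` is an assumed name):

```python
# Iteratively delete vertices with deg(v, T') < k - 1. Removing a vertex
# lowers its neighbours' degrees, which may trigger further removals (part
# of the cascading pruning effect described in this section).

def prune_low_degree(adj, k):
    adj = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    queue = [v for v in adj if len(adj[v]) < k - 1]
    while queue:
        v = queue.pop()
        if v not in adj:
            continue
        for u in adj.pop(v):
            adj[u].discard(v)
            if len(adj[u]) < k - 1:
                queue.append(u)
    return adj

# A triangle with a pendant vertex: for k = 3, 'd' (degree 1) is pruned, and
# the triangle a-b-c survives since every remaining degree stays >= 2.
adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
print(sorted(prune_low_degree(adj, 3)))  # -> ['a', 'b', 'c']
```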
Pruning edges in T′ (II). Next, we show another k-truss property, which can be
used to prune edges in the spatial network T′. The idea is that if two vertices are so
far from each other in T that they cannot be in the same k-truss, then even though
they are spatially close in T′, their link in T′ can be discarded when enumerating
(k,d)-MCCs.

Property 3.3.2. The structural diameter of a connected k-truss T with |V(T)|
vertices is no more than ⌊(2|V(T)| − 2)/k⌋ [26].

Pruning Rule 3.3.2. Given T and T′, let T′0 be a connected component of T′.
An edge e(u, v) ∈ E(T′0) can be pruned if gd(u, v, T(V(T′0))) > ⌊(2|V(T′0)| − 2)/k⌋,
where gd(u, v, T(V(T′0))) denotes the distance between u and v in T(V(T′0)).
Pruning Rule 3.3.2 is correct: suppose vertices u, v co-exist in a connected k-truss
K ⊆ T(V(T′_0)); then gd(u, v, T(V(T′_0))) ≤ gd(u, v, K) ≤ ⌊(2|K| − 2)/k⌋ ≤
⌊(2|V(T′_0)| − 2)/k⌋, which contradicts the pruning condition.
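Pruning Rule 3.3.2 can be sketched as a BFS distance check against the diameter bound ⌊(2n − 2)/k⌋. The helper names and adjacency representation below are illustrative assumptions, not the thesis implementation.

```python
from math import inf

def bfs_dist(adj, src, dst):
    """Hop distance between src and dst in the structural graph."""
    if src == dst:
        return 0
    seen, frontier, d = {src}, [src], 0
    while frontier:
        d += 1
        nxt = []
        for v in frontier:
            for u in adj.get(v, ()):
                if u == dst:
                    return d
                if u not in seen:
                    seen.add(u)
                    nxt.append(u)
        frontier = nxt
    return inf

def edge_prunable(u, v, struct_adj, n, k):
    # Pruning Rule 3.3.2: an edge (u, v) of T'_0 can be dropped when the
    # structural distance of u and v in T(V(T'_0)) exceeds the k-truss
    # diameter bound floor((2n - 2) / k), with n = |V(T'_0)|.
    return bfs_dist(struct_adj, u, v) > (2 * n - 2) // k
```

For example, on a structural path a-b-c-d with n = 4 and k = 3 the bound is 2, so a spatial edge between a and d (structural distance 3) is prunable while one between a and c is not.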
Pruning vertices in T (III). Let D be a set of vertices pruned from T ′, obviously
they should also be removed from T . After removing D from T , another set of vertices
D′ in T may be further removed due to truss maintenance [51]. D′ will need to be
pruned from T ′.
Cascading pruning effect. We summarise the cascading pruning effect here:
(1) Pruning I triggers Prunings II and III, because after pruning vertices in T′,
gd(u, v, T(V(T′_0))) becomes larger and ⌊(2|V(T′_0)| − 2)/k⌋ becomes smaller, so more
edges may be further pruned from T′; also, the vertices pruned from T′ should be
removed from T. (2) Pruning II triggers Pruning I, because after some edges are
pruned, some vertex degrees decrease, which may lead to new vertices being pruned
from T′. (3) Pruning III triggers Pruning I, as explained above.
In implementation, vertex Prunings I and III are prioritised as they are cheap. A
shortest-path index [104] is maintained to support Pruning II. Pruning stops when no
more changes occur.
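The cascading effect can be organised as a generic fixpoint driver that restarts from the cheapest rule whenever any rule removes something. This is only a sketch of the control flow; the toy rules in the usage below stand in for Prunings I to III.

```python
def cascade(rules, state):
    """Run pruning rules to a fixpoint, cheapest rules first.

    rules: callables ordered by increasing cost; each mutates `state`
    and returns True iff it removed something.  Whenever a rule fires,
    we restart from the first (cheapest) rule, mirroring the cascading
    effect of Prunings I to III described above.  The loop terminates
    because every firing strictly shrinks the state.
    """
    i = 0
    while i < len(rules):
        i = 0 if rules[i](state) else i + 1
    return state
```

A toy usage: with rules that each delete one specific element, `cascade([rm(2), rm(3)], {1, 2, 3})` repeatedly restarts until neither rule fires, leaving `{1}`.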
3.4 Finding Spatial Approximate Result
Since all the exact algorithms proposed in Section 3.3 can take exponential time in the
worst case, we aim to design a polynomial-time algorithm by relaxing the spatial
constraint. The polynomial algorithm approximately finds co-located communities,
which are still k-trusses but allow vertices within longer distances, and the spatial
distances can be theoretically bounded. We firstly discuss how a (k,d)-MCC should be
approximated. Then we propose three types of search bound regions: the upper bound
region, the tight bound region and the error-bounded region. Using these bound
regions, we design an algorithm that finds approximate results meeting a
user-specified spatial error ratio requirement.
3.4.1 How to Approximate (k,d)-MCCs
It is desirable to have efficient algorithms for finding approximate results.
Accordingly, several questions are interesting: How should approximation be defined?
What are good approximation results? Can users specify their own approximation
preference, i.e., to what extent the discovered results are approximate? We answer
these questions in this section.
Define approximation. Firstly, a (k,d)-MCC is considered cohesive both struc-
turally and spatially. In general, both structural and spatial constraints can be re-
laxed, however, since the exponential number of exact (k,d)-MCCs comes from check-
ing spatial constraints, we only study the approximate results with spatial constraints
relaxed. Let us define an α-approximation of a (k,d)-MCC below:
Definition 3.4.1. Approximate (k,d)-MCC. Let J be a (k,d)-MCC and J′ be a (k,d′)-CC
satisfying J ⊆ J′ and d ≤ d′. We consider J′ an α-approximation of J with
spatial error ratio α = d′/d, where α ∈ [1, +∞).
Here, J ′ is a k-truss with the maximum distance between vertices in V (J ′) no
more than d′. Technically, α can be less than 1, but this is not desired.
Reasonable approximation. From the definition, the α-approximation of a (k,d)-MCC
is not unique: the maximal α-approximation of a (k,d)-MCC is a (k,αd)-MCC, while
the minimal α-approximation of a (k,d)-MCC is the (k,d)-MCC itself. Both the maximal
and minimal α-approximations lead to an exponential number of co-located communities.
As a result, a polynomial algorithm that can discover any α-approximation should
suffice. However, superiority does exist among approximations, e.g., let J′1, J′2
be two α-approximations of a particular (k,d)-MCC; if J′1 ⊆ J′2, then J′1 is
considered better than J′2, because J′1 is “cleaner”.
Specify error ratio. Ideally, users should be able to specify their preferred spatial
error ratios, because different users may have different requirements. With a given
spatial error ratio α, the approximate algorithm finds α-approximation results
accordingly. In the next section, we introduce how a spatial index is used to
guarantee the error ratio.
Figure 3-2: Rectangular regions. (a) Upper bound; (b) Tight bound; (c) Vertex residence; (d) Truss residence.
3.4.2 Spatial Index and Search Bounds
The idea of searching in polynomial time is to delegate spatial constraint checking
to a spatial index. The outcome is that, with the index, we can cheaply locate a
region or a (limited) number of regions within which we only need to check structural
constraints when searching for k-trusses, because the user-specified spatial error
ratio is guaranteed for the k-truss results discovered within the located regions. In
the following, we first introduce the spatial index, and then elaborate on how the
two typical bound regions and the error-bounded search regions are identified.
Space division. We consider the space divided into equal-sized cells, each a w × w
square; w is fixed once the space is divided. Vertices are distributed into the
cells. If a cell is not empty, we call it a landmark cell. Rectangle regions and
square regions, used later, are defined below.
Definition 3.4.2. Rectangle region. A rectangle region is a subspace of the entire
space, with a rectangle shape containing only complete cells. Square region is defined
similarly.
Now the problem is, for each landmark cell m, to identify proper square bound
regions from which k-trusses should be discovered. Two types of bound regions are
interesting: (a) the upper bound region is a big bound region that can cover all the
exact (k,d)-MCCs; (b) the tight bound regions are a set of regions, each of which
covers some exact (k,d)-MCCs and they together cover all the exact (k,d)-MCCs. The
tight bound regions can provide the best possible error ratio among all the bound
regions covering the exact (k,d)-MCCs. In the following, we introduce them in detail.
Upper bound region. Given a landmark cell m and a distance d, the upper bound
region rm identified by m is an area covering all possible vertices whose distances to
every vertex in m are no greater than d. Apparently, the theoretical upper bound
region is irregular. However, for easy computation, we define square upper bound
region as the minimal square region that covers it, formally: let ζ be the integer
such that (ζ − 1)w < d ≤ ζw; the square region centred at m with side size (2ζ + 1)w
is the square upper bound region. In later discussions, upper bound region is used
as short for square upper bound region, denoted as rm.
In Figure 3-2 (a), we show two landmark cells m1 and m2 and their upper bound
regions in red and blue. Next, we show the spatial error ratio of the upper bound
region.
Lemma 3.4.1. The spatial error ratio of the upper bound region is 2√2 + ε, where
ε = 3√2·w/d.

Proof sketch. The upper bound region rm is a square with side size (2ζ + 1)w, so
the longest distance d_rm within rm is bounded by the diagonal, d_rm ≤ √2(2ζ + 1)w =
2√2·ζw + √2·w. Combining this with ζw < d + w (from (ζ − 1)w < d), we obtain
d_rm ≤ 2√2·d + 3√2·w, and therefore d_rm/d ≤ 2√2 + ε, where ε = 3√2·w/d.
The 2√2 + ε error ratio may be too loose for most applications, because spatial
closeness has been relaxed to nearly 3d (2√2 + ε ≈ 3). In the following, we
introduce tight bound regions, which bound approximate results within a √2 + ε′
error ratio.
Tight bound region. The upper bound region has side size (2ζ+1)w > 2d. On the
other hand, we observed that an exact (k,d)-MCC must be able to fit into a d-square
(a square with side size d). This motivates us to search for the approximate results
from square regions as small as possible while still not losing any exact (k,d)-MCCs.
To this end, we define the tight bound region formally as follows: let ζ be the
integer such that (ζ − 1)w < d ≤ ζw; a square region containing m with side size
(ζ + 1)w is a square tight bound region. Again, we use tight bound region as short
for square tight bound region. Note that a landmark cell m identifies ζ² tight
bound regions, because there are ζ² squares of side size (ζ + 1)w within a square
of side size (2ζ + 1)w.
Lemma 3.4.2. The spatial error ratio of a tight bound region is √2 + ε′, where
ε′ = 2√2·w/d.

The proof is similar to that of Lemma 3.4.1.
In Figure 3-2 (b), we show three possible tight bound regions (dashed squares) for
the landmark cell m2, supposing ζ has been identified as 2.

The bound regions discussed so far are typical cases providing dedicated spatial
error ratios. Next, we discuss how to identify bound regions satisfying a
user-specified error ratio.
Error-bounded region. Given a landmark cell m and a distance d, let α be a
user-given error ratio; then αd is the maximum spatial distance allowed. Let ζ′ be
an integer; an error-bounded region is a square region containing the landmark cell
m with side size (ζ′ + 1)w, where ζ′ = argmax{ζ′ ∈ Z | √2(ζ′ + 1)w ≤ αd}. With
w, d, α given, ζ′ can be determined as ⌊αd/(√2·w)⌋ − 1. After that, all the square
regions containing m with side size (ζ′ + 1)w, taken from the square region centred
at m with side size (2ζ′ + 1)w, are retrieved as α-error-bounded regions.
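The choice of ζ′ can be computed directly; a small sketch verifying that ⌊αd/(√2·w)⌋ − 1 is indeed the largest ζ′ with √2(ζ′ + 1)w ≤ αd:

```python
import math

def zeta_prime(w, d, alpha):
    """Largest integer z satisfying sqrt(2) * (z + 1) * w <= alpha * d,
    i.e. z = floor(alpha * d / (sqrt(2) * w)) - 1."""
    return math.floor(alpha * d / (math.sqrt(2) * w)) - 1
```

For instance, with w = 100, d = 2000 and α = 2, we get ζ′ = 27: the diagonal of a 28-cell square is about 3960 m ≤ 4000 m, while a 29-cell square would exceed αd.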
3.4.3 Prunings
Inspecting error-bounded regions for all landmark cells is costly. If we can somehow
know that the k-trusses covered by one error-bounded region r1 are a subset of those
covered by another error-bounded region r2, then r1 can be discarded. Next, we
introduce how such prunings are supported by checking containment between vertex
residence regions and truss residence regions.
Definition 3.4.3. Vertex residence region. Given a square region r, the vertex
residence region Rev(r) is the minimum rectangle subspace of r containing all the
vertices in r.
For example, Figure 3-2 (c) shows the vertex residence regions of the two upper
bound regions from Figure 3-2 (a) (assuming all the vertices are in the same k-truss).
Pruning Rule 3.4.1. Pruning vertex residence region. Let r1, r2 be two
bound regions, Rev(r1), Rev(r2) be two vertex residence regions, r1 can be pruned
if Rev(r1) ⊆ Rev(r2).
For example, in Figure 3-2 (d), since the vertex residence region within the red
rectangle contains the vertex residence region within the blue, the inner blue rectangle
can be pruned.
Defining the vertex residence region as a rectangle region makes computation easy.
An alternative is to consider only the non-empty cells as the vertex residence
region; although this provides better pruning power, checking containment between
irregular regions is more expensive.
Similarly, a more powerful pruning condition is to consider only those k-truss
vertices.
Definition 3.4.4. Truss residence region. Given a region r, let K(r) be the
k-truss contained in r; the truss residence region Ret(r) is the minimum rectangle
subspace of r that contains all the vertices V(K(r)). (K(r) may be a disconnected
graph; it may have a set of connected components, each of which is a connected k-truss.)
Pruning Rule 3.4.2. Pruning truss residence region. Let r1, r2 be two bound
regions, Ret(r1), Ret(r2) be two truss residence regions, if Ret(r1) ⊆ Ret(r2), r1 can
be pruned.
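Both residence-region pruning rules reduce to an axis-aligned rectangle containment test; a minimal sketch (the tuple encoding of rectangles is an assumption made for illustration):

```python
def contains(outer, inner):
    """Axis-aligned containment of residence rectangles, each given as
    (x_min, y_min, x_max, y_max).  Pruning Rules 3.4.1 / 3.4.2 discard
    a bound region whose (vertex or truss) residence rectangle lies
    entirely inside another region's residence rectangle."""
    ox1, oy1, ox2, oy2 = outer
    ix1, iy1, ix2, iy2 = inner
    return ox1 <= ix1 and oy1 <= iy1 and ix2 <= ox2 and iy2 <= oy2
```

The test is four comparisons per pair, which is why rectangles are preferred over irregular cell unions despite their weaker pruning power.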
3.4.4 Error-bounded Approximation Algorithm
In this section, we introduce the entire error-bounded approximation algorithm.
Similar to the exact algorithms, we start with a maximal connected k-truss T from the
structural point of view. Recall that T does not necessarily satisfy the spatial
constraint that every two vertices are within distance d. The key idea of the
approximate (k,d)-MCC search is as follows. Firstly, we retrieve the landmark cells
(denoted as M) which together hold all the vertices of T, i.e. V(T) ⊆ V(M).
Secondly, for each landmark cell m ∈ M, the algorithm computes the α-error-bounded
regions of m. Thirdly, for each α-error-bounded region, local maximum structural
k-trusses are identified. Lastly, the final maximums are selected among the local
maximums and returned as the approximate (k,d)-MCCs with the guaranteed spatial
error ratio α.
Algorithm 3: apxSearch(T, M, α)
1:  b ← 0, R ← ∅;
2:  prune each landmark cell m ∈ M by applying Pruning Rules 3.4.1 and 3.4.2 on the upper bound region rm of m;
3:  sort all m ∈ M according to the size of K(rm) in descending order;
4:  for m ∈ M do
5:      if |V(K(rm))| ≥ b then
6:          P ← generate error-bounded regions with ζ′ = ⌊αd/(√2·w)⌋ − 1;
7:          prune P based on Pruning Rule 3.4.1;
8:          sort all p ∈ P according to |Vp| in descending order;
9:          for p ∈ P do
10:             if |Vp| ≥ b then
11:                 R′ ← maximum connected k-trusses in T(Vp);
12:                 b ← kdmccCollect(R′, R, b);
13: return R;
Algorithm 3 shows how to search approximate (k,d)-MCCs given a spatial error
ratio α. It firstly prunes landmark cells using Pruning Rules 3.4.1 and 3.4.2.
Figure 3-3: TQ-tree
Table 3.4: Truss ids and union of truss-to-vertex descriptions

Truss ID | k | Vertices
t1       | 2 | a . . . u and 1
t1.1     | 4 | a . . . u
t1.1.1   | 5 | d, e, h, i, f

Vertex | Truss ID
a      | t1, t1.1
b      | t1, t1.1
...    | ...
Then, for each m, it generates error-bounded regions and prunes them according to
Pruning Rule 3.4.1 (lines 6 and 7). For each search region, it then computes the
local approximate results in T(Vp). Rather than computing k-trusses from scratch, we
compute them incrementally using K(rm); in this way, duplicated computation is
avoided. To terminate the algorithm early, the size of the best approximate result
found so far is used as a bound (lines 5 and 10), and the search bound regions are
sorted (lines 3 and 8).
Time complexity. The time complexity of Algorithm 3 is O(|V(T)| · (⌊αd/(√2·w)⌋ − 1)² ·
|E(T)|^{3/2}). The running time of Algorithm 3 is dominated by the nested loop. The
outer loop is bounded by |V(T)|, since there are at most |V(T)| landmark cells. The
inner loop is bounded by (⌊αd/(√2·w)⌋ − 1)², the number of error-bounded regions.
The computation inside the nested loop is dominated by truss maintenance, bounded
by |E(T)|^{3/2}.
3.4.5 Truss Attributed Quadtree Index
In this section, we show how to index the divided cells associated with useful truss
pre-computations using the proposed truss attributed quad tree, denoted by TQ-tree.
Table 3.5: Description files for a branch

q1:       t1.1: q1.1, q1.2, q1.3;  t1.1.1: q1.2
q1.2:     t1.1: q1.2.3, q1.2.4;  t1.1.1: q1.2.3, q1.2.4
q1.2.3:   t1.1: q1.2.3.2, q1.2.3.4;  t1.1.1: q1.2.3.2, q1.2.3.4
q1.2.3.4: t1.1: e, f, i;  t1.1.1: e, f, i
It speeds up the operation of retrieving the vertices of a k-truss contained in a
region. Besides, it offers multiple choices of cell size, helping locate the
error-bounded regions efficiently.
TQ-tree components. We firstly introduce several components of the index: truss
list, vertex to truss description, truss to quad description and truss vertex to leaf
quad description below:
A truss list contains all connected trusses in a graph G. For each truss t, an
identifier is assigned, denoted as t.id. Since a truss with a small k may contain
trusses with larger k’s, the id we assign to a truss is similar to Dewey Decimal, which
explicitly expresses the containment relationships. For instance, in Figure 3-1 (a), we
have truss ids for different k’s in Table 3.4.
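With Dewey-style ids, checking whether one truss contains another reduces to a dotted-prefix test on the ids, e.g.:

```python
def truss_contains(outer_id, inner_id):
    """Dewey-style truss ids encode containment as a prefix relation:
    t1.1 lies inside t1, and t1.1.1 inside both."""
    outer = outer_id.split('.')
    return inner_id.split('.')[:len(outer)] == outer
```

Splitting on dots (rather than comparing raw string prefixes) avoids false positives such as treating t1.12 as contained in t1.1.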
Given a vertex v, the vertex-to-truss description returns all identifiers of the
connected trusses containing v. Using a hash table, given a vertex v and a k-truss t,
we can check whether t contains v in constant time.
Given a truss identifier tid and a non-leaf quad q of the TQ-tree, the truss-to-quad
description for q returns the direct children of q that contain at least one vertex
of the truss identified by tid.

Given a truss identifier tid and a leaf quad ql of the TQ-tree, the
truss-vertex-to-leaf-quad description returns all vertices of the truss identified
by tid located in ql.
TQ-tree. A TQ-tree is a quadtree indexing all divided cells, in which the divided
cells are the leaf quads. Each non-leaf quad, in addition to its spatial quaternary
information, is attached a list of truss-to-quad descriptions. Similarly, each leaf
quad, in addition to its bounding quad information, also contains a list of
truss-vertex-to-leaf-quad descriptions.
For example, the TQ-tree for the dataset displayed in Figure 3-1 is shown in
Figure 3-3 (we only show the partial quadtree for the 4-truss). The ids for all
trusses, together with the union of the vertex-to-truss descriptions for the
vertices in t1.1, are shown in Table 3.4. The description files for the branch q1
to q1.2.3.4 are displayed in Table 3.5.
Retrieving vertices from a region. In this section, we show that, given T and a
query region r, the running time of retrieving the vertices of T in r is bounded by
O(|Vr|) using the TQ-tree, where |Vr| is the number of vertices of T in r. The idea
is as follows: we use the boundaries provided by r and the description files attached
to the TQ-tree to explore a limited number of branches, and obtain the content quads
via depth-first traversal. More specifically, we start from the root of the TQ-tree
and collect all quads such that (1) they contain vertices of T, and (2) they lie
within the spatial space covered by r. Given a quad q in the TQ-tree, checking
whether a child of q is within the boundaries of r and whether the child contains T
can each be done in constant time, so reaching a vertex of T in r depends only on the
height of the tree, which is fixed once the TQ-tree is created. Therefore, the
running time of retrieving the content quads is proportional to |Vr|.
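The traversal can be sketched as follows; the dict-based quad layout (`box`, `children`, `truss_quads`, `truss_vertices`) is a hypothetical stand-in for the TQ-tree node structure described above.

```python
def intersects(a, b):
    """Overlap test for axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def collect(quad, region, tid, out):
    """Depth-first TQ-tree walk.  Only branches that intersect `region`
    AND (per the truss-to-quad descriptions) contain vertices of truss
    `tid` are explored, so the cost is proportional to the number of
    matching vertices plus the tree height."""
    if not intersects(quad['box'], region):
        return
    if not quad['children']:                      # leaf quad
        out.extend(quad['truss_vertices'].get(tid, []))
        return
    for i in quad['truss_quads'].get(tid, ()):    # promising children only
        collect(quad['children'][i], region, tid, out)
```

Pruning on both conditions at every level is what keeps the output-sensitive O(|Vr|) bound: a branch holding no vertex of the truss, or lying outside r, is never descended into.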
Searching granularity. Intuitively, different layers of TQ-tree divide the whole
space into cells with different sizes. When the height of TQ-tree increases, the space
is divided into cells with smaller sizes. Therefore, according to the query distance,
we may vary the search granularity, that is, we can select the minimum cells that
construct the bound regions discussed in Section 3.4.2. Given a TQ-tree with height
h, the number of possible search granularities equals h. The possible sizes of the
search granularity are {2^x · q.w | 0 ≤ x ≤ h}, where q.w is the size of the leaf
quads in the TQ-tree.
Selecting the search granularity affects the search efficiency and effectiveness. For
instance, if we select a search granularity as small as a dozen meters while the
query distance is over a thousand meters, constructing an error-bounded region from
such small leaf quads would be inefficient. On the other hand, selecting a search
Table 3.6: Implemented algorithms

Name      | Algorithm
Exact     | Algorithm 1 + pre-pruning
EffiExact | Algorithm 2 + pre-pruning
AdvMax    | Algorithm in [109] for searching a maximum (k,r)-core
AdvMaxAll | Adapted AdvMax to find all maximum (k,r)-cores
KRM       | Adapted AdvMax to find all (k,d)-MCCs + pre-pruning
Apx1      | Algorithm 3 with α = 2√2 + ε and index
Apx2      | Algorithm 3 with α = √2 + ε′ and index
Apx1Ini   | Apx1 without index
Apx2Ini   | Apx2 without index
SAC       | Algorithm Exact+ in [34]
GeoModu   | Algorithm in [17]
Table 3.7: Statistic information in datasets

Dataset             | #vertices | #edges     | #checkins  | kmax
Gowalla [23]        | 196,591   | 950,327    | 6,442,890  | 29
Brightkite [23]     | 58,228    | 214,078    | 4,491,143  | 43
Foursquare [88, 64] | 4,899,219 | 28,484,755 | 1,021,970  | 16
Weibo [65]          | 1,019,055 | 32,981,833 | 32,981,833 | 11
Twitter [65]        | 554,372   | 2,402,720  | 554,372    | 16
granularity much larger than the query distance gets results very quickly, but the
error ratio would be large, i.e., the effectiveness would be low. We discuss choices
of search granularity that balance search efficiency and effectiveness in the
experimental studies.
3.5 Experimental Results
In this section, we test all algorithms in Table 3.6 on a Mac with an Intel
i7-4870HQ (3.7GHz) CPU and 32GB of main memory.
Datasets. We conducted the experiments over five real social network datasets:
Gowalla, Brightkite, Foursquare, Weibo and Twitter. Each social user has some
check-in locations. Table 3.7 presents the statistics of all datasets. Since we only
need one check-in per vertex, we select the most frequent check-in as the spatial
coordinate of a vertex that has multiple check-ins.
Table 3.8: Parameter settings

Parameter | Range                                        | Default value
k         | Gowalla, Brightkite: 5, 7, 9, 11, 13, 15, 17 | 11
k         | Weibo: 5, 6, 7, 8, 9, 10, 11                 | 9
k         | Twitter, Foursquare: 3, 5, 7, 9, 11, 13, 15  | 11
d         | 500, 1000, 1500, 2000, 2500, 3000            | 2000
q.w       | 100, 200, 400, 800, 1600                     | 400
n         | 20%, 40%, 60%, 80%, 100%                     | 100%
Table 3.9: TQ-tree construction

Dataset    | Time (Sec) | Space (MB)
Gowalla    | 101        | 31
Brightkite | 75         | 17
Foursquare | 5102       | 1680
Weibo      | 4812       | 1423
Twitter    | 473        | 152
Parameter settings. The experiments are evaluated using different settings of query
parameters: k (the minimum truss number) and d (the distance threshold, in meters)
as well as different settings of dataset parameters: q.w (the search granularity) and
n (the percentage of vertices). The ranges of parameters and their default values are
shown in Table 3.8, in which we select reasonable k based on datasets. Furthermore,
when we vary the value of a parameter for evaluation, all the other parameters are
set as their default values.
Index construction. Space division: the width of minimum cells in TQ-tree for the
whole space is set to be 100 meters. The index construction time and size for each
dataset are displayed in Table 3.9.
3.5.1 Efficiency Evaluation
Scalability. To verify the scalability of our algorithms, AdvMaxAll (adapted Ad-
vMax [109] for searching all maximum (k′, r)-cores) and KRM (searching all (k,d)-
MCCs), we choose different sizes of sub-datasets by selecting different percentages
of vertices in each dataset. For AdvMaxAll, we set k′ as k-1, where k is the default
value for the corresponding dataset. We implemented KRM by adapting the problem
Figure 3-4: Scalability. (a) Gowalla; (b) Brightkite; (c) Foursquare; (d) Twitter; (e) Weibo. (Running time vs. percentage of vertices for Exact, EffiExact, Apx1, Apx2, KRM and AdvMaxAll.)
setting of AdvMax from k-core to k-truss. From the results in Figures 3-4 (a) to (e),
we can see that the exact algorithms run much slower when the data size is equal
to or larger than 80%. For the approximate algorithms, however, the time costs
increase almost linearly with the data size on all datasets. On average, our
EffiExact outperforms KRM by 30%, and AdvMaxAll outperforms EffiExact by only 10%.
Surprisingly, AdvMaxAll does not outperform EffiExact significantly. This is mainly
because the set of candidate vertices for searching (k,r)-cores is larger than that
for searching (k,d)-MCCs on all real datasets. In the following experiments, we
focus only on our algorithms (i.e., excluding KRM and AdvMaxAll).
Effect of k. Figures 3-5 (a) to (e) evaluate the performance of the algorithms when
we vary the value of k. In general, all algorithms take less time as k increases,
because increasing k decreases the sizes of the k-trusses. EffiExact runs
consistently faster than Exact, especially when k is large. In addition, the study
shows that the approximate algorithms greatly outperform the exact algorithms: the
performance is improved by two orders of magnitude on average over all datasets.
Compared with Apx2, Apx1 is much faster due to its looser accuracy guarantee.
Although Apx2 is slower, it provides more effective results, which will be discussed
in Section 3.5.2.
Effect of d. Figures 3-6 (a) to (e) show the time cost when we vary the distance d
from 500 to 3000. As d increases, the time cost of the exact algorithms grows
exponentially, because increasing d requires the algorithms to explore a larger
spatial neighbourhood network. The experimental results also confirm our theoretical
analysis that the time complexities of the exact algorithms are exponential in the
size of the spatial neighbourhood network. In our experiments, EffiExact is faster
than Exact by 3 to 5 times. Unlike the exact algorithms, the time cost of the
approximate algorithms increases slowly on all datasets. In most cases, the
approximate algorithms answer a (k,d)-MCC search within 10 seconds, which supports
real-time search; only on Foursquare does a (k,d)-MCC search take about 40 seconds.
Effect of granularity. Figures 3-7 (a) to (e) demonstrate the time cost when we vary
Figure 3-5: Effect of k. (a) Gowalla; (b) Brightkite; (c) Foursquare; (d) Twitter; (e) Weibo. (Running time vs. k for Exact, EffiExact, Apx1 and Apx2.)
Figure 3-6: Effect of d. (a) Gowalla; (b) Brightkite; (c) Foursquare; (d) Twitter; (e) Weibo. (Running time vs. d for Exact, EffiExact, Apx1 and Apx2.)
Figure 3-7: Effect of search granularity. (a) Gowalla; (b) Brightkite; (c) Foursquare; (d) Twitter; (e) Weibo. (Running time vs. search granularity for Apx1, Apx2, Apx1Ini and Apx2Ini.)
Figure 3-8: Exact algorithm pruning effectiveness. (a) Varying k; (b) Varying d. (Vertices filtering ratio for Gowalla, Brightkite and Foursquare.)
the search granularity. To show the power of the index, we also implemented Apx1 and
Apx2 without index support, denoted Apx1Ini and Apx2Ini respectively. Overall, the
time cost of both algorithms decreases as the search granularity increases, because
the space is divided into fewer cells under a larger search granularity, and the
time complexity is proportional to the number of cells. Notably, Apx2Ini is very
sensitive to the granularity, since its time complexity is proportional to ζ⁴.
3.5.2 Effectiveness Evaluation
3.5.2.1 Exact Algorithms
We present the trends of the pre-pruning effectiveness in the exact algorithms when
the parameters k and d vary. To show the effectiveness, we introduce two metrics below.

Metrics. Let T be the union of maximal connected k-trusses in G, and let G′ be the
graph after pruning. The vertex pruning ratio is measured by |V(G′)|/|V(T)|. Let t1
and t2 be the running times of an algorithm with and without applying the pruning
rules, respectively. The time saved ratio is defined as (t2 − t1)/t2.
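The two metrics are simple ratios; a sketch, with the formulas taken as defined above:

```python
def vertex_pruning_ratio(v_after, v_truss):
    # |V(G')| / |V(T)| as defined in the text
    return v_after / v_truss

def time_saved_ratio(t_pruned, t_plain):
    # (t2 - t1) / t2, where t1 applies the pruning rules and t2 does not
    return (t_plain - t_pruned) / t_plain
```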
Effect of k. Figure 3-8 (a) reports the vertex pruning ratios as we change the k value.
For datasets Gowalla and Brightkite, their pruning effectiveness becomes higher as
k value increases. Interestingly, for the datasets Weibo, Foursquare and Twitter,
their pruning effectiveness becomes higher at the beginning and then decreases when
Figure 3-9: Effectiveness of pruning rules. (a) Pruning rules 1 and 2; (b) Pruning rules 3 and 4. (Pruning ratios on the left scale, time saved ratios on the right scale, for Gowalla, Brightkite, Weibo, Twitter and Foursquare.)
the value of k increases further. The main reason is that, in the three datasets,
their vertices have good spatial and social distributions, i.e., the vertices with higher
social cohesiveness also tend to have spatial closeness with each other. Therefore, the
pruning effectiveness becomes less significant when k is high. Actually, similar trends
occur in Gowalla and Brightkite if we further increase the value of k.
Effect of d. Figure 3-8 (b) shows the vertex pruning effectiveness when we change
the value of d. For all datasets, the pruning effectiveness decreases as d increases,
because a larger d relaxes the spatial cohesiveness constraint, so the explored
spatial network becomes larger and fewer vertices are filtered. However, when d is
in the interval of 1500 to 2000 meters, our pruning technique prunes more than 50%
of the vertices on average, which makes EffiExact run much faster than Exact in all
configurations.
Effect of pruning rules. Figure 3-9 (a) reports the vertex pruning ratios with the
left scale and time saved ratios with the right scale for pruning vertices (P1) and
edges (P2) in T ′, and pruning vertices in T (Truss) individually, or applying these
rules interchangeably by cascading pruning (CAS). The coloured bars correspond to
pruning ratios while the bars with hatches correspond to time saved ratios when
applying pruning rules in EffiExact. For all datasets, it shows that applying rules
individually has limited pruning effectiveness with less than 15% vertices pruned, and
less than 12% of time saved by P1(P1TS) and P2(P2TS), though a bit improved by
Truss. However, applying these rules interchangeably can prune much more vertices
59
and save much more time. For Gowalla and Weibo, over 60% of vertices can be
filtered out and more than 50% of time can be saved (shown by CAS(CASTS)).
3.5.2.2 Approximate Algorithms
Region pruning ratio. The region pruning ratio is defined as the ratio of the number
of regions pruned over the total number of candidate regions.
Figure 3-9 (b) shows the region pruning ratios on the left scale (time saved ratios
on the right scale) when applying Pruning Rules 3 and 4. The bars P31 (P31TS) and
P32 (P32TS) show the pruning ratios (time saved ratios) when applying Pruning Rule 3
in Apx1 and Apx2. On all datasets, Pruning Rule 3 is more effective when pruning
tight bound regions than when pruning upper bound regions. Moreover, the pruning
ratio of Pruning Rule 4 (shown by P4) outperforms Rule 3 when pruning tight bound
regions on all datasets, because Rule 4 considers truss residence regions.
Figure 3-10 (a) shows the upper bound region pruning ratios when varying the search
granularity for algorithm Apx1. Overall, the pruning ratio decreases as the search
granularity increases for all datasets, because a smaller search granularity makes
the regions more compact, i.e., regions consisting of small cells are closer to the
minimum bounding box. The pruning ratio for Foursquare is slightly worse than the
others, but the average pruning ratio is still over 0.5, i.e., more than 50% of the
regions are pruned by the proposed pruning techniques. Figure 3-10 (b) shows the
tight bound region pruning ratios when varying the search granularity for algorithm
Apx2; the trend is similar to that of upper bound region pruning.
Effect of granularity on approximation ratio. We demonstrate the correlation
between the theoretical and actual approximation ratios when we run the algorithms
Apx1 and Apx2 over all datasets, and show the trend when the search granularity
changes. They are plotted in Figures 3-11 (a) and (b). The x-axis in the figures
represents the calculated theoretical approximation ratios according to search gran-
ularity. For Apx1, the theoretical approximation ratios are calculated as 2.9, 2.97,
3.11, 3.4, and 3.96. For Apx2, the theoretical approximation ratios are calculated
[Figure: region filtering ratio (y-axis) vs. search granularity in metres (x-axis, 100 to 1600) for Gowalla, Brightkite and Foursquare; panel (a) Apx1, panel (b) Apx2]
Figure 3-10: Region pruning ratio
[Figure: actual approximation ratio (y-axis) vs. theoretical approximation ratio (x-axis) for Gowalla, Brightkite and Foursquare; panel (a) Apx1, panel (b) Apx2]
Figure 3-11: Approximation ratio
as 1.48, 1.56, 1.70, 1.98, and 2.55. From the experimental results, we can see that
the real approximation ratios of Apx1 and Apx2 are much smaller than theoretical
approximation ratios. For both algorithms, their real approximation ratios increase
as the search granularity increases. The approximation ratio of Apx1 increases almost
linearly as the theoretical approximation ratio increases. For Apx2, we can observe a
sudden increase for all datasets once the search granularity goes beyond 400 metres. Within
400 metres, the actual approximation ratios are all less than 1.4. Therefore,
in real applications we may need to tune the search granularity to balance search
efficiency and effectiveness.
[Figure: normalised spatial density (y-axis) on Gowalla, Brightkite, Foursquare, Weibo and Twitter, comparing Exact, Apx1, Apx2, AdvMax, SAC and GeoModu]
Figure 3-12: Density study
3.5.2.3 Spatial Density
We verify the spatial closeness of the maximum co-located community found by our
algorithms Exact, Apx1 and Apx2 by comparing with the state-of-the-art spatial-aware
community models: SAC [34], GeoModu [17] and AdvMax [109].
SAC. It finds a k-core G′ containing a query vertex, whose vertices are spatially
contained in a minimum covering circle with the smallest radius.
GeoModu. It refines the weight of each edge e_{u,v} in graph G as 1/d_{u,v}^η, where
d_{u,v} is the normalised spatial distance from u to v and η is a decay factor. It then detects
the communities using modularity maximisation.
AdvMax. It finds the maximum (k, r)-core, which is a k-core whose vertices are pairwise
within spatial distance r, maximising the cardinality.
We refine the structural cohesiveness k-core in SAC using k-truss. For GeoModu,
we set η = 1 and select the community with the highest spatial density, defined as
follows.
Spatial density. Given a community J with spatial diameter d, the spatial density
of J is defined as (∑_{u,v∈V(J)} e^{−d(u,v)}) / (2|V(J)|).
For SAC, we randomly select 200 query vertices, set up the structural cohesiveness
as 11-truss, generate exact communities, compute spatial density and get the average
spatial density for each dataset. For GeoModu, we set η=1, detect all communities
and compute the average spatial density for each dataset. For Exact, Apx1, Apx2
(a) (4,1.5km)-MCC (b) (9,5km)-MCC
Figure 3-13: Case study
and AdvMax, we set k=11 (k=10 for AdvMax), randomly select 200 different query
distances between 500 and 2000 metres, generate maximum co-located communities,
compute spatial density and get the average spatial density for each dataset. All
results are normalised by (D − D_min)/(D_max − D_min), where D is an average spatial density and D_max and
D_min are the extremes.
Figure 3-12 shows that Exact performs the best and outperforms AdvMax due to
its structural tightness. AdvMax also performs reasonably well because both Exact
and AdvMax tend to enlarge the cardinality as much as possible for the given distance
threshold. As expected, the approximate algorithms Apx1 and Apx2 perform worse than
Exact but remain acceptable. As shown in the figure, Apx2 performs better than
AdvMax on all datasets except Weibo. Apx2 also performs better than Apx1 on
all datasets due to its lower error ratio. Compared with the above algorithms, SAC and
GeoModu perform much worse, mainly because they do not intend to include as many
vertices as possible. In particular, GeoModu produces the communities with the lowest spatial density
on all datasets and thus appears as 0 after normalisation.
3.5.2.4 Case Study
We conducted two case studies on Gowalla to demonstrate the effectiveness of (k,d)-
MCC. In contrast to connected k-truss, our models can ensure the spatial closeness
over community members.
Figure 3-13 (a) shows a community with k=4 and d=1.5km. All the members
are around Gothenburg University in Sweden. The community members are good
candidates for some meetup activities since (1) they have strong social relationships,
i.e., each member has 3 friends in the community and members who are not friends
are connected by their mutual friends; and (2) the longest distance between them is
bounded by 1.5km.
Figure 3-13 (b) shows a community with k=9 and d=5km. We can see that some
members around downtown Austin are very close to each other, while some members
on the outskirts are relatively distant from members in the downtown area. Removing
any of these relatively distant members makes the community collapse from a social
cohesiveness perspective. Although the query has d = 5km, the actual distance
between most members is less than 3.3km.
3.6 Conclusion
In this chapter, we study the maximum co-located community search problem in
large scale social networks. We propose a novel community model, co-located com-
munity, considering both social and spatial cohesiveness. We develop efficient exact
algorithms to find all maximum co-located communities. We design approximation
algorithms with guaranteed spatial error ratios. We further improve the performance
using the proposed TQ-tree index. We conduct extensive experiments on large real-world
networks, and the results demonstrate the effectiveness and efficiency of the proposed
algorithms.
Chapter 4
Contextual Community Search
In this chapter, we propose a novel parameter-free community model called contextual
community, for searching a community in an attributed graph. The proposed model
only requires a query context providing a set of keywords describing the desired com-
munity context, where the returned community is designed to be both structure and
attribute cohesive w.r.t. the query context. We show that both exact and approxi-
mate contextual community can be searched in polynomial time. The best proposed
exact algorithm bounds the running time by a cubic factor or better using an elegant
parametric maximum flow technique. The proposed 1/3-approximation algorithm
significantly improves the search efficiency. In the experiment, we use six real networks
with ground-truth communities to evaluate the effectiveness of our contextual com-
munity model. Experimental results demonstrate that the proposed model can find
near ground-truth communities. We test both our exact and approximate algorithms
using twelve large real networks to demonstrate the high efficiency of the proposed
algorithms.
Chapter map. In Section 4.1, we give an overall introduction to the problem of
contextual community search. Section 4.2 presents the contextual community (CC)
model and CC search problem. Sections 4.3, 4.4 and 4.5 present two network flow
based exact algorithms that are designed to solve CC search in polynomial time. Section 4.6 presents an approximate solution (with an approximation ratio of 1/3) that significantly improves over the runtimes of the exact algorithms.
Experimental results are shown in Section 4.8, followed by the chapter summary in
Section 4.9.
4.1 Introduction
We propose a novel parameter-free community model, namely the contextual community, that only requires a query to provide a set of keywords describing an application/user context (e.g. location and preference). Users can thus search for desired
communities without detailed information about them, in contrast to existing
community search models, which additionally require a set of known query vertices as well as
community cohesiveness parameters (e.g. k as the minimum vertex degree).
Still, the returned contextual community is designed to be both structure
and context cohesive.
Given the query context, the most popular cohesive measurements like k-core and
k-truss are often unsuitable. On one hand, there exists an inherent contradiction that
a larger k value may imply a smaller community to be found that is potentially in-
sufficient to match the provided context. On the other hand, when the context match
is emphasised, vertices (edges) may very likely fail to meet the minimum requirement
on the number of neighbours (common neighbours) of k-core (k-truss). More-
over, imposing the k bound on the community can be deemed to be inflexible and
deciding the best k that satisfies both context and structure requirements is chal-
lenging. Therefore, for seeking a proper contextual community we instead adopt the
notion of relative subgraph density, which is parameter-free and relates the number
of considered motifs/units to the number of their induced vertices. The search goal
is then reduced to finding a subgraph having the highest density. However, as shown
in [97] if the considered motifs or community signatures are simply edges, the found
community might be large and not absolutely cohesive. On the other hand, [97] shows
that triangle motifs would be better signatures to produce a truly cohesive subgraph,
but in this case size of the returned community might drop dramatically, thereby
affecting the desirable vertex coverage. To alleviate such shortcomings, we instead
account for both involved triangle and edge motifs as a unified density measure to
suitably explain the structure cohesiveness of a contextual community.
As discussed previously, in real applications, it would be desirable that, by sim-
ply accepting a set of keywords about the desired community context, a community
search system is able to find a community that is highly relevant to the provided con-
text. Intuitively, this means that vertices with attributes close to the context shall
be considered as members of the desired community in an attributed graph. However, overemphasising the correlation between found vertex attributes and the query
context may cause the search to return a small and loosely connected subgraph.
This actually goes against the structural requirement of being a community and becomes another popular research topic, keyword search [9, 47, 57, 66, 81, 58]. To avoid
such a pessimistic situation, we can relax context cohesiveness by tolerating community vertices that are themselves less relevant to the query context but whose
surrounding vertices are more relevant. As shown in Section 4.2.2, this relaxation
is naturally achieved with triangle and edge (the aforementioned subgraph density
motif) contextual scores/weights aggregated from nodal contextual relevance. Notice
that our weighted motif (triangle and edge) measure ensures relaxed but strong in-
ternal context cohesiveness in a community since all the involved units are matched
against the query context.
Contextual community. Based on the desiderata of contextual community search,
we propose a weighted density based contextual community model. First, a contex-
tual weight is assigned to each small motif of a subgraph. It measures the context
prevalence of a motif. Then, the context weighted density of a subgraph is calculated
as the aggregated weight over all motifs divided by the total number of vertices
in the subgraph. Finally, the subgraph with the highest contextual weighted density
w.r.t. the query context is returned as the best community. Notice that the intuition
behind our contextual community model is: every designated community member
should be involved in many structurally overlapped edge and triangle motifs which
are themselves prevalent in the specified query context. In real life, these units are
analogous to mutual friendships and family circles.
Although our proposed community model is based on weighted densest subgraph,
existing exact and approximation algorithms running in polynomial time only work
on separate density functions, i.e. weighted/unweighted degree density function or
unweighted triangle density function. For our more complicated contextual commu-
nity search, building on the theoretical frameworks of flow networks and approximation
algorithms, we confirm that efficient algorithms indeed exist, both in theory and in
practice.
More precisely, given a graph G and a set of query attributes Q, our first approach
carefully constructs a flow network N that guides the community search. Together
with binary search probing, the approach in total runs in time O(|V (N)|3 log(|V (G)|))
where V (.) and E(.) define vertex and edge sets respectively and N is a constructed
flow network. By formulating the contextual community search as an optimisa-
tion problem, we then construct a different flow network N with parameters help-
ing a monotonic search of contextual community. Along this second approach, we
manage to avoid a pitfall implementation running in O(|V(G)||V(N)|³) with an elegant parametric maximum flow technique. This technique eventually turns the
runtime into O(|V(N)|³) or better. Note that the aforementioned runtime com-
plexities are worst-case guarantees while in practice they are also very much reduced
with the consideration of query context. To achieve even extra runtime scalability,
we also propose a fast 1/3-approximation algorithm. The algorithm can run in time
O(|V(G)| log(|V(G)|) + |E(G)| log(|V(G)|) + |Tri(G)|) with simple degree and triangle
indices.
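To give a feel for the approximation side, the following is a minimal sketch of greedy peeling in the spirit of Charikar's densest-subgraph heuristic, here applied to edge weights only. This is an illustration rather than the thesis's 1/3-approximation algorithm, which additionally accounts for triangle weights and uses degree and triangle indices; the function name and data layout are our own.

```python
# A sketch of greedy peeling: repeatedly remove the vertex with the smallest
# incident (context) weight and remember the intermediate vertex set with the
# best weighted edge density. Function names and data layout are ours.
def greedy_peel(vertices, edges, w):
    """edges: set of frozenset pairs; w: edge -> non-negative context weight."""
    adj = {v: set() for v in vertices}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)

    def incident(x, alive):
        return sum(w[frozenset((x, u))] for u in adj[x] if u in alive)

    alive = set(vertices)
    total = sum(w[e] for e in edges)           # total weight of surviving edges
    best_set, best_density = set(alive), total / len(alive)
    while len(alive) > 1:
        v = min(alive, key=lambda x: incident(x, alive))
        total -= incident(v, alive)
        alive.remove(v)
        density = total / len(alive)
        if density > best_density:
            best_set, best_density = set(alive), density
    return best_set, best_density
```

For the plain weighted edge-density objective, this style of peeling is known to give a 1/2-approximation; the thesis's combined edge-and-triangle objective requires a more careful analysis, yielding the 1/3 bound mentioned above.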
4.2 Problem Definition
4.2.1 Preliminary
Graph with attributes. An attributed graph is denoted as G = (V,E,A), where
V(G), E(G) and A denote the set of vertices in G, the set of edges in G, and the set of
attributes, respectively. Each vertex v ∈ V(G) is attached with a set of attributes A(v) ⊆ A.
[Figure: (a) an attributed graph of 14 vertices, each labelled with an attribute set drawn from {k1, k2, k3}; (b) the contextual community found for Q = {k1, k2, k3}; (c) the community found by the triangle density model; (d) the community found by the degree density model]
Figure 4-1: Example
Given v ∈ V (G), deg(v,G) denotes the degree of v in G and N(v,G) denotes the
neighbours of v in G.
Triangles in graphs. A triangle in G is a cycle of length 3. A triangle induced on
vertices u, v, w ∈ V(G) is denoted as △_{uvw}; when these vertices are not specified
we omit the subscript. Given a subgraph H ⊆ G, Tri(H) denotes the set of triangles
in H.
Degree density. Given a subgraph H ⊆ G, the degree density of H is defined as
ρ(H) = |E(H)| / |V(H)|. We call it degree density because |E(H)| = (∑_{v∈V(H)} deg(v,H)) / 2.

Triangle density. Given a subgraph H ⊆ G, the triangle density of H is defined as
ρ△(H) = |Tri(H)| / |V(H)|.
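The two density measures can be sketched in a few lines of Python. The naive triangle enumeration below is only intended for small illustrative graphs; the function names are our own.

```python
from itertools import combinations

def degree_density(vertices, edges):
    """rho(H) = |E(H)| / |V(H)|."""
    return len(edges) / len(vertices)

def triangles(vertices, edges):
    """Naively enumerate triangles: every vertex triple whose three pairs
    are all edges. Fine for small examples, cubic in |V| in general."""
    eset = {frozenset(e) for e in edges}
    return [t for t in combinations(sorted(vertices), 3)
            if all(frozenset(p) in eset for p in combinations(t, 2))]

def triangle_density(vertices, edges):
    """rho_triangle(H) = |Tri(H)| / |V(H)|."""
    return len(triangles(vertices, edges)) / len(vertices)
```

For example, the complete graph K4 has 6 edges and 4 triangles on 4 vertices, giving degree density 1.5 and triangle density 1.0.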
4.2.2 Problem definition
As discussed previously, the found CC shall be large, cohesive and relevant to the query
context. We address these perspectives by proposing a context weighted density
function.
Our CC model is inspired by observations of real communities. A community in
real world consists of small units like family circles and mutual relationships. Overlapping small units together form a bigger community, where the overlaps can be both
structural and contextual/attribute-based. Edge and triangle motifs in a social network are employed as our basic units. To capture both kinds of overlap and ensure that the
obtained community satisfies a given query context, these basic units are assigned
with context scores measuring their prevalence of the query context. A group of users is
considered a contextual community if, on average, its members are involved in many basic
units that are rich in the given query context.
In the following, we formally define the CC model and search problem starting
from defining context scores of the basic motifs, i.e. scores of edge and triangle motifs
in a subgraph H ⊆ G.
Definition 4.2.1. Context score. Given motifs (u, v) ∈ E(H) and (u, v, w) ∈ Tri(H),
and a set of attributes Q as the query context, the context score functions are as follows.

• Edge context score: w(e(u, v)) = |Q ∩ A(u)| + |Q ∩ A(v)|

• Triangle context score: w((u, v, w)) = ∑_{e ∈ {(u,v),(u,w),(v,w)}} w(e)
Intuitively, the score should reward motifs that in union cover more of the query context
and that contain vertices carrying more of the query context. The defined context score
satisfies both intuitions. For the first intuition, an example is as follows. Given query
Q = {k1, k2, k3}, triangle (2, 8, 9) in Figure 4-1 (a) is superior over triangle (3, 4, 11)
since all query contexts are covered by (2, 8, 9), which can be differentiated by the
defined context score. On the other hand, an example for the second intuition is:
given the same query Q, the defined score also rewards triangle {2, 8, 9} a higher
score compared to triangle {8, 12, 14}, which makes sense since all vertices in {2, 8, 9}
cover all attributes whereas in {8, 12, 14} only vertex 8 covers all attributes. An
alternative score function, |(∪_{v∈motif} A(v)) ∩ Q|, only encourages motifs covering more of the
query context but fails to reward motifs containing more vertices that have query
context; it cannot differentiate triangles such as {2, 8, 9} and {8, 12, 14}.
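Definition 4.2.1 is straightforward to compute; the following is a minimal sketch, where the function names and the attribute dictionary layout are our own.

```python
def edge_score(attrs, Q, u, v):
    """Edge context score: w(e(u, v)) = |Q ∩ A(u)| + |Q ∩ A(v)|."""
    return len(Q & attrs[u]) + len(Q & attrs[v])

def triangle_score(attrs, Q, u, v, w):
    """Triangle context score: the sum of the scores of its three edges."""
    return (edge_score(attrs, Q, u, v)
            + edge_score(attrs, Q, u, w)
            + edge_score(attrs, Q, v, w))
```

Since each vertex of a triangle appears in two of its edges, the triangle score equals twice the sum of the per-vertex matches |Q ∩ A(·)|, which is exactly why a triangle of fully matching vertices outscores one with a single matching vertex.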
Next, we define the context weighted density of a subgraph H ⊆ G based on context
scores.
Definition 4.2.2. Context weighted density. Given a set of query attributes, the
attribute score function, and H ⊆ G with Tri(H) containing the set of triangles in H,
the context weighted density of H, denoted by AD(H), is defined as:

AD(H) = ( ∑_{△∈Tri(H)} w(△) + ∑_{e∈E(H)} w(e) ) / |V(H)|.    (4.1)
The context weighted density mixes weighted triangle density and weighted edge
density. Its advantages are twofold:
Size versus cohesiveness. Given a subgraph H, the density function considers both
edges and triangles in H. If edges are considered only, it is likely to fail to detect
cohesive communities according to the observations in [97]. If triangles are considered
only, we may find very dense subgraphs while missing vertices that are only involved
in edges, or vertices that are involved in a relatively small number of triangles, which
in consequence may penalise the size of the found H. This may contradict the
purpose of community search.
High adaptability. Given a query context, vertices that are related to the query con-
text may be loosely or intensively connected. An ideal cohesive measurement for CC
shall be adaptable to these situations. When vertices are loosely connected, edge
density part of the context weighted density will favour more related edges being
included so that density AD(H) can increase. When vertices are intensively con-
nected, the triangle score part will help to incorporate the dense parts and ensure the
effectiveness.
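Putting the pieces together, Eq. (4.1) can be evaluated naively as follows. This is a brute-force sketch meant only for small examples; the function names and data layout are our own.

```python
from itertools import combinations

def context_weighted_density(vertices, edges, attrs, Q):
    """AD(H) from Eq. (4.1): (sum of triangle scores + sum of edge scores)
    divided by |V(H)|. Triangle enumeration is brute force."""
    score = lambda u, v: len(Q & attrs[u]) + len(Q & attrs[v])
    eset = {frozenset(e) for e in edges}
    edge_part = sum(score(*tuple(e)) for e in eset)
    tri_part = 0
    for u, v, w in combinations(sorted(vertices), 3):
        if all(frozenset(p) in eset for p in combinations((u, v, w), 2)):
            # triangle score is the sum of its three edge scores
            tri_part += score(u, v) + score(u, w) + score(v, w)
    return (tri_part + edge_part) / len(vertices)
```

For a single triangle whose three vertices each carry the one queried attribute, every edge scores 2, the triangle scores 6, and AD = (6 + 6) / 3 = 4.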
Problem 4.2.1. CC search. Given a social network graph G and a set of query
attributes Q, return a CC modelled as a subgraph H ⊆ G satisfying:
• H is connected.
• There is no connected H ′ ⊆ G such that AD(H ′) > AD(H).
Example. An attributed graph is shown in Figure 4-1 (a), which is adopted from [51].
Given Q = {k1, k2, k3}, applying CC search we get the CC shown in Figure 4-1 (b).
The vertices containing all query contexts are included in the found CC. Compared
to CC, if using weighted triangle density only, we get a smaller community, as shown
in Figure 4-1 (c), missing vertex 9 that covers all query attributes. On the other hand,
if using degree density only, the found community is shown in Figure 4-1 (d), which
includes vertex 3 that covers only one query attribute.
4.2.3 Why contextual community search
Compared to existing works, our CC model and search are designed to be a general
framework with several advantages, as follows.
• CC employs subgraph density as a parameter-free cohesiveness measure, which
avoids specifying k as in k-core, k-truss etc., and is thus easy to use.

• CC balances the cohesiveness of the found community by considering both triangles and edges. Triangles reflect the denser nature of the community, similar to
k-truss, while edges allow flexibility, somewhat similar to k-core.

• CC search avoids certain bad query effects [53], i.e., the search returning an empty set
or a very loosely connected subgraph because of inappropriate query input. That
is because CC simultaneously models structural and contextual cohesiveness.

• CC search indeed finds near ground-truth communities, as shown in the experimental studies, if the user-provided query context is close to the attributes contained in
a ground-truth community.
4.2.4 Pre-prunings
Before describing the main CC search algorithms, we first show a simple yet effective
pruning rule, which helps us quickly exclude vertices that are irrelevant to CC.
Pruning Rule 4.2.1. Given a vertex v ∈ V(G) and a query context Q, v can
be pruned if the following two conditions are satisfied simultaneously.

• A(v) ∩ Q = ∅.

• For each u ∈ N(v,G), A(u) ∩ Q = ∅.
The correctness of Pruning Rule 4.2.1 is immediate: given any H, removing
vertices conforming to the pruning rule will not decrease the numerator of AD(H)
but will reduce its denominator.
Applying the pruning rule, we can easily filter out irrelevant vertices and, at the same
time, compute the contextual scores. As a result, the input graph may be divided into
disjoint subgraphs; nevertheless, the algorithms discussed in the following can still
be applied. Then, after the pruning, the global optimal CC can be derived much more
efficiently.
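A sketch of Pruning Rule 4.2.1 in Python follows; the function name and data layout are our own, with edges given as unordered pairs.

```python
def prune_irrelevant(vertices, edges, attrs, Q):
    """Pruning Rule 4.2.1: drop v when neither v nor any neighbour of v
    carries a query attribute; such a vertex can only lower AD of any
    subgraph containing it. Returns the kept vertices and edges."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    keep = {v for v in vertices
            if attrs[v] & Q or any(attrs[u] & Q for u in adj[v])}
    kept_edges = {(u, v) for u, v in edges if u in keep and v in keep}
    return keep, kept_edges
```

On a path 1-2-3-4 where only vertex 1 carries a query attribute, vertex 2 survives through its neighbour 1 while vertices 3 and 4 are pruned.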
4.3 A Flow Network Based Approach
In this section, we first present a polynomial time exact algorithm for finding CC with
a carefully constructed auxiliary flow network. Before introducing the algorithm, we
revisit the preliminaries about flow networks.
4.3.1 Flow Network Preliminaries
Directed flow network. The flow network considered in this chapter is a directed
graph N = (V,E) with a set of nodes V and directed edge set E, having a unique
source node s, a unique sink node t, and a non-negative capacity c(u, v) for every
directed edge (u, v). Note that we prefer using the term node when discussing about
73
flow network, and vertex in the context of social network. Following the flow net-
work convention, we extend the capacity function to arbitrary node pairs by defining
c(u, v) = 0, if (u, v) /∈ E(N), which implies f(u, v) = 0 if (u, v) /∈ E(N).
Flow. A flow f on N is a real-valued function on node pairs satisfying three constraints.
First, the capacity constraint: f(u, v) ≤ c(u, v) for every (u, v) ∈ V(N) × V(N).
Second, the anti-symmetry constraint: f(u, v) = −f(v, u) for every (u, v) ∈ V(N) × V(N).
Third, the conservation constraint: ∑_{u∈V(N)} f(u, v) = 0 for every v ∈
V(N) \ {s, t}. The value of the flow f is defined as ∑_{v∈V(N)} f(v, t). A maximum
flow is a flow of maximum value.
s-t cut. If S and T are two disjoint node subsets such that S ∪ T = V(N) and
S ∩ T = ∅, then the capacity or cut value across the cut (S, T) is c(S, T) =
∑_{u∈S, v∈T} c(u, v). (S, T) is an s-t cut if the source node s ∈ S and the sink node
t ∈ T. A minimum or min s-t cut is one with the minimum cut value c(S, T).
Algorithm 4: Binary search based algorithm

Data: G
1  low ← AD(G), Ho ← ∅;
2  high ← |Q| × (6·C(|V(N)|, 3) + 2|V(N)|²) / |V(N)|;
3  while high − low ≥ min_interval do
4      mid ← (high + low) / 2;
5      H′o ← tryOpt(N, mid);
6      if H′o ≠ ∅ then
7          Ho ← H′o;
8          low ← mid;
9      else
10         high ← mid;
11 return Ho;
Maximum flow and min s-t cut. If f is a flow of N, the flow across an s-t cut
(S, T) in N is f(S, T) = ∑_{u∈S, v∈T} f(u, v). The conservation constraint indicates that
the flow across any cut is equal to the flow value of f. The capacity constraint indicates
that for any flow f and cut (S, T), f(S, T) ≤ c(S, T) shall hold, which implies that the
value of a maximum flow is equal to the capacity of a minimum cut, i.e. the max-flow
min-cut theorem.
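For concreteness, here is a compact Edmonds-Karp sketch that returns both the max-flow value and the S side of a min s-t cut, obtained as the nodes reachable from s in the final residual graph. This is a generic textbook routine for illustration, not the min-cut solver used in the thesis.

```python
from collections import defaultdict, deque

def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp max flow. `cap` maps u -> {v: capacity}. Returns the
    max-flow value and the S side of a min s-t cut (nodes reachable from s
    in the final residual graph), by the max-flow min-cut theorem."""
    res = defaultdict(dict)                      # residual capacities
    for u in cap:
        for v, c in cap[u].items():
            res[u][v] = res[u].get(v, 0.0) + c
            res[v].setdefault(u, 0.0)            # reverse residual edge
    flow = 0.0
    while True:
        parent = {s: None}                       # BFS: shortest augmenting path
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, r in res[u].items():
                if v not in parent and r > 1e-12:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        path, v = [], t                          # recover path and bottleneck
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= aug
            res[v][u] += aug
        flow += aug
    S, q = {s}, deque([s])                       # residual reachability from s
    while q:
        u = q.popleft()
        for v, r in res[u].items():
            if v not in S and r > 1e-12:
                S.add(v)
                q.append(v)
    return flow, S
```

The returned S is exactly the part of the cut that, in the constructions below, is used to read off a candidate CC.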
4.3.2 Algorithm Framework
Intuition. We design an auxiliary flow network such that 1) part S of a min s-t cut
(S, T ) contains a candidate CC and 2) the min s-t cut can guide us how a guessed
context/attribute weighted density of CC eventually arrives at the optimal context
weighted density. With such auxiliary flow network, by solving a sequence of min s-t
cut problems, we approach the optimal density by iteratively guessing its value with
a half-interval method that is analogous to binary search. We carefully study the
stop condition and correctness of the algorithm so that the candidate CC from the
last computed min s-t cut is guaranteed to be CC.
Major steps. Algorithm 4 shows a binary search based framework for searching CC.
In each iteration (lines 4 to 11), the algorithm guesses a context weighted density,
denoted by mid, in a binary search manner, and tries to find a candidate CC with
a closer density to the optimum. Such candidate CC is found by computing the
minimum s-t cut on the flow network to be introduced later.
Function tryOpt. This function takes the carefully designed flow network N and
a newly guessed attribute-weighted density mid as parameters. It first updates the
flow network so that the previously guessed context weighted density is replaced by
mid. The update operation will be discussed in great detail later. The function
then calculates the min s-t cut for the updated flow network, generates a candidate
CC from nodes in part S, and returns a new candidate CC H′o, which is the S-induced
subgraph of G. The binary search proceeds depending on whether H′o is empty (we will
prove its correctness later). At this moment, we treat the min s-t cut solver as a black
box since we are not modifying it. We will revisit the min cut algorithm for our improved
approaches later.
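The control flow of Algorithm 4 can be sketched as follows. Here tryOpt is replaced by a toy stand-in that searches a fixed set of candidates with known densities, whereas the real function updates the flow network N and computes a min s-t cut; the names and layout are our own.

```python
def binary_search_cc(candidates, densities, min_interval=1e-6):
    """Skeleton of Algorithm 4 with a toy tryOpt: it returns the densest
    candidate whose density exceeds the guess `mid`, or None. In the thesis,
    tryOpt instead updates the flow network N and computes a min s-t cut."""
    def try_opt(mid):
        above = [c for c in candidates if densities[c] > mid]
        return max(above, key=lambda c: densities[c]) if above else None

    low, high = min(densities.values()), max(densities.values()) + 1.0
    best = None
    while high - low >= min_interval:
        mid = (high + low) / 2
        h = try_opt(mid)
        if h is not None:
            best, low = h, mid     # a better candidate exists: raise the guess
        else:
            high = mid             # guess too high: lower it
    return best
```

The interval halves each iteration, so the loop terminates after O(log((high − low)/min_interval)) calls to tryOpt, mirroring the binary-search bound used in the complexity analysis.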
Similar ideas to ours have been applied in graph mining [42, 97, 61] for finding
degree densest subgraph, triangle densest subgraph, directed densest subgraph etc.
However, adopting flow theory for CC search is non-trivial, where the challenges are
1) designing an appropriate flow network that satisfies the desired intuition of CC and
binary search (correctness), and 2) bounding the algorithm's runtime complexity
[Figure: (a) a sample network of six vertices labelled with attribute sets drawn from {k1, k2, k3}; (b) the degree-based flow network with a dashed line marking the cut S = {s, 1, 3, 4, 6}, T = {t, 2, 5}; (c) the triangle-based flow network with black and red dashed lines marking candidate cuts]
Figure 4-2: Warm-up flow network illustrations
according to CC definitions (efficiency). In the following subsections, we present the
proposed auxiliary flow network in detail, prove the correctness of Algorithm 4 and
analyze the algorithm’s time complexity.
4.3.3 Warm-up for Flow Network Construction
Consider relaxing the objective of flow network N to take into account either edge
or triangle weights, and let S of the min s-t cut always contain a candidate CC. To
simplify the discussion, given a cut (S, T) for N, let X_S (X_T) denote the set of nodes
in N representing vertices in G belonging to S (T), and let Y_S (Y_T) denote the set of
nodes in N representing triangles in G contained in S (T). The relaxed objective AD
can be expressed as AD(G(X_S)) = ∑_{e∈G(X_S)} w(e) or ∑_{△∈Tri(G(X_S))} w(△), where
(S, T) is the min cut of N. In the following, we establish the connection between the cut
value c(S, T) and each relaxation of AD.
The first relaxation is as follows: given a graph G, construct a flow network
such that any cut (S, T) with S \ {s} ≠ ∅ has cut value c(S, T) tied to the objective value
∑_{e∈G(X_S)} w(e). Intuitively, ∑_{e∈G(X_S)} w(e) equals the difference between
∑_{v∈X_S} ∑_{u∈N(v,G)} w((v,u))/2 and ∑_{v∈X_S} ∑_{u∈X_T} w((v,u))/2, in which
w((v,u)) = 0 if (v,u) ∉ E(G).
To relate cut and objective values, we construct the network as follows. For each
vertex in G we create a node for N and we add a single s and t to N as well.
W.l.o.g., the discussion in this subsection assumes there is an edge from s to every
node in N having capacity large enough that it does not affect the min cut
discussed below. Adding legitimate capacities to these outgoing edges from the source
node s will be discussed in the next subsection. To ensure the expression of c(S, T)
contains ∑_{v∈X_S} ∑_{u∈N(v,G)} w((v,u))/2, for each node v we create a directed edge to
t with capacity of −∑_{u∈N(v,G)} w((v,u))/2. To ensure c(S, T) contains ∑_{v∈X_S} ∑_{u∈X_T} w((v,u))/2,
for each edge in G with involved vertices u and v, we create
two directed edges (u, v) and (v, u) with the same capacity of w((u,v))/2. As a result,
given any cut (S, T), c(S, T) will be expressed as −∑_{v∈X_S} ∑_{u∈N(v,G)} w((v,u))/2 +
∑_{v∈X_S} ∑_{u∈X_T} w((v,u))/2, which is the negation of ∑_{e∈G(X_S)} w(e). We demonstrate such
a constructed flow network in Figure 4-2 (b) for the sample graph in Figure 4-2 (a), given
a query Q = {k1, k2, k3}. The cut denoted by the dashed line partitions the network into
S = {s, 1, 3, 4, 6} and T = {t, 2, 5} and breaks the edges of interest {(4, 2), (4, 5)} that
are directed from S to T. The X_S-induced subgraph of G contains V(G(X_S)) = {1, 3, 4, 6},
which has objective value ∑_{e∈G(X_S)} w(e) = 24. The cut value c(S, T) = −25 + 1 =
−24, which is the negation of 24.
The second, more complicated relaxation involves triangles. Given
a graph G, construct a flow network such that its c(S, T), where (S, T) is a min
cut and S \ {s} ≠ ∅, is associated with ∑_{△∈Tri(G(X_S))} w(△). Similar to before,
∑_{△∈Tri(G(X_S))} w(△) can be expressed as the difference between ∑_{v∈X_S} ∑_{△∈Tri(v,G)} w(△)/3
and ∑_{u∈X_S} ∑_{v,w∈X_T} w((u,v,w))/3 + ∑_{u,v∈X_S} ∑_{w∈X_T} 2w((u,v,w))/3, where w((u,v,w)) = 0 if (u,v,w) ∉
Tri(G). To achieve that we construct the network as follows. For
each vertex in G we create a vertex node for N, for each triangle in G we create a
triangle node for N, and we create s and t as well. Similar to the first relaxation,
we create edges from s to each vertex node and assume their capacities are large
enough. To ensure c(S, T) contains ∑_{v∈X_S} ∑_{△∈Tri(v,G)} w(△)/3, for each vertex node
v we create a directed edge to t with capacity of −∑_{△∈Tri(v,G)} w(△)/3. To ensure
c(S, T) contains ∑_{u∈X_S} ∑_{v,w∈X_T} w((u,v,w))/3, we create directed edges as follows. For
each triangle node we create three directed edges, one to each vertex node involved in the
triangle, with capacity of 2w(△)/3. For each vertex node, we create a directed edge with
capacity of w(△)/3 to each triangle node it is involved in. Notice that with such a configuration,
only the minimum cut (S, T) ensures that c(S, T) contains ∑_{u∈X_S} ∑_{v,w∈X_T} w((u,v,w))/3 +
∑_{u,v∈X_S} ∑_{w∈X_T} 2w((u,v,w))/3. This property will be formally proven in Lemma 4.3.1.
As an example, Figure 4-2(c) is alternatively constructed for the sample graph in Figure 4-2(a). The cut containing vertex nodes $\{1, 3, 4, 6\}$ could be the one indicated by either the black or the red dashed line. Clearly, the cost of the black dashed line is lower by $\frac{2}{3}$, and in fact it is the min cut, which correctly reflects that node $4$ loses $\frac{2}{3}$ weight because the cut only breaks one count of the triangle $\{4, 2, 5\}$ aggregated by these vertices in the sample graph. The $\{1, 3, 4, 6\}$-induced subgraph of $G$ has $\sum_{\triangle \in Tri(G(X_S))} w(\triangle) = 28$. The min cut of $N$, $(S, T)$, denoted by the black dashed line, has $c(S, T) = -\frac{56}{3} - 10 + \frac{2}{3} = -28$, which is the negation of $\sum_{\triangle \in Tri(G(X_S))} w(\triangle)$.
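The triangle-weight decomposition used in this relaxation can be sanity-checked numerically. The sketch below (Python; the weighted triangle set is hypothetical and all names are illustrative) verifies that summing per-vertex thirds and subtracting the cut-crossing terms recovers the triangle weight of the induced subgraph:

```python
# Hypothetical tiny instance: triangles as weighted vertex triples.
triangles = {frozenset({1, 2, 3}): 3.0, frozenset({2, 3, 4}): 6.0,
             frozenset({1, 3, 4}): 9.0}

def tri_weight_induced(xs):
    # Direct sum of weights of triangles fully inside xs.
    return sum(w for t, w in triangles.items() if t <= xs)

def tri_weight_decomposed(xs):
    # Per-vertex thirds minus the terms broken by the cut.
    per_vertex = sum(w / 3 for t, w in triangles.items()
                     for v in t if v in xs)
    one_in = sum(w / 3 for t, w in triangles.items()
                 if len(t & xs) == 1)        # one vertex kept in xs
    two_in = sum(2 * w / 3 for t, w in triangles.items()
                 if len(t & xs) == 2)        # two vertices kept in xs
    return per_vertex - one_in - two_in

assert abs(tri_weight_induced({1, 2, 3})
           - tri_weight_decomposed({1, 2, 3})) < 1e-9
```

The same identity holds for any vertex subset, which is exactly what the capacities $\frac{w(\triangle)}{3}$ and $\frac{2w(\triangle)}{3}$ encode in the network.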
Through combining the above constructed networks, i.e., creating a union of edges and nodes and combining the capacities on the edges from vertex nodes to $t$, we are able to get a flow network whose min cut $(S,T)$ has $c(S,T)$ containing $\sum_{e \in E(G(X_S))} w(e)$ and $\sum_{\triangle \in Tri(G(X_S))} w(\triangle)$, which is very close to $AD(G(X_S))$. One challenge remains: most existing min cut algorithms do not allow negative capacities. To make $N$ contain only positive capacities while its min cut still involves $AD(G(X_S))$, we modify the construction as follows. For the edge from each vertex node of $N$ to $t$, we add/enlarge the previously combined capacity by $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) + g$, where $g$ acts as our guessed optimal contextual weighted density. By doing so, $c(S,T)$ must contain the expression $|X_S|(g - AD(G(X_S)))$, whereas previously $c(S,T) = -|X_S| \cdot AD(G(X_S))$; either way it involves the contextual weighted density of the $X_S$-induced subgraph of $G$.

Until now, we have successfully constructed a flow network whose min cut $(S,T)$ can be used to derive a candidate CC $G(X_S)$. Next, we show the complete construction in Algorithm 5 by adding capacities from $s$ to vertex nodes, and explain how the obtained min cut values direct the binary search.
4.3.4 CC Auxiliary Flow Network
In this subsection, following the warm-up intuitions and constructions we show how
the complete auxiliary flow network $N$ for CC is carefully constructed and tuned for directing the CC search.

Algorithm 5: Auxiliary flow network construction
Data: G
1  $V(N) \leftarrow \emptyset$, $E(N) \leftarrow \emptyset$, $C(N) \leftarrow \emptyset$;
2  create $s$ and $t$, $V(N) \leftarrow V(N) \cup \{s\} \cup \{t\}$;
3  foreach $\triangle \in Tri(G)$ or $v \in V(G)$ do create a node $n$, $V(N) \leftarrow V(N) \cup \{n\}$;
4  foreach $v \in V(N)$ do
5      if $v$ is a vertex in $V(G)$ then
6          create an edge to every triangle node $\triangle$ that $v$ participates in, with capacity $\frac{w(\triangle)}{3}$;
7          create an edge from $s$ to $v$ with capacity $\sum_{e \in E(G)} w(e) + \sum_{\triangle \in Tri(G)} w(\triangle)$;
8          create an edge from $v$ to $t$ with capacity $\sum_{e \in E(G)} w(e) + \sum_{\triangle \in Tri(G)} w(\triangle) + g - \sum_{u \in N(v,G)} \frac{w((v,u))}{2} - \sum_{\triangle \in Tri(v,G)} \frac{w(\triangle)}{3}$;
9          create an edge from $v$ to each node in $N(v)$ with capacity $\frac{w((u,v))}{2}$;
10     if $v$ is a triangle $\triangle \in Tri(G)$ then
11         create an edge to every vertex node of the triangle with capacity $\frac{2w(\triangle)}{3}$;
12 return $N$;
The whole construction of $N$ is displayed in Algorithm 5. In addition to a source node $s$ and a sink node $t$, the node set of $N$ contains two types of nodes, triangle nodes and vertex nodes, which correspond to the triangles and vertices of a social network $G$ (lines 2 to 3). Lines 4 to 11 assign capacities to the created directed edges. The capacity of every edge from $s$ to a vertex node is set to $\sum_{e \in E(G)} w(e) + \sum_{\triangle \in Tri(G)} w(\triangle)$. The reason is that for different guessed and tuned $g$, we want the cost of the min cut to always contain the term $|V(G)|(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e))$, which is important for proving the correctness of the binary search. The purposes of the other assigned capacities have been explained in the warm-up constructions. Next, we formally introduce and prove the min cut value of the constructed $N$.
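As a concreteness check, the per-vertex sink capacity of line 8 in Algorithm 5 can be sketched in Python (a hedged sketch: `edge_w` maps vertex pairs to weights, `tri_w` maps vertex triples to weights, and `g` is the current density guess; all names are illustrative):

```python
def vertex_to_sink_capacity(v, edge_w, tri_w, g):
    """Capacity of the edge (v, t) per line 8 of Algorithm 5:
    total graph weight + g - (v's incident edge weight)/2
                           - (v's incident triangle weight)/3."""
    total = sum(edge_w.values()) + sum(tri_w.values())
    inc_e = sum(w for e, w in edge_w.items() if v in e)
    inc_t = sum(w for t, w in tri_w.items() if v in t)
    return total + g - inc_e / 2 - inc_t / 3
```

Because the incident-weight terms are subtracted from the full graph weight plus $g$, the resulting capacity is never negative, which is exactly the purpose of the modified construction.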
Min s-t cut. We show it with a lemma as follows.
Lemma 4.3.1. The min s-t cut of $N$, denoted by $(S_m, T_m)$, must be of the form:

$$c(S_m, T_m) = |V(G)|\Big(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e)\Big) + |X_{S_m}|\big(g - AD(G(X_{S_m}))\big) \quad (4.2)$$
Proof sketch. The proof is conducted by showing the correctness of two auxiliary lemmas.

Firstly, we show how to express the min s-t cut of $N$. For the sake of simplicity, let $X$ denote the set of nodes in $N$ corresponding to vertices in $V(G)$, and let $Y$ denote the set of nodes in $N$ corresponding to triangles in $Tri(G)$. Accordingly, we use $X_S$ ($Y_S$) and $X_T$ ($Y_T$) to denote the sets of nodes of $X$ ($Y$) in the $S$ and $T$ parts after applying an s-t cut to $N$. Let $Tri_i(X_S)$ denote the set of triangles having exactly $i$ of their vertices in $X_S$. The lemma below shows the expression of the min s-t cut.
Lemma 4.3.2. Let $(S, T)$ be a minimum s-t cut of $N$; the capacity $c(S, T)$ must be of the form:

$$c(S,T) = \sum_{v \in X_T} c(s,v) + \sum_{v \in X_S} c(v,t) + \sum_{v \in X_S}\sum_{u \in X_T} c(v,u) + \sum_{\substack{(u,v,w) \in Tri_1(X_S) \\ u \in X_S}} c(u,(u,v,w)) + \sum_{\substack{(u,v,w) \in Tri_2(X_S) \\ u,v \in X_S,\ (u,v,w) \in Y_T}} \big(c(u,(u,v,w)) + c(v,(u,v,w))\big) + \sum_{\substack{(u,v,w) \in Tri_2(X_S) \\ u,v \in X_S,\ (u,v,w) \in Y_S}} c((u,v,w), w) \quad (4.3)$$
Proof sketch. Case 1: $X_T = \{t\}$. Then $c(S,T)$ equals $\sum_{v \in X_S} c(v,t)$ and the correctness of Lemma 4.3.2 is clear.

Case 2: $X_S = \{s\}$. Then $c(S,T)$ equals $\sum_{v \in X_T} c(s,v)$ and the correctness of Lemma 4.3.2 is clear.

Case 3: $T \setminus \{t\} \neq \emptyset$ and $S \setminus \{s\} \neq \emptyset$. The first line of the equation in Lemma 4.3.2 is clearly correct. When breaking a triangle $\triangle$, two situations may happen, corresponding to the remaining terms of Equation 4.3. Firstly, only one vertex $u$ is in $X_S$. In this situation, to get the min cut, the triangle node must be in $Y_T$; if not, we can always get a cut with smaller capacity by moving the triangle node from $Y_S$ to $Y_T$, and this cut capacity correctly reflects that $u$ loses one count of the triangle. Secondly, two vertices $u, v$ of $(u,v,w)$ are in $X_S$. The triangle node could be in either $Y_T$ or $Y_S$, and in both situations the designed $N$ correctly reflects that two counts of the triangle are lost from $u$ and $v$'s perspective. If the triangle node is in $Y_T$, the min cut breaks the two directed edges $(u, (u,v,w))$ and $(v, (u,v,w))$, which correctly accounts for the two lost counts. If the triangle node is in $Y_S$, since $w$ is not in $X_S$, the min cut breaks the directed edge $((u,v,w), w)$, which also correctly accounts for the two lost counts, since the capacity of an edge from a triangle node to a vertex node is twice the capacity from a vertex node to a triangle node; either placement yields the same cut cost. We can conclude that the equation in Lemma 4.3.2 expresses the form of a min s-t cut of $N$.
Substituting the capacities into Equation 4.3, we obtain the following lemma.

Lemma 4.3.3. $c(S, T)$ can be organised as:

$$c(S,T) = |V(G)|\Big(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e)\Big) + |X_S|\Big(g - \frac{\sum_{\triangle \in Tri(G(X_S))} w(\triangle) + \sum_{e \in E(G(X_S))} w(e)}{|X_S|}\Big) \quad (4.4)$$
Proof sketch. Firstly, by substituting the capacities into Equation 4.3, we obtain:

$$c(S,T) = (|X_S| + |X_T|)\Big(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e)\Big) + \sum_{v \in X_S} \Big(g - \sum_{u \in N(v,G)} \frac{w((u,v))}{2} - \sum_{\triangle \in Tri(v,G)} \frac{w(\triangle)}{3}\Big) + \sum_{v \in X_S} \sum_{u \in X_T} \frac{w((u,v))}{2} + \sum_{\triangle \in Tri_1(X_S)} \frac{w(\triangle)}{3} + 2\sum_{\triangle \in Tri_2(X_S)} \frac{w(\triangle)}{3}. \quad (4.5)$$
Now we show two equivalences. Firstly, $\sum_{v \in X_S} \sum_{u \in N(v,G)} \frac{w((u,v))}{2} - \sum_{v \in X_S} \sum_{u \in X_T} \frac{w((u,v))}{2}$ is equivalent to $\sum_{e \in E(G(X_S))} w(e)$. Secondly, $\sum_{v \in X_S} \sum_{\triangle \in Tri(v,G)} \frac{w(\triangle)}{3} - \sum_{\triangle \in Tri_1(X_S)} \frac{w(\triangle)}{3} - 2\sum_{\triangle \in Tri_2(X_S)} \frac{w(\triangle)}{3}$ is equivalent to $\sum_{\triangle \in Tri(G(X_S))} w(\triangle)$.
Using the two equivalences, we can transform Equation 4.5 into

$$c(S,T) = (|X_S| + |X_T|)\Big(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e)\Big) + \sum_{v \in X_S} g - \sum_{e \in E(G(X_S))} w(e) - \sum_{\triangle \in Tri(G(X_S))} w(\triangle), \quad (4.6)$$

which can clearly be rearranged into the equation in Lemma 4.3.3.
The correctness of Lemma 4.3.3 immediately shows the correctness of Lemma 4.3.1. In particular, the obtained expression for $c(S_m, T_m)$ will help us direct the binary search, which is discussed in detail next.

Update flow network. $tryOpt$ in Algorithm 4 updates/tunes the created $N$. Essentially, in every iteration it replaces the value of $g$ with the updated $mid$ for the edges created in line 8 of Algorithm 5.
4.3.5 Correctness and Time Complexity
Why do min s-t cuts direct the search? After introducing some notations, we show with a lemma that the min s-t cut of $N$ involving $g$ can help us determine whether the currently guessed contextual weighted density is higher or lower than the optimal one.

Notations. $H_o$ denotes the CC, i.e., $H_o = \arg\max_H\{AD(H) \mid H \subseteq G\}$, and the corresponding optimal contextual weighted density is $AD(H_o)$. $H'_o$ denotes the candidate CC obtained from the updated $N$. $g$ denotes a guessed contextual weighted density.
Lemma 4.3.4. Given a min s-t cut $(S_m, T_m)$: if $S_m \setminus \{s\} \neq \emptyset$, then $g < AD(H'_o)$; otherwise, if $S_m \setminus \{s\} = \emptyset$, then $g \ge AD(H'_o)$.
Proof sketch. First of all, given the cut with $S = \{s\}$, the cost of the cut is $|V(G)|(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e))$. Since $(S_m, T_m)$ is the minimum cut, we have the inequality $c(S_m, T_m) \le |V(G)|(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e))$. To prove the lemma, for convenience we instead prove that its contrapositive holds, which is: 1) if $g \ge AD(H'_o)$, then $S_m \setminus \{s\} = \emptyset$; and 2) if $g \le AD(H'_o)$, then $S_m \setminus \{s\} \neq \emptyset$, where $H'_o = G(X_{S_m})$.

We prove 1) by contradiction. Assume $g < AD(G(X_{S_m}))$ while $S_m \setminus \{s\} = \emptyset$. But $S_m \setminus \{s\} = \emptyset$ only happens when $g \ge AD(G(X_{S_m}))$, as in this case $S = \{s\}$ and $c(S_m, T_m) = |V(G)|(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e))$, a contradiction. Hence $g \ge AD(G(X_{S_m}))$ holds.

Now for 2), assume there exists a $G(X_{S'})$ with contextual weighted density $ad'$ such that $g \ge ad'$ while $S' \setminus \{s\} \neq \emptyset$. The value $c(S'_m, T'_m)$ now becomes $|V(G)|(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e)) + |X_{S'}|(g - ad')$. Based on the derived inequality, $c(S'_m, T'_m)$ shall be no greater than $|V(G)|(\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E} w(e))$, which holds if and only if $g \le ad'$. This contradicts our assumption that $g \ge ad'$. As such, the second part must hold.
How does the binary search stop? To ensure that the binary search converges to the CC within a finite number of steps, we have to show that there is a finite range of contextual weighted densities. The range of possible densities is:

$$\Big\{\frac{w}{n}\ \Big|\ 0 \le w \le |Q| \times \big(6 C_{|V(N)|}^{3} + 2|V(N)|^2\big),\ 1 < n \le |V(G)|\Big\} \quad (4.7)$$

We now show that the smallest search interval between different densities ($min\ interval$ in Algorithm 4) is $\frac{1}{n(n-1)}$. An interval between two densities can be expressed as $\frac{w_1}{n_1} - \frac{w_2}{n_2}$, i.e., $\frac{n_2 w_1 - n_1 w_2}{n_1 n_2}$. When $n_1 = n_2$, since the minimum difference between $w_1$ and $w_2$ is $1$, the minimum interval is $\frac{1}{n}$. When $n_1 \neq n_2$, we have $n_1 n_2 \le n(n-1)$ and $n_2 w_1 - n_1 w_2 \ge 1$, which together make the minimum interval $\frac{1}{n(n-1)}$. Further, Equation 4.7 can guide the initial search interval for Algorithm 4.
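The binary search skeleton can be sketched as follows (a hedged sketch in Python: `is_feasible(g)` stands in for the min s-t cut test of Lemma 4.3.4, reporting whether some subgraph has density strictly greater than the guess `g`; the oracle and names are illustrative):

```python
def binary_search_density(lo, hi, n, is_feasible):
    """Binary search over density guesses, stopping once the search
    interval drops below the smallest possible density gap 1/(n(n-1))."""
    min_interval = 1.0 / (n * (n - 1))
    while hi - lo >= min_interval:
        mid = (lo + hi) / 2
        if is_feasible(mid):
            lo = mid          # guess too low: some subgraph beats it
        else:
            hi = mid          # guess too high: no subgraph beats it
    return lo                 # within min_interval of the optimum
```

Since the optimum always stays inside `[lo, hi]` and the gap argument above bounds how close two distinct densities can be, the returned value identifies the optimal density after $O(\log(|V(G)||Q|))$ oracle calls.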
The termination condition of Algorithm 4 can now be determined from the minimum search interval and the conditions in Lemma 4.3.4; it is formally stated as follows.

Stop Condition 4.3.1. There is a community $H_o$ with density score $AD(H_o)$, while there is no community $H'_o$ whose score is greater than $AD(H_o)$ but less than $AD(H_o) + \frac{1}{n(n-1)}$.
Correctness. Now we are ready to present the correctness theorem.
Theorem 4.3.1. Algorithm 4 correctly finds the CC with the aid of the proposed auxiliary flow network.
The theorem is immediate from the correctness of Lemma 4.3.4 and Stop Con-
dition 4.3.1.
Time complexity. The time complexity of Algorithm 4 is $O(\log(|V(G)||Q|) \times |V(N)|^3)$. The time to construct $N$ is dominated by triangle counting, $O(|E(G)|^{3/2})$, and we only need to construct $N$ once and then tune the guessed weighted density. The total number of binary search iterations is bounded by $\log(|V(G)||Q|)$. For each iteration, we need to run a min s-t cut algorithm on $N$, which is bounded by $O(|V(N)|^3)$ using the first-in first-out (FIFO) version of the preflow algorithm proposed in [40].
Algorithm 6: Iterative optimization algorithm
Data: G
1 $H_o \leftarrow G$;
2 $ad_o \leftarrow AD(H_o)$;
3 $H'_o \leftarrow \arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$;
4 while $l(ad_o, H_o \leftarrow H'_o) \neq 0$ do
5     $ad_o \leftarrow AD(H_o)$;
6     $H'_o \leftarrow \arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$;
7 return $H_o$;
4.4 An Improved Approach
Mathematically, the objective function of the CC model maximises a fractional density function. Following this observation, we propose an algorithm that solves the CC search problem via an iterative optimization framework. A straightforward implementation of this framework based on network flows is easy to build, with time complexity $O(|V(G)| \times |V(N)|^3)$. However, as shown in the next section, we further achieve a more sophisticated implementation taking $O(|V(N)|^3)$ time via the technique of parametric maximum flow. In this section, we focus on showing how the optimization framework guarantees to find the CC, and in the next section we show its runtime improvement.
4.4.1 Optimization Framework
The overall idea of the algorithmic framework is as follows. We start off by considering the whole graph $G$ as a candidate CC and then use the stop condition in line 4 of Algorithm 6 (explained later) to check whether $G$ itself is a CC. If not, we generate a better solution by solving the subproblem $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$, in which $l(ad, H')$ is defined as:

$$l(ad, H') = \sum_{\triangle \in Tri(H')} w(\triangle) + \sum_{e \in E(H')} w(e) - ad \times |V(H')| \quad (4.8)$$

and check the optimality again. Improved solutions are repeatedly generated in this way until the stop condition is met, i.e., a CC is found. The detailed steps are displayed in Algorithm 6.
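The framework above can be made concrete with a brute-force miniature in Python: the min-cut subproblem is replaced by exhaustive enumeration over vertex subsets (feasible only on toy graphs), only edge weights are used (the triangle term of Equation 4.8 is omitted for brevity), and the example graph is hypothetical:

```python
from itertools import combinations

edges = {(1, 2): 4.0, (2, 3): 4.0, (1, 3): 4.0, (3, 4): 1.0}

def l_value(ad, vs):
    # l(ad, H') = weight inside the induced subgraph - ad * |V(H')|
    return sum(w for (u, v), w in edges.items()
               if u in vs and v in vs) - ad * len(vs)

def density(vs):
    return sum(w for (u, v), w in edges.items()
               if u in vs and v in vs) / len(vs)

def iterate(vertices):
    ho = set(vertices)                   # start from the whole graph
    while True:
        ad = density(ho)
        cands = [set(c) for r in range(1, len(vertices) + 1)
                 for c in combinations(vertices, r)]
        best = max(cands, key=lambda c: l_value(ad, c))
        if l_value(ad, best) <= 0:       # stop condition of line 4
            return ho
        ho = best                        # strictly denser candidate
```

On this toy graph the triangle $\{1, 2, 3\}$ has density $4$, beating the whole graph's $13/4$, and the loop converges to it in one improvement step.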
In fact, in theoretical computer science and network optimization research [38, 56], such a framework has been adopted to solve various optimization problems, including the degree-based densest subgraph problem. However, it is non-trivial to prove that the framework, together with our carefully designed Algorithm 7, actually solves the CC search problem involving contextually weighted edges and triangles. Further, how fast the problem can be solved with the framework is unclear. Specifically, our challenges are: 1) can lines 4 to 7 in Algorithm 6 ensure that $H_o$ converges to the optimal solution? 2) how can the subproblem $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$ be solved? and 3) how fast can Algorithm 6 run in the worst case? We answer questions 1) and 2) in the following subsections and 3) in the next section.
4.4.2 Algorithm Correctness
Algorithm 6 is clearly correct if $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$ (line 6) always returns an $H'_o$ having a higher density score, i.e., the current iteration generates a candidate community having the highest contextual weighted density so far. We formally prove this property with the lemmas below. Notice that in the following arguments we use the same symbols as in Algorithm 6.
Lemma 4.4.1. Let $ad' = AD(H'_o)$, where $H'_o$ is generated by Algorithm 6; then $ad' > ad_o$ always holds before the algorithm terminates.

Proof sketch. By definition, we have $l(ad', H'_o) = \sum_{\triangle \in Tri(H'_o)} w(\triangle) + \sum_{e \in E(H'_o)} w(e) - ad' \times |V(H'_o)| = 0$. After subtracting $ad_o \times |V(H'_o)|$ from both sides of the equation, we get $\sum_{\triangle \in Tri(H'_o)} w(\triangle) + \sum_{e \in E(H'_o)} w(e) - ad_o \times |V(H'_o)| = ad' \times |V(H'_o)| - ad_o \times |V(H'_o)|$. Therefore, if the left side of the equation is greater than $0$, then $ad' > ad_o$ must hold. To establish that $\sum_{\triangle \in Tri(H'_o)} w(\triangle) + \sum_{e \in E(H'_o)} w(e) - ad_o \times |V(H'_o)| > 0$, we prove an auxiliary lemma as follows.
Lemma 4.4.2. Let $ad_o = AD(H_o)$; then $l(ad_o, H'_o) > 0$ before Algorithm 6 terminates.

Proof sketch. Since $H'_o \leftarrow \arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$, by definition $\max_{H'} l(ad_o, H') > l(ad_o, H_o)$ must hold before the algorithm terminates. Since $l(ad_o, H_o) = 0$ by definition, the value of any subgraph returned by $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$ is strictly greater than $0$. Hence the correctness of the lemma is clear, which in turn ensures the correctness of Lemma 4.4.1.
Iteration stop condition. Next, we show the stop condition. Let $ad_o = AD(\arg\max_H\{AD(H) \mid H \subseteq G\})$, which will be reached after a certain number of iterations by Lemma 4.4.1; then $l(ad_o, H_o \leftarrow H'_o) = 0$ means that $H_o$ is the result of $\arg\max_H\{AD(H) \mid H \subseteq G\}$.

In conclusion, Lemmas 4.4.1 and 4.4.2 and the stop condition together guarantee the correctness of Algorithm 6.
4.4.3 Solving the Subproblem
Here we prove that the optimization problem $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$ can be solved by finding a min s-t cut in our carefully constructed flow network. In particular, we first show in Algorithm 7 the detailed construction of the flow network $N$, given the current $ad_o$ and the social network $G$. Note that this new construction is quite different from the previous one: edge nodes are also included, and the structural mapping $G \to N$ becomes more transparent with the added infinite capacities.

Algorithm 7: Flow network for solving a subproblem
Data: G
1  $V(N) \leftarrow \emptyset$, $E(N) \leftarrow \emptyset$, $C(N) \leftarrow \emptyset$;
2  create vertices $s$ and $t$, $V(N) \leftarrow V(N) \cup \{s\} \cup \{t\}$;
3  for $\triangle \in Tri(G)$ or $v \in V(G)$ or $e \in E(G)$ do
4      create a node $n$, $V(N) \leftarrow V(N) \cup \{n\}$;
5      if $n$ is a triangle $(u, v, w)$ then
6          create a directed edge from $s$ to $\triangle$ with capacity $w((u, v, w))$;
7          for each edge node $n'$ representing $(u, v)$, $(u, w)$, $(v, w)$ do
8              create a directed edge $((u, v, w), n')$ with infinite capacity;
9      if $n$ is an edge $(u, v)$ then
10         create a directed edge from $s$ to $e$ with capacity $w(e)$;
11         for each vertex node $n'$ representing $u$, $v$ do
12             create a directed edge $((u, v), n')$ with infinite capacity;
13     if $n$ is a vertex node then
14         create a directed edge $(n, t)$ with capacity $ad_o$;
15 return $N$;
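The construction in Algorithm 7 can be sketched as follows (a hedged sketch in Python: `edge_w` and `tri_w` map sorted vertex tuples to weights, `INF` stands for the infinite capacities, and the node-tagging scheme is illustrative):

```python
INF = float("inf")

def build_subproblem_network(vertices, edge_w, tri_w, ad_o):
    """Edges of Algorithm 7's flow network as (u, v, capacity) triples.
    Node names: 's', 't', ('v', x), ('e', x, y), ('t3', x, y, z);
    edge/triangle keys are assumed to use sorted endpoints."""
    net = []
    for (x, y, z), w in tri_w.items():
        net.append(('s', ('t3', x, y, z), w))           # line 6
        for a, b in ((x, y), (x, z), (y, z)):           # lines 7-8
            net.append((('t3', x, y, z), ('e', a, b), INF))
    for (x, y), w in edge_w.items():
        net.append(('s', ('e', x, y), w))               # line 10
        net.append((('e', x, y), ('v', x), INF))        # lines 11-12
        net.append((('e', x, y), ('v', y), INF))
    for x in vertices:
        net.append((('v', x), 't', ad_o))               # line 14
    return net
```

Because all infinite-capacity edges point "downward" from triangles to edges to vertices, any finite s-t cut that keeps a triangle node on the $s$ side must also keep its edge and vertex nodes there, which is exactly the validity property of Lemma 4.4.3.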
Next, we show the equivalence between the problems of solving $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$ and finding the min s-t cut in $N$. We achieve this by the following lemmas.
Lemma 4.4.3. Valid cut. For every s-t cut in $N$ that breaks no edges of infinite capacity, the nodes in $S$ correspond to a valid subgraph $H$ of $G$, i.e., (1) if an edge node is in $S$, the two involved vertex nodes are in $S$, and (2) if a triangle node is in $S$, the three involved vertex nodes and edge nodes are in $S$.

Lemma 4.4.3 is obvious from the construction in Algorithm 7.
Lemma 4.4.4. The min s-t cut of $N$ corresponds to a solution of $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$.

Proof sketch. Firstly, the min s-t cut must be a valid cut as defined by Lemma 4.4.3. Then we show the following correspondence. Let $(S_m, T_m)$ be the min s-t cut and let $(S, T)$ be any valid s-t cut. Since both are valid cuts, $c(S_m, T_m) = \sum_{v \in S_m} c(v,t) + \sum_{v \in T_m} c(s,v)$ and $c(S, T) = \sum_{v \in S} c(v,t) + \sum_{v \in T} c(s,v)$. Clearly, the inequality $c(S_m, T_m) \le c(S, T)$ holds, i.e., $\sum_{v \in S_m} c(v,t) + \sum_{v \in T_m} c(s,v) \le \sum_{v \in S} c(v,t) + \sum_{v \in T} c(s,v)$. We will show the detailed transformations of this inequality leading to Equation 4.8.

Now, by the flow network we constructed, $\sum_{v \in T_m} c(s,v)$ can be expressed as $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - (\sum_{\triangle \in T(S)} w(\triangle) + \sum_{e \in E(S)} w(e))$, where $T(S)$ denotes the triangles represented by triangle nodes in $S$ and $E(S)$ denotes the set of edges represented by edge nodes in $S$. Since a min s-t cut is a valid cut (defined by Lemma 4.4.3), the vertex, edge, and triangle nodes in $S \setminus \{s\}$ form some $H_o$, and $H_o$ contains the triangles and edges represented by $T(S)$ and $E(S)$. Therefore, we can rewrite the expression as $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - (\sum_{\triangle \in Tri(H_o)} w(\triangle) + \sum_{e \in E(H_o)} w(e))$.

Next, we alternatively express $\sum_{v \in S_m} c(v,t)$. Clearly, if $(v,t)$ exists in $E(N)$, $v$ must be a vertex node by construction, and the vertex nodes in $S_m$ must be contained in $H_o$. As a result, $\sum_{v \in S_m} c(v,t)$ can be expressed as $\sum_{v \in V(H_o)} ad_o$, i.e., $ad_o \times |V(H_o)|$.

The left part of the inequality now becomes $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - (\sum_{\triangle \in Tri(H_o)} w(\triangle) + \sum_{e \in E(H_o)} w(e)) + ad_o \times |V(H_o)|$. By the same approach, the right part of the inequality can be expressed as $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - (\sum_{\triangle \in Tri(H)} w(\triangle) + \sum_{e \in E(H)} w(e)) + ad_o \times |V(H)|$, in which $H$ is the subgraph consisting of the vertex nodes in $S$ together with all triangles and edges represented by the triangle and edge nodes in $S$. By negating both sides of the newly expressed inequality and simplifying, we get $\sum_{\triangle \in Tri(H_o)} w(\triangle) + \sum_{e \in E(H_o)} w(e) - ad_o \times |V(H_o)| \ge \sum_{\triangle \in Tri(H)} w(\triangle) + \sum_{e \in E(H)} w(e) - ad_o \times |V(H)|$. This inequality clearly shows that the min s-t cut is equivalent to finding the solution of $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$.
4.4.4 Analysing the Number of Iterations
At most $n$ iterations. We show that the number of iterations in Algorithm 6 is at most $n$, where $n = |V(G)|$, thereby implying a first runtime complexity of $O(|V(G)| \times |V(N)|^3)$. Note that in the next section we show how to avoid this factor of $n$ via an incremental parametric flow framework. The claim of a linear number of iterations is formally proved as follows.
Proof sketch. Given any three consecutive iterations, let $H_o$, $H'_o$ and $H''_o$ be the candidates generated by Algorithm 6. Accordingly, let $ad = AD(H_o)$, $ad' = AD(H'_o)$ and $ad'' = AD(H''_o)$. By definition, $H'_o$ is the result of $\arg\max_{H'}\{l(ad, H') \mid H' \subseteq G\}$. Now consider generating $H''_o$ from $H'_o$. The vertices contained in $H''_o$ can be categorised as follows. Firstly, it contains some vertices that are in $V(H'_o)$, denoted by $V_1 \subseteq V(H'_o)$. Secondly, it contains some vertices that are not in $V(H'_o)$, denoted by $V_2$. If we prove that $V_2$ is always $\emptyset$, we can conclude that for any consecutively generated $H'_o$ and $H''_o$, $H''_o \subset H'_o$ always holds.

We prove this by contradiction. Assume $V_2 \neq \emptyset$; we show that this implies $H'_o$ is not the optimum result of $\arg\max_H\{l(ad, H) \mid H \subseteq G\}$, which contradicts the precondition that $H'_o$ is that optimum, and thereby proves that $V_2$ must be $\emptyset$. The derivation is as follows.

By definition, $H''_o$ must be $\arg\max_H\{l(ad', H) \mid H \subseteq G\}$, which leads to the inequality $l(ad', H''_o) \ge l(ad', G(V_1))$. From this inequality we get $AD(G(V_2)) \ge ad'$, which is derived based on $V(H''_o) = V_1 \cup V_2$. Now we show that $AD(G(V_2)) \ge ad'$ implies $l(ad, G(V(H'_o) \cup V_2)) > l(ad, H'_o)$. Based on the fact that $G(V(H'_o) \cup V_2)$ contains $H'_o \cup G(V_2)$, the difference $l(ad, G(V(H'_o) \cup V_2)) - l(ad, H'_o)$ is at least $l(ad, G(V_2))$. Since by Lemma 4.4.1 $ad' > ad$, and $AD(G(V_2)) \ge ad'$ gives $\sum_{\triangle \in Tri(G(V_2))} w(\triangle) + \sum_{e \in E(G(V_2))} w(e) \ge |V_2|\, ad'$, while $l(ad, G(V_2)) = \sum_{\triangle \in Tri(G(V_2))} w(\triangle) + \sum_{e \in E(G(V_2))} w(e) - |V_2|\, ad$, we can see that $l(ad, G(V_2)) \ge |V_2|\, ad' - |V_2|\, ad > 0$. That is, $l(ad, G(V(H'_o) \cup V_2)) > l(ad, H'_o)$. The derived inequality shows that if $V_2 \neq \emptyset$, $H'_o$ is not the optimum solution of $\arg\max_H\{l(ad, H) \mid H \subseteq G\}$, whereas $H'_o$ should be optimum. Therefore, we conclude that for any consecutively generated $H'_o$ and $H''_o$, $H''_o \subset H'_o$ always holds.
4.5 The Incremental Parametric Maximum Flow
From the previous discussion, we can observe that Algorithm 6 solves a series of strongly correlated maximum flow problems to find the CC. In this section, we show that by modifying the proposed flow network, we can apply the faster parametric maximum flow algorithm, which incrementally updates the flow network instead of rebuilding it.
Intuition. Algorithm 6 successively computes maximum flow problems that are closely related, i.e., during the course of the iterations, the structure of the flow network barely changes except for the value of $ad_o$.

Indeed, in [38], Gallo et al. investigated exactly such situations and proposed the so-called parametric maximum flow technique, which keeps and incrementally updates the state of the successive flow networks. It turns out that successively solving the maximum flow problems is computationally equivalent to solving only one maximum flow problem, thereby dramatically speeding up the whole computation.

However, it is unclear whether this method can be adopted to solve the CC search problem much faster (avoiding $n$ rounds of flow computation in our case). In this section, we give an affirmative answer by slightly modifying the previously constructed flow network. We also prove that the new network indeed correctly finds the CC with the parametric maximum flow algorithm.
In the following, we first provide preliminaries about parametric maximum flow and then apply the technique to CC search.
4.5.1 Preliminaries
The parametric maximum flow algorithm is based on the preflow algorithm, which computes a maximum flow in a given flow network. Two concepts are essential before presenting the preflow algorithm: preflow and valid labelling.

Preflow. A preflow $f$ on $N$ is a real-valued function on node pairs satisfying the capacity and antisymmetry constraints discussed in Section 4.3.1, together with the following relaxation of the conservation constraint:

$$\sum_{u \in V(N)} f(u, v) \ge 0 \quad \text{for all } v \in V(N) \setminus \{s\}. \quad (4.9)$$

Given a preflow, the excess $e(v)$ of a node $v$ is defined as $\sum_{u \in V(N)} f(u, v)$ if $v \neq s$, or infinity if $v = s$. The value of the preflow is defined as $e(t)$. A node $v \in V(N) \setminus \{s, t\}$ is called active if $e(v) > 0$. A preflow is a flow if and only if $e(v) = 0$ for all $v \in V(N) \setminus \{s, t\}$. A node pair $(u, v)$ is a residual directed edge for $f$ if $f(u, v) < c(u, v)$, and the difference $c(u, v) - f(u, v)$ is the residual capacity of the directed edge. A pair $(u, v)$ that is not a residual directed edge is saturated.
Valid labelling. A valid labelling $d$ for a preflow $f$ is a function from the nodes to the non-negative integers or infinity, such that $d(t) = 0$, $d(s) = |V(N)|$, and $d(u) \le d(v) + 1$ for every residual directed edge $(u, v)$. The residual distance $d_f(u, v)$ from a node $u$ to a node $v$ is the minimum number of directed edges on a residual path from $u$ to $v$, or infinity if no such path exists. If $d$ is a valid labelling, then $d(v) \le \min\{d_f(v, t), d_f(v, s) + n\}$ for any node $v$.
Preflow algorithm. The algorithm repeats one of the following two operations on an active node $u$ until no active node remains. When the algorithm terminates, the preflow $f$ is a maximum flow.

Push operation. Let $v$ be a forward neighbour of $u$. If $d(u) = d(v) + 1$ and $f(u, v) < c(u, v)$, push a flow value of $\min\{e(u), c(u, v) - f(u, v)\}$ from $u$ to $v$.

Relabelling operation. If $d(u) \le d(v)$ for all forward neighbours $v$ of $u$, relabel $d(u)$ to $\min\{d(v) \mid v \in \text{forward neighbours of } u\} + 1$.
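The push and relabel operations above can be assembled into a compact (unoptimised) preflow algorithm. The sketch below is a hedged illustration in Python, not the FIFO or dynamic-tree variant analysed later: it uses a queue of active nodes, integer node ids $0..n-1$, and a dict of capacities; all names are illustrative:

```python
from collections import deque

def max_flow_push_relabel(n, cap, s, t):
    """Generic push-relabel max flow; cap is {(u, v): capacity}."""
    c = {}
    for (u, v), w in cap.items():          # add residual back-edges
        c[(u, v)] = c.get((u, v), 0) + w
        c.setdefault((v, u), 0)
    f = {e: 0 for e in c}
    excess, label = [0] * n, [0] * n
    label[s] = n                            # d(s) = |V(N)|
    adj = [[] for _ in range(n)]
    for (u, v) in c:
        adj[u].append(v)
    active = deque()
    for v in adj[s]:                        # saturate edges out of s
        d = c[(s, v)]
        if d > 0:
            f[(s, v)] += d; f[(v, s)] -= d
            excess[v] += d
            if v != t:
                active.append(v)
    while active:
        u = active.popleft()
        while excess[u] > 0:                # discharge u
            pushed = False
            for v in adj[u]:
                if c[(u, v)] - f[(u, v)] > 0 and label[u] == label[v] + 1:
                    d = min(excess[u], c[(u, v)] - f[(u, v)])
                    f[(u, v)] += d; f[(v, u)] -= d
                    excess[u] -= d; excess[v] += d
                    if v not in (s, t) and v not in active:
                        active.append(v)
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:                  # no admissible edge: relabel
                label[u] = 1 + min(label[v] for v in adj[u]
                                   if c[(u, v)] - f[(u, v)] > 0)
    return excess[t]                        # value of the max flow
```

When the queue empties, no active node remains, so the preflow is a flow, and its value is the excess accumulated at $t$.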
Computation of the min s-t cut. After running the preflow algorithm, a minimum cut can be found as follows. For each node $u \in V(N)$, replace $d(u)$ by $\min\{d_f(u, s) + |V(N)|, d_f(u, t)\}$. Then the cut $(S, T)$ defined by $S = \{u \mid d(u) \ge |V(N)|\}$ is a minimum cut whose sink partition $T$ is of minimum size. If desired, a cut $(S', T')$ with minimum-size $S'$ can be computed as follows: for each $u \in V(N)$, let $d'(u) = \min\{d_f(s, u), d_f(t, u) + |V(N)|\}$, and let $S' = \{u \mid d'(u) < |V(N)|\}$.
Algorithm 8: Parametric preflow algorithm
Data: G
1 $H_o \leftarrow G$, $ad_o \leftarrow AD(G)$;
2 construct the parametric flow network $N_R$ and set $\lambda = ad_o$; obtain $H'_o$ from the min s-t cut of $N_R$;
3 while $l(ad_o, H_o \leftarrow H'_o) \neq 0$ do
4     $ad_o \leftarrow AD(H_o)$, $\lambda \leftarrow ad_o$;
5     obtain $H'_o$ from the min s-t cut of $N_R$;
6 return $H_o$;
Time complexity. We note the bounds derived in [38] as follows.

Lemma 4.5.1. For any active node $v \in V(N)$, $d(v) \le 2|V(N)| - 1$. The value of $d(v)$ never decreases during the run of the algorithm. The total number of relabel steps is therefore $O(|V(N)|^2)$.

Lemma 4.5.2. The number of saturating push steps through any residual directed edge $(u, v)$ is at most one per value of $d(v)$. The total number of saturating push steps is therefore $O(|V(N)||E(N)|)$; there is an implementation in which each such step takes constant time.

Lemma 4.5.3. The total number of non-saturating push steps is $O(|V(N)|^2 |E(N)|)$; there is an implementation in which each such step takes constant time.

Lemma 4.5.4. There is a FIFO version of the preflow algorithm with time complexity $O(|V(N)|^3)$. There is a dynamic tree version of the preflow algorithm with time complexity $O(|V(N)||E(N)| \log(|V(N)|^2 / |E(N)|))$.
4.5.2 Parametric Flow Framework
Algorithm 8 shows how to find the CC using a tailored parametric preflow algorithm. It treats the progressively modified $ad_o$ as a parameterised capacity $\lambda$ in $N_R$, where $N_R$ will be discussed later in great detail. The overall structure of the algorithm is similar to Algorithm 6, i.e., it continuously generates $H_o$ with higher contextual weighted density until reaching the stop condition.
Incremental computation. During each iteration, the algorithm internally maintains the preflow labels by updating the labels computed in the previous iteration. Further, in order to compute $H'_o$, the preflow values and some edge capacities are updated according to the $H_o$ generated in the previous iteration. More details about the update procedure are discussed after introducing our newly designed parametric flow network $N_R$.
4.5.3 CC Parametric Flow Network
In this subsection, we present the parametric flow network $N_R$ used in Algorithm 8 and prove that it makes Algorithm 8 find the CC correctly.

Parametric flow network basics. A parametric flow network is a flow network in which the edge capacities are functions of a parameter $\lambda$ satisfying the following constraints. Firstly, $c_\lambda(s, v)$ is a nondecreasing function of $\lambda$ for all $v \neq t$. Secondly, $c_\lambda(v, t)$ is a nonincreasing function of $\lambda$ for all $v \neq s$. Thirdly, $c_\lambda(u, v)$ is constant for all $u \neq s$ and $v \neq t$. The maximum flow or minimum cut of a parametric network is the maximum flow or minimum cut w.r.t. a particular value of the parameter $\lambda$.
Parametric flow network for CC search. The $N_R$ in Algorithm 8 can be generated using Algorithm 7 by: (1) reversing the direction of all edges, (2) exchanging $s$ and $t$, and (3) keeping the same capacities for all edges except those directed from $s$. The capacities of the edges directed from $s$ are considered as a function of $\lambda$, i.e., $c_\lambda(s, v) = \lambda$, and $\lambda$ is updated to the newest generated $ad_o$. That is, in our designed $N_R$, $\lambda \in \{ad_0, \ldots, ad_o\}$, where $ad_0 = AD(G)$ and $ad_o = AD(H_o)$; when the iteration stops, $H_o$ is the CC. The designed $N_R$ clearly satisfies the capacity constraints of a parametric flow network.
To ensure the correctness of Algorithm 8, $N_R$ shall satisfy the two constraints below, which we prove with corresponding lemmas.

Objective invariant. The min s-t cut of $N_R$ shall be equivalent to $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$.

Valid preflow and labelling invariant. After replacing the capacities of all edges $\{(s, v)\}$ in $E(N_R)$ with the newly generated $ad_o$ and setting their flows to $ad_o$, the updated $N_R$ shall still maintain a valid preflow and labelling.
Lemma 4.5.5. The min s-t cut of $N_R$ for some $ad_o$ is equivalent to $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$.

Proof sketch. The overall proof idea is similar to that of Lemma 4.4.4. Firstly, for any min s-t cut, the vertex nodes in $T$ correspond to a valid subgraph $H_o$ of $G$; otherwise, the s-t cut must break some edges with infinite capacity.

Let $(S_m, T_m)$ be a min s-t cut; such a cut breaks no edges with infinite capacity. Therefore, $c(S_m, T_m)$ can be expressed as $\sum_{v \in S_m} c(v, t) + \sum_{v \in T_m} c(s, v)$. Then, for any other s-t cut $(S, T)$ that does not break edges with infinite capacity, the inequality $c(S_m, T_m) \le c(S, T)$ shall hold.

Now, $\sum_{v \in S_m} c(v, t) + \sum_{v \in T_m} c(s, v)$ equals $\sum_{\triangle \in Tri(S_m)} w(\triangle) + \sum_{e \in E(S_m)} w(e) + \sum_{v \in V(T_m)} ad_o$, in which $Tri(S_m)$ denotes the set of triangles represented by triangle nodes in $S_m$, $E(S_m)$ denotes the set of edges represented by edge nodes in $S_m$, and $V(T_m)$ denotes the set of vertices represented by vertex nodes in $T_m$. Since $\sum_{\triangle \in Tri(S_m)} w(\triangle) + \sum_{e \in E(S_m)} w(e)$ equals $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - \sum_{\triangle \in Tri(T_m)} w(\triangle) - \sum_{e \in E(T_m)} w(e)$, we can derive that $c(S_m, T_m) = \sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - \sum_{\triangle \in Tri(T_m)} w(\triangle) - \sum_{e \in E(T_m)} w(e) + \sum_{v \in V(T_m)} ad_o$.

Applying similar operations to $(S, T)$, we get the inequality: $\sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - \sum_{\triangle \in Tri(T_m)} w(\triangle) - \sum_{e \in E(T_m)} w(e) + \sum_{v \in V(T_m)} ad_o \le \sum_{\triangle \in Tri(G)} w(\triangle) + \sum_{e \in E(G)} w(e) - \sum_{\triangle \in Tri(T)} w(\triangle) - \sum_{e \in E(T)} w(e) + \sum_{v \in V(T)} ad_o$. After simplification we get: $\sum_{\triangle \in Tri(T_m)} w(\triangle) + \sum_{e \in E(T_m)} w(e) - \sum_{v \in V(T_m)} ad_o \ge \sum_{\triangle \in Tri(T)} w(\triangle) + \sum_{e \in E(T)} w(e) - \sum_{v \in V(T)} ad_o$. Considering $H_o$ as the subgraph induced by the vertex nodes in $T_m$, the equivalence between the min s-t cut of $N_R$ and $\arg\max_{H'}\{l(ad, H') \mid ad = ad_o, H' \subseteq G\}$ is obvious.
Lemma 4.5.6. After replacing the capacities of all edges $\{(s, v)\}$ in $E(N_R)$ with the newly generated $ad_o$ and setting their flows to $ad_o$, the updated $N_R$ still maintains a preflow and a valid labelling.

Proof sketch. After calculating $ad_o$ with the push-relabel algorithm, the preflow in the previous $N_R$ has become a flow. The push-relabel algorithm maintains a valid labelling throughout its run, and replacing the capacities of all $\{(s, v)\}$ in $E(N_R)$ with the newly generated $ad_o$ does not modify the labels; therefore, the valid labelling is still maintained after updating the capacities.

On the other hand, after deriving the new $ad_o$, $N_R$ holds a maximum flow, i.e., $e(v) = 0$ shall hold for every node $v \in V(N_R) \setminus \{s, t\}$. After setting the flows of all $\{(s, v)\}$ in $E(N_R)$ to the newly generated $ad_o$, the flow in $N_R$ becomes a preflow, since the new $ad_o$ is greater than the previously generated $ad_o$.
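The capacity and flow update that Lemma 4.5.6 reasons about can be sketched as follows (a hedged, dict-based illustration; the state layout and names are hypothetical):

```python
def update_source_edges(cap, flow, excess, source_edges, new_ad):
    """Parametric update step: raise lambda on every (s, v) edge to the
    new density guess and saturate it, so the previous maximum flow
    becomes a preflow again while the labels stay untouched."""
    for (s, v) in source_edges:
        delta = new_ad - flow[(s, v)]   # non-negative: ad_o never decreases
        cap[(s, v)] = new_ad            # c_lambda(s, v) = lambda = new ad_o
        flow[(s, v)] = new_ad           # keep the source edge saturated
        excess[v] += delta              # v may become active again
```

Only the nodes adjacent to $s$ regain excess, so the subsequent discharge resumes from the previous labels rather than from scratch, which is the source of the parametric speed-up.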
The correctness of Lemmas 4.5.5 and 4.5.6 guarantees that the designed NR
makes Algorithm 8 output CC.
4.5.4 Time Complexity
The time complexity of Algorithm 8 is $O(|V(N)|^3)$. We briefly discuss the proof idea. Overall, even though Algorithm 8 progressively modifies the preflow, it never reduces $d(v)$ for any $v \in V(N) \setminus \{s, t\}$ during its run. Since $d(v)$ is also bounded by $2|V(N)| - 1$ (Lemma 4.5.1), Algorithm 8 can be considered as solving a single min s-t cut problem. A more detailed proof leading to this faster implementation can be found in [38].
4.6 Approximation Algorithm
We have demonstrated via exact algorithms that CC can be found in polynomial time. However, despite the solution optimality, the runtime bottleneck lies in finding a min s-t cut in a large flow network. For further scalability of CC search, we instead look for a greedy approximation algorithm that trades solution accuracy for speed. Our proposed algorithm is inspired by the peeling strategy in [97, 61]. In our case, the algorithm iteratively removes the vertex contributing the least to the context score. The main challenge is to derive a guaranteed approximation ratio and runtime. Next, we present this approximation algorithm as Algorithm 9, followed by its approximation ratio and time complexity.
Algorithm 9: Approximate CC
Data: G
1: Ho ← ∅;
2: d_o ← 0;
3: while V(G) ≠ ∅ do
4:   v′ ← argmin_{v∈V(G)} {∑_{△∈Tri(v,G)} w(△) + ∑_{e∈E(v,G)} w(e)};
5:   E(G) ← E(G) \ E(v′, G); V(G) ← V(G) \ {v′}; G ← (V(G), E(G));
6:   if AD(G) > d_o then
7:     Ho ← G; d_o ← AD(G);
8: return Ho;
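The peeling loop above can be sketched as a minimal runnable illustration (our own simplification, not the thesis implementation: weights are dropped, so `AD` here counts unweighted triangles and edges, and `d_o` tracks the best density seen):

```python
from itertools import combinations

def triangles(adj):
    """Enumerate triangles as sorted vertex triples."""
    tris = set()
    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj[v]:
                tris.add(tuple(sorted((u, v, w))))
    return tris

def approx_cc(adj):
    """Greedy peeling: repeatedly remove the vertex with the smallest
    triangle + edge contribution; keep the densest subgraph seen."""
    adj = {u: set(vs) for u, vs in adj.items()}
    best, d_o = None, 0.0
    while adj:
        tris = triangles(adj)
        edges = sum(len(vs) for vs in adj.values()) // 2
        ad = (len(tris) + edges) / len(adj)   # unweighted AD(G)
        if ad > d_o:
            best, d_o = set(adj), ad
        # score(v) = #triangles containing v + #edges incident to v
        score = {u: len(adj[u]) for u in adj}
        for t in tris:
            for u in t:
                score[u] += 1
        v = min(score, key=score.get)         # least contributing vertex
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return best, d_o
```

Recomputing all triangles in every iteration is done here only for clarity; the indexed implementation discussed below updates the affected scores incrementally.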
Theorem 4.6.1. Algorithm 9 is a 1/3-approximation for finding CC.
Proof sketch. Let Ho be the optimal community and, for each v ∈ V(Ho), let Tri(v, Ho) be the set of triangles in Tri(Ho) containing v. Further, let E(v, Ho) be the set of edges in E(Ho) containing v. Since AD(Ho) ≥ AD(H′) where H′ = G(V(Ho) \ {v}), we can derive the following inequality:

    ∑_{△∈Tri(v,Ho)} w(△) + ∑_{e∈E(v,Ho)} w(e) ≥ d(Ho)    (4.10)
Now, let H be the subgraph in the iteration when the first vertex v ∈ V(Ho) is removed. Clearly, Ho ⊆ H, and for each vertex u ∈ V(H), by the greediness of Algorithm 9 the following inequality must hold:

    ∑_{△∈Tri(u,H)} w(△) + ∑_{e∈E(u,H)} w(e)
    ≥ ∑_{△∈Tri(v,H)} w(△) + ∑_{e∈E(v,H)} w(e)
    ≥ ∑_{△∈Tri(v,Ho)} w(△) + ∑_{e∈E(v,Ho)} w(e)    (4.11)
Further, by inequalities (4.10) and (4.11), we can conclude that ∑_{△∈Tri(u,H)} w(△) + ∑_{e∈E(u,H)} w(e) ≥ d(Ho).
The total weight of triangles and edges in H can be expressed as

    AD(H) × |V(H)| = ∑_{u∈V(H)} ((1/3) ∑_{△∈Tri(u,H)} w(△) + (1/2) ∑_{e∈E(u,H)} w(e))    (4.12)
Multiplying both sides by 3, we get the following inequality:

    3 × AD(H) × |V(H)| = ∑_{u∈V(H)} (∑_{△∈Tri(u,H)} w(△) + (3/2) ∑_{e∈E(u,H)} w(e))
    ≥ ∑_{u∈V(H)} (∑_{△∈Tri(u,H)} w(△) + ∑_{e∈E(u,H)} w(e))
    ≥ ∑_{u∈V(H)} AD(Ho)    (4.13)
From (4.13), AD(H) ≥ (1/3) AD(Ho) follows immediately.
Time complexity. With simple index structures, the runtime of Algorithm 9 can be bounded by O(|V(G)| log(|V(G)|) + |E(G)| log(|V(G)|) + |Tri(G)|). Intuitively, this is because when removing a vertex v from G, we only need to update the edge and triangle changes for the affected vertices in N(v, G): we can update just the context scores of the affected vertices, and in practice the number of affected edges and triangles is significantly less than |V(G)|. Next, we show how pre-computed indices help us locate the affected vertices, via their involved edges and triangles, after removing a vertex.
For each vertex, we hash the triangles it is involved in. This index takes O(|Tri(G)|) space and helps us quickly identify the affected triangles and their vertices after removing a vertex. In addition, the graph adjacency list helps us find the affected edges and their vertices. As such, there is an implementation of Algorithm 9 that runs in time O(|V(G)| log(|V(G)|) + |E(G)| log(|V(G)|) + |Tri(G)|).
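The per-vertex triangle index can be sketched as follows (a simplified, unweighted illustration; the function names are ours):

```python
from itertools import combinations

def build_triangle_index(adj):
    """Map each vertex to the set of triangles it participates in."""
    index = {u: set() for u in adj}
    for u in adj:
        for v, w in combinations(sorted(adj[u]), 2):
            if w in adj[v]:
                t = tuple(sorted((u, v, w)))
                for x in t:
                    index[x].add(t)
    return index

def remove_vertex(adj, index, v):
    """Delete v; return the vertices whose context scores must be
    updated: v's neighbours (lost edges) and the co-members of v's
    triangles (lost triangles)."""
    affected = set(adj[v])
    for t in index[v]:
        for x in t:
            if x != v:
                affected.add(x)
                index[x].discard(t)
    for u in adj[v]:
        adj[u].discard(v)
    del adj[v]
    del index[v]
    return affected
```

Building the index costs O(|Tri(G)|) space, and each removal touches only the triangles and edges incident to the removed vertex, matching the amortised bound above.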
4.7 Discussions
4.7.1 Finding Large and Connected CC
Algorithms 4, 6 and 9 are primarily designed for finding the community with the highest context weighted density. Since adding a size constraint makes Problem 4.2.1 NP-hard (so no polynomial-time algorithm is expected), we instead rely on heuristics for reporting a larger connected CC. Our first step is running a depth first search on the Ho found by Algorithm 4, 6 or 9 and outputting the largest connected component in Ho as the CC. Then, for the flow network Algorithms 4 and 6, we bias towards reporting a larger Ho in every iteration. We achieve this by first ranking the min-cuts found by the preflow algorithm and then selecting the min s-t cut partition containing the largest connected component.
Ranking min-cuts. As discussed in Section 4.5.1, the push-relabel algorithm can generate two types of min s-t cut once the maximum flow has been computed. The first type maximises the size of S, while the second type maximises the size of T. We denote the first type of cut as (SM, T) and the second type as (S, TM). We assign them scores as follows. Let XSM be the vertex nodes in SM; the score of SM is defined as the size of the largest connected subgraph of G(XSM), denoted by cs(SM). Scores for T, S, and TM are defined in the same way.
Heuristics. Greedily, for Algorithm 4, when finding a min s-t cut after computing the maximum flow in every iteration, we always select the cut having S′ = argmax_{S′} {cs(S′) | S′ ∈ {SM, S}}. Similarly, for Algorithm 6, when finding a min s-t cut after computing the maximum flow, we always select the cut having T′ = argmax_{T′} {cs(T′) | T′ ∈ {TM, T}}.
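The score cs(·) is simply the size of the largest connected component of the subgraph induced by the cut's vertex nodes, computable with a plain BFS (an illustrative sketch; mapping cut partitions to vertex nodes is assumed done elsewhere):

```python
from collections import deque

def cs(adj, nodes):
    """Size of the largest connected component of the subgraph
    of `adj` induced by `nodes`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for s in nodes:
        if s in seen:
            continue
        comp, q = 0, deque([s])
        seen.add(s)
        while q:                       # BFS one component
            u = q.popleft()
            comp += 1
            for v in adj.get(u, ()):
                if v in nodes and v not in seen:
                    seen.add(v)
                    q.append(v)
        best = max(best, comp)
    return best
```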
4.7.2 State-of-the-art Maximum Flow Algorithms
The essential part of finding the exact CC is solving a series of maximum flow problems. Finding a maximum flow has been studied extensively. Its time complexity has been improved from unbounded [82] to O(|V(N)||E(N)|) for any graph, and to O(|V(N)|²/log(|V(N)|)) when N satisfies |E(N)| = O(|V(N)|) [79].
Our implementation. We use the FIFO preflow-based algorithm with the heuristics discussed in [21] as the subroutine for CC search in Algorithms 4 and 6. The main reasons are: first, as mentioned in [21], this approach is practically faster than other state-of-the-art algorithms with similar runtime complexity; second, it is relatively easy to implement.
Implied time complexities. Algorithm 4 uses a maximum flow algorithm as a black box; therefore, the best known algorithm for finding a maximum flow can be applied directly. As such, the time complexity of Algorithm 4 can be as fast as O(|V(N)||E(N)| log(|V(G)|)), since the constructed flow network cannot guarantee that |E(N)| = O(|V(N)|). Algorithm 6 can be implemented as O(|V(N)|) times the cost of one maximum flow computation, and its constructed flow network satisfies |E(N)| = O(|V(N)|), so it can run as fast as O(|V(N)|³/log(|V(N)|)). Algorithm 8 relies on the preflow algorithm, and since |E(N)| = O(|V(N)|) is satisfied, the best known algorithm makes its time complexity O(|V(N)||E(N)| log |V(N)|).
4.8 Experimental Results
In this section, we conduct comprehensive experiments to verify the effectiveness and efficiency of the proposed contextual community model and algorithms. All the algorithms are implemented in Java and run on a Mac with an Intel Xeon (3.8 GHz) CPU and 128GB main memory.
4.8.1 Experimental Setup
Datasets. We use a list of real-life and synthetic datasets in the experiments, as shown in Table 4.1. These selected datasets are often used to evaluate existing methods addressing the community search, attributed community search, and keyword search problems. Facebook is an attributed network dataset with ground-truth communities. The second group of datasets, i.e., DBLP1, DBpedia and DBLP2, are attributed network datasets without ground-truth communities, which have
Table 4.1: Statistic information of datasets

Dataset      #vertices  #edges  #attributes  #avg. triangles (|A(v)| > 3)
Facebook     1.9K       8.9K    3064         43
DBLP1        977K       3.5M    34213        121
DBpedia      8M         72M     45328        79
DBLP2        1.5M       2.4M    13064        32
Amazon       335K       926K    1674         19
DBLP3        317K       1M      1584         133
Youtube      1.1M       3M      5327         42
LiveJournal  4M         35M     11104        213
Orkut        3.1M       117M    9926         55
Gowalla      197K       951K    3890         29
Brightkite   58K        214K    2143         43
Foursquare   5M         28M     4970         16
Weibo        1M         32M     1976         11
Twitter      554K       2M      2252         16
been used in previous works [35, 58].
We also generate two types of synthetic attributed network datasets. The datasets Amazon, DBLP3, Youtube, LiveJournal, and Orkut have ground-truth communities but do not include attribute information on the vertices. To enrich the attribute information of these network datasets, we first generate a keyword pool and then randomly select 7 keywords for each community from the pool. After that, we assign each vertex 0-7 of the keywords relating to the community containing the vertex. Among the community members, 80% must contain at least one of the selected keywords. This synthetic data generation is similar to that in [53]. In addition, we generate the second type of synthetic datasets from spatial social network data. The datasets Gowalla, Brightkite, Foursquare, Weibo, and Twitter have the structure information and the location information of the users. To enrich the attribute information of the users, we retrieve the location-sensitive words for each location by calling the Google Maps API and assign these words to the corresponding users based on their locations.
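The first type of attribute generation can be sketched as follows (function name and parameters are ours; the 7-keyword budget per community, the 0-7 keywords per vertex, and the 80% coverage constraint follow the text):

```python
import random

def attach_attributes(communities, pool, per_comm=7, max_per_vertex=7,
                      coverage=0.8, rng=None):
    """Give each community `per_comm` keywords from `pool`, then assign
    every member 0..max_per_vertex of them; at least a `coverage`
    fraction of members must keep one or more community keywords."""
    rng = rng or random.Random(42)
    attrs = {}
    for members in communities:
        kws = rng.sample(pool, per_comm)
        members = list(members)
        # members forced to carry at least one community keyword
        must = set(rng.sample(members, int(len(members) * coverage)))
        for v in members:
            k = rng.randint(1 if v in must else 0, max_per_vertex)
            attrs[v] = set(rng.sample(kws, k))
    return attrs
```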
Compared models. To verify the performance of our proposed algorithms and community model, we also implemented variants of our model and other state-of-the-art methods. All the compared models and algorithms are listed
Table 4.2: Implemented algorithms for different community models

Model  Algorithm  Detail
CC     BinCC      Binary search based algorithm for searching CC
CC     MonoCC     Algorithm 8 for searching CC
ACC    ApxCC      Algorithm 9 for searching approximate CC
ECC    MonoCC     Algorithm 8 for searching the maximum contextual degree density community
TCC    MonoCC     Algorithm 8 for searching the maximum contextual triangle density community
ATC    ATC        LocATC [53] for searching attributed truss
in Table 4.2.
Context-weighted degree density (ECC). This model is a variant of the contextual community model, which uses contextual weighted edges as the graph motif to identify the most cohesive community Ho satisfying Ho = argmax_H {∑_{e∈E(H)} w(e) / |V(H)| | H ⊆ G}.

Context-weighted triangle density (TCC). Only contextual weighted triangles are used as the graph motif to find the most cohesive community Ho satisfying Ho = argmax_H {∑_{△∈Tri(H)} w(△) / |V(H)| | H ⊆ G}.
Attributed truss (ATC in [53]). We also implemented the best algorithm proposed in [53], denoted by ATC in this work, to search for the attributed truss. Given a set of query vertices, a set of query attributes, and two integers k and d, ATC finds the (k, d)-truss satisfying two constraints: (1) every edge in the truss has at least (k − 2) common neighbours; and (2) the longest shortest path in the (k, d)-truss is no greater than d.
Test queries. For each dataset, we randomly pick and test 200 keyword queries in each experiment. Each keyword query Q may contain 1-7 terms; by default, a keyword query contains four terms. We report the average performance over the 200 keyword queries as the result in the experimental evaluation. To increase the quality of the selected keyword queries, we select the query terms from the attributed vertices appearing in the ground-truth communities if the dataset has ground-truth information. For the other datasets, we randomly select the query terms and generate the test keyword queries.
In addition, we also generate 200 sets of test queries by following the similar
query generation in [53] where the size of the query vertices and the size of the query
attributes are set to 2.
Metrics. We use three evaluation metrics to verify the effectiveness of the proposed
[Figure 4-3: F1 scores for Facebook — comparing CC, ECC, TCC, ATC, and ACC on the ground-truth communities f_1 to f_10]
contextual community model by simulating different query scenarios.
F1 score. We use the F1 score metric to evaluate the quality of the resulting communities if the dataset has ground-truth communities. Given a resulting community Ho and a targeted ground truth community Hg, the F1 score is defined as

    F1(Ho, Hg) = 2 × prec(Ho, Hg) × recall(Ho, Hg) / (prec(Ho, Hg) + recall(Ho, Hg)),

where prec(Ho, Hg) is the average of precV(Ho, Hg) and precE(Ho, Hg), while recall(Ho, Hg) is the average of recallV(Ho, Hg) and recallE(Ho, Hg), defined as follows. The precision for vertices (edges) is precV(Ho, Hg) = |V(Ho) ∩ V(Hg)| / |V(Ho)| (precE(Ho, Hg) = |E(Ho) ∩ E(Hg)| / |E(Ho)|). The recall for vertices (edges) is recallV(Ho, Hg) = |V(Ho) ∩ V(Hg)| / |V(Hg)| (recallE(Ho, Hg) = |E(Ho) ∩ E(Hg)| / |E(Hg)|).
Edge density and Jaccard similarity. For the other datasets without ground truth community information, we measure the average edge density and Jaccard similarity. The edge density measures the structural cohesiveness of a community H, defined by ed(H) = |E(H)| / |V(H)|². The Jaccard similarity reflects the attribute cohesiveness of pairwise vertices from the perspective of the query context Q, defined as J(u, v) = |Q ∩ A(u) ∩ A(v)| / |A(u) ∩ A(v)|.
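Spelled out as code, the F1 computation reads as follows (a direct transcription of the formulas above; inputs are plain vertex and edge sets):

```python
def f1_score(Ho_V, Ho_E, Hg_V, Hg_E):
    """F1 over the averaged vertex/edge precision and recall."""
    def ratio(other, denom):
        # |other ∩ denom| / |denom|, guarding the empty case
        return len(other & denom) / len(denom) if denom else 0.0
    prec = (ratio(Hg_V, Ho_V) + ratio(Hg_E, Ho_E)) / 2   # intersections over Ho
    rec = (ratio(Ho_V, Hg_V) + ratio(Ho_E, Hg_E)) / 2    # intersections over Hg
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```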
4.8.2 Effectiveness Evaluation
4.8.2.1 Dataset with both attributes and ground truth communities
The Facebook dataset has 10 ground-truth communities and their attribute information. Figure 4-3 shows the F1 scores for this dataset. From the
[Figure: (a) F1 scores on ground-truth networks (Amazon, DBLP, Youtube, LJ, Orkut); (b) Edge density (Gowalla, Brightkite, Weibo, Twitter, Foursquare); (c) Average spatial distance in km (Gowalla, Brightkite, Weibo, Twitter, Foursquare) — comparing CC, ECC, TCC, ATC, ACC]
Figure 4-4: Effectiveness evaluation
results, we can see that the communities found by the CC method are clearly superior to those of all the other models. For the ground truth communities f_4 and f_8, CC finds communities with about 90% accuracy, close to the ground-truth communities. The F1 scores of the communities returned by the ACC method are slightly lower than those of the communities found by CC. The reason is that ACC finds approximate CCs with lower context weighted density than the exact CC.

Based on the reported F1 scores, ACC performs better than ATC except for f_6. The major reasons for the lower performance of ATC are: (1) ATC finds approximate communities for its model without a theoretical guarantee; and (2) the structural cohesiveness measure of ATC may discard many edges that are contained in the ground truth communities but do not satisfy the minimum number of common neighbours constraint. ECC has the lowest F1 scores in this experiment, because ECC tends to find large communities, which may result in much lower precision. Similarly, TCC cannot find communities that are relevant to the ground-truth communities. This is because the TCC model tends to find communities with high precision, which may lead to the small size
of the resulting communities. Small-sized communities often score lower on recall; hence, the TCC method has low F1 scores.
The reasons that our proposed CC search can find near ground truth communities are as follows. Firstly, in real datasets, ground truth communities are structurally dense but do not conform to strict constraints such as k-truss and k-core; our edge and triangle density based model can capture this characteristic. Secondly, ground truth communities in real datasets mostly have members sharing common interests, and our proposed contextual score can capture such common interest features if the query context correctly reflects them. The query contexts tested in this subsection are randomly selected from attributes contained in the ground truth communities. As such, the results are highly promising.
4.8.2.2 Dataset with the ground-truth communities only
Figure 4-4 (a) shows the experimental results when we run the queries on five datasets, i.e., Amazon, DBLP, Youtube, LiveJournal (LJ) and Orkut. Our proposed methods can find the communities closest to the ground-truth communities. The average F1 score of CC is over 90%. Although the average F1 score of ACC is about 80%, it still outperforms ATC on all datasets except DBLP. The main reasons for the superiority of our methods are similar to the explanation in Section 4.8.2.1.
4.8.2.3 Datasets with the spatial attributes
We also evaluate the edge density and average pairwise spatial distance of communities found by the different methods on five generated datasets, i.e., Gowalla, Brightkite, Foursquare, Weibo, and Twitter. The experimental results are shown in Figure 4-4 (b) and (c). A spatial community is considered desirable in many applications if it has high structural cohesiveness and low average pairwise spatial distance among the community members. Based on the edge density metric, TCC can find the communities with the highest edge density for all datasets. The edge
[Figure: (a) Edge density; (b) Aggregate Jaccard similarity — comparing CC, ECC, TCC, ATC, ACC on DBLP1, DBpedia, DBLP2]
Figure 4-5: Attributed networks with no ground-truth
[Figure: sensitivity to |Q| from 0 to 7 — (a) F1 score (Amazon, DBLP, Youtube, LJ, Orkut); (b) Edge density (DBLP1, DBpedia, DBLP2); (c) Agg. Jac. similarity (DBLP1, DBpedia, DBLP2); (d) Avg. spatial distance in km (Gowalla, Brightkite, Foursquare, Weibo, Twitter)]
Figure 4-6: Sensitivity w.r.t. query attribute size
[Figure: running time (s) vs. percentage of vertices for BinCC, MonoCC, ApxCC, ATC — (a) Facebook; (b) Amazon; (c) DBLP; (d) Youtube; (e) LiveJournal; (f) Orkut]
Figure 4-7: Scalability
density can be over 0.25 on average and reaches 0.33 for Gowalla. CC can also find communities with relatively high edge density; especially for Brightkite and Foursquare, the edge density of the found communities is as high as that of the communities found by TCC. Moreover, ACC performs almost the same as CC, which implies the approximate CCs found by ApxCC are almost as dense as the communities found by CC. In detail, TCC has the shortest average pairwise spatial distance for all datasets, i.e., less than 2 km on average. CC can find communities with short average pairwise spatial distance as well, especially for Weibo; the average pairwise spatial distance of the communities found by CC is similar to that of TCC. Figure 4-4 (b) and (c) also show that the communities found by CC and ACC have higher edge density and shorter average pairwise spatial distance than the other methods, i.e., they can find good spatial communities with high cohesiveness and short average pairwise distance.
4.8.2.4 Datasets with the real attributes
We also evaluate the proposed model using the datasets with real attributes, i.e., DBLP1, DBpedia and DBLP2. Figure 4-5 (a) and (b) illustrate the evaluation of the different methods using the edge density and aggregate Jaccard similarity metrics over different queries. Resulting communities with high edge density and high aggregate Jaccard similarity are treated as ideal results, because such communities are more cohesive in both structure and context, and larger in size.
From the results, we can see that TCC can find the communities with the highest edge density for all datasets. The edge density can reach 0.45 on average and 0.5 for DBLP1, but the aggregate Jaccard similarity is low due to the small size of the returned communities. Another observation is that ECC finds the communities with the highest aggregate Jaccard similarity but the lowest edge density among all methods. This indicates that the communities of ECC may be larger and more relevant w.r.t. the query but structurally very sparse. Furthermore, we can see that CC and
[Figure: running time (s) vs. percentage of vertices for BinCC, MonoCC, ApxCC, ATC — (a) DBpedia; (b) Gowalla; (c) Brightkite; (d) Foursquare; (e) Weibo; (f) Twitter]
Figure 4-8: Scalability cont.
[Figure: running time (s, log scale) vs. |Q| from 1 to 7 for BinCC, MonoCC, ApxCC — (a) Facebook; (b) Amazon; (c) DBLP; (d) Youtube; (e) LiveJournal; (f) Orkut]
Figure 4-9: Effect of |Q|
[Figure: running time (s, log scale) vs. |Q| from 1 to 7 for BinCC, MonoCC, ApxCC — (a) DBpedia; (b) Gowalla; (c) Brightkite; (d) Foursquare; (e) Weibo; (f) Twitter]
Figure 4-10: Effect of |Q| cont.
ACC achieve a balance of high edge density and high aggregate Jaccard similarity. Therefore, CC and ACC are ideal methods for retrieving cohesive and relevant communities with balanced benefit. We also find that the communities of ATC have lower edge density and lower aggregate Jaccard similarity compared with CC and ACC. Therefore, we conclude that the CC and ACC methods are superior to the other methods.
4.8.2.5 Varying Query Sizes
Figure 4-6 (a) reports the F1 score sensitivity w.r.t. the size of the query context for the CC method. For all datasets, the F1 score first grows as |Q| increases and then decreases after a certain threshold. The intuition is that we can find near ground-truth communities if the query context correctly describes the desired communities: a small Q cannot precisely describe the ground-truth communities, while a large Q induces noise disturbing the search. The experiment shows that the CC method finds results close to the ground-truth communities in all datasets when the query context includes 4 to 5 attribute keywords.
Figure 4-6 (b) reports the edge density sensitivity of the CC method when we vary the size of the query context. For all datasets, the edge density increases as |Q| becomes larger. When |Q| = 0, we consider the unweighted subgraph, which results in the highest edge density. Since the context constrained subgraphs are always sub-components of the unweighted subgraph, they become close to the unweighted subgraph as |Q| increases, which aligns with the observed trend. A similar trend can be observed for the CC method in terms of aggregate Jaccard similarity when the query size varies, as shown in Figure 4-6 (c). Figure 4-6 (d) illustrates the changes in the average pairwise spatial distance for the CC method when the size of Q varies. In this experiment, we are only interested in resulting communities whose members' average pairwise spatial distance is no more than 5 km; otherwise, the result is considered infinity. Since we randomly generate the query contexts and different contexts represent different locations, CC can only find low average
pairwise spatial distance for queries with an individual context. The reason is that, when the query context covers more than one location, CC includes users from different regions. As such, CC finds communities with an average pairwise spatial distance of around 2 km for all datasets if the application context describes a suburb.
4.8.3 Efficiency Evaluation
Scalability. Figures 4-7 (a) to (f) and Figures 4-8 (a) to (f) show the scalability evaluations on different datasets. For all datasets, our proposed methods MonoCC and ApxCC scale well as the data size increases. In particular, ApxCC can answer queries within several seconds for all datasets and is faster than ATC on all datasets. Note that ATC is a greedy algorithm with no effectiveness guarantee. MonoCC can find the exact CC within a few tens of seconds on most datasets. For all our proposed methods BinCC, MonoCC and ApxCC, the experiments match the discussed time complexities.
Varying |Q|. Figures 4-9 (a) to (f) and Figures 4-10 (a) to (f) show the running times as |Q| varies on different datasets. MonoCC outperforms BinCC by 4 to 7 times on average for all datasets, which demonstrates the power of the parametric algorithm. ApxCC performs much better, and its running time grows almost linearly as |Q| increases for all datasets. For MonoCC and BinCC, the running time increases significantly as |Q| becomes greater for all non-spatial attributed datasets, i.e., Figures 4-9 (a) to (f) and Figure 4-10 (a). However, for the spatial attributed datasets, their running time increases almost linearly with |Q|, as shown in Figures 4-10 (b) to (f). The reason is that in spatial attributed networks, vertices that are spatially close tend to be structurally close as well, and vice versa. As the query contexts for these networks describe suburbs, the algorithms tend to search communities in disjoint sub-networks as |Q| increases, which results in near-linear time w.r.t. |Q|.
4.9 Conclusion
In this chapter, we proposed a novel parameter-free community model, namely the contextual community (CC), which only requires a query to provide a set of keywords describing an application/user context. We proposed two network flow based exact algorithms that solve CC search in polynomial time and an approximation algorithm with an approximation ratio of 1/3. Our empirical studies on real social network datasets demonstrate the superior effectiveness of the CC search methods under different query contexts. Extensive performance evaluations also reveal the superb practical efficiency of the proposed CC search algorithms.
Chapter 5
Batch Keyword Query Processing
on Graph Data
Answering keyword queries on textual attributed graph data has drawn a great deal of attention from the database community. However, most graph keyword search solutions proposed so far primarily focus on a single query setting. We observe that for a popular keyword query system, the number of keyword queries received can be substantially large even in a short time interval, and the chance that these queries share common keywords is quite high. As such, answering keyword queries in batches can significantly enhance the performance of the system. Motivated by this, we study efficient batch processing of multiple keyword queries on graph data in this chapter. Realising that both finding the optimal query plan for multiple queries and finding the optimal query plan for a single keyword query on graph data are computationally hard, we first propose two heuristic approaches, which maximise keyword overlap and give preference to processing keywords with short lists. We then devise a cardinality based cost estimation model that takes both graph data statistics and search semantics into account. Based on the model, we design an A* based algorithm to find the globally optimal execution plan for multiple queries. We evaluate the proposed model and algorithms on two real datasets, and the experimental results demonstrate their efficacy.
Chapter map. In Section 5.1, we give an overall introduction to the problem of batch keyword query processing on graph data. Section 5.2 presents preliminaries and defines the problem formally. In Section 5.3, we present two heuristic rule based approaches: a shortest list eager one and a maximal overlap driven one. In Section 5.4, we propose a cost estimation model for estimating the cardinalities of r-join operations used for evaluating r-cliques. Based on this estimation model, we then discuss how to find the cost-based optimal query plan for multiple queries efficiently in Section 5.5. We show the experimental results in Section 5.6. Finally, we conclude this chapter in Section 5.7.
5.1 Introduction
We study the problem of batch processing of keyword queries on graph data. Our goal is to process a set of keyword queries as a batch while minimising the total time cost of answering the set of queries. Batch query processing (also known as multiple-query optimization) is a classical problem in the database community. In relational databases, Sellis et al. [91] studied multiple SQL query optimization. The key idea is to decompose SQL queries into subqueries and guarantee that each SQL query in the batch can be answered by combining a subset of subqueries. However, maintaining the intermediate results of all possible subqueries is challenging and leads to expensive space cost and extra I/O cost. To address this, Roy et al. [84] evaluated the tradeoff between reuse and recomputation of intermediate results for subqueries by comparing pipelining cost and reuse cost.
In addition, Jacob and Ives [55] addressed the problem of processing interactive keyword queries as a batch in relational databases. In their work, the keyword search semantics is defined by candidate networks [49], which requires knowing the relational data schema in advance. Batch query processing has also been studied in other contexts, e.g., spatial-textual queries [24], RDF/SPARQL [63], and XQuery [11].
After investigating batch query processing in different contexts and single keyword query processing in graph databases, we observe that none of the existing techniques can be applied to our problem, batch keyword query processing on graph data. The main reasons are the following. (1) Meaningful Result Semantics: the r-clique can well define the semantics of keyword search on graph data, as it can be used to discover the tightest relations among all the given keywords in a query [58], but no existing work studies batch query processing with this meaningful result semantics. (2) Complexity of the Optimal Batch Processing: it is an NP-complete problem to optimally process multiple keyword queries in a batch; each single query corresponds to several query plans, and we obviously cannot enumerate all possible combinations of single query plans to get the optimal query plan for multiple queries. (3) Unavailable Query Historic Information: unlike the batch query processing in [107], we do not assume that the result sizes of subqueries are known before the queries are actually run, because this kind of historic information is not always available.
Although we could simply evaluate the batch queries in a pre-defined order and re-use the intermediate results in later rounds as much as possible, there is no guarantee that the batch queries are run optimally. To address this, we first develop two heuristic approaches, which give preference to processing keywords with short lists and maximise keyword overlaps. We then devise a cardinality estimation cost model that considers graph connectivity and the result semantics of r-cliques. Based on the cost model, we develop an optimal batch query plan by extending the A* search algorithm. Since A* search in the worst case degenerates to exhaustive search, enumerating all possible global plans, we propose pruning methods that efficiently prune the search space to obtain the model-based optimal query plan.
5.2 Preliminaries and Problem Definitions
In this section, we introduce preliminaries and define the problem to be addressed
in this chapter.
5.2.1 Keyword Query on Graph Data
Native graph data. The native graph data G(V,E) consists of a vertex set V(G)
and an edge set E(G). A vertex v ∈ V(G) may contain some text, denoted as
v.KS = {v.key1, . . . , v.keyz}. We call a vertex that contains text a content vertex.
An edge e ∈ E(G) is a pair of vertices (vi, vj) (vi, vj ∈ V). The shortest distance
between any two vertices vi and vj, denoted dist(vi, vj), is the number of edges in the
shortest path between them.
Query processing for a single keyword query. Given a keyword query q =
{k1, . . . , km} on a graph G, the answer to q is a set of subgraphs of G, each generated
from an r-clique [58] of G, i.e., a set of vertices whose texts together match all
keywords in q and in which the distance between any two vertices is at most r. Given a
query q on G, we can obtain a set of r-cliques, denoted RC(q,G). For example,
Figure 5-1 shows a subgraph G′ of native graph data G. Given a query
q1 = {k1, k2, k3, k4} with r = 1, an answer to q1 is the thick vertex set in Figure 5-1,
in which the vertex set {v7, v8, v10} is a 1-clique.
Figure 5-2(a) shows a query plan for q1. A query plan is an operation tree containing
two types of operations: a selection operation σki(G), which selects the vertices of G
whose text matches keyword ki, and an r-join operation ⋈R, which joins two r-clique
sets of G. There can be many query plans that generate the final r-clique set,
depending on the processing order of the r-joins. To simplify the presentation, we use
Figure 5-2(b) to express the query plan shown in Figure 5-2(a).
To perform a selection σki(G) efficiently, we build an inverted list of
vertices for each keyword contained in graph G, so that the cost of a selection is O(1).
Therefore, the main cost of a query plan is determined by the costs of its r-join operations.
5.2.2 Batched Multiple-Keyword Queries
Consider a batch of keyword queries Q = {q1, . . . , qn} on a native graph G; the task
is to return the answers to every query qi ∈ Q.
[Figure 5-1 shows the native graph G with a highlighted subgraph G′ containing the
content vertices v1:k2, v2:k1,k4, v3:k2, v4:k1, v5:k1, v6:k1, v7:k1,k4, v8:k2, v9:k2,
and v10:k3.]
Figure 5-1: An example graph G and the answer subgraphs to q1 in the subgraph G′
[Figure 5-2 depicts four operation trees: (a) p1(q1), built from the selections σk1(G),
σk2(G), σk3(G), and σk4(G) combined by three r-joins ⋈R; (b) p1(q1) in simplified
keyword notation, joining k1, k2, k3, k4 via the intermediate results k1k2, k1k2k3,
and k1k2k3k4; (c) p2(q2), joining k5, k3, k1, k4 via k3k5, k1k3k5, and k1k3k4k5;
(d) p3({q1, q2}), a shared plan over k1, k2, k3, k4, k5 with intermediate results k1k3
and k1k3k4, from which both k1k3k4k5 and k1k2k3k4 are derived.]
Figure 5-2: Query plans for single queries q1, q2, and batch multiple queries {q1, q2}
A naive way to answer batched keyword queries is to run the queries one by
one. For example, we can run the query plans p1(q1) and p2(q2) shown in Figure 5-
2(b) and Figure 5-2(c) one after another. Obviously this is inefficient. Ideally, we hope to
share some (intermediate) results of processed queries to avoid duplicate computation.
For example, Figure 5-2(d) shows a query plan for the batch query {q1, q2}, where
q2 = {k1, k3, k4, k5}, in which the intermediate results of the r-joins σk1(G) ⋈R σk3(G) and
(σk1(G) ⋈R σk3(G)) ⋈R σk4(G) are shared by their upper-level r-join operations.
The cost of this query plan p3({q1, q2}) is the sum of the costs of all
r-join operations in the plan.
Problem 5.2.1. Given a batch of keyword queries Q = {q1, . . . , qn} on a native graph
G, our aim is to construct a query plan p(Q)opt for all queries in Q such that p(Q)opt
requires minimum cost. This is a typical NP-complete problem [90].
Finding the optimal query plan is non-trivial due to the following reasons.
• A single query corresponds to several query plans, and obviously we do not want
to enumerate combinations of query plans for multiple queries to get the optimal
one; and
• Let RC(K1 ∪ K2, G) = RC(K1, G) ⋈R RC(K2, G). The size of RC(K1 ∪ K2, G) is
not proportional to the sizes of RC(K1, G) and RC(K2, G); therefore, it is not
easy to predict the size of RC(K1 ∪ K2, G).
5.3 Heuristic-based Approaches
We propose two heuristic approaches that target a "good" query plan for answering
the queries in the batch Q.
5.3.1 A Shortest List Eager Approach
We first propose an approach, Basic, whose main idea is to process every query in
the batch Q = {q1, . . . , qn} in turn; for each query qi ∈ Q it starts from the shortest
inverted list and eagerly joins with existing intermediate results if they exist.

Rule 5.3.1. Given the inverted lists of two keywords ki and kj, RC({ki}, G) takes
precedence in r-joining with the existing intermediate results if the list of ki is
shorter than that of kj.
Algorithm 10 shows the details of Basic, which avoids processing
keywords that have already been processed. In each iteration, it checks whether the
keywords of the current query qi have been processed. For the processed keywords,
it reuses the intermediate results of the maximal set of processed keywords; for the
unprocessed keywords, it performs an r-join ⋈R between the processed intermediate
results and the RC({k}, G) with the smallest size.
Algorithm 10: Basic
Data: A graph G, queries Q = {q1, . . . , qn}
Result: R = {RC(q1, G), . . . , RC(qn, G)}
 1  Load index H of inverted lists of vertices for keywords;
 2  R ← ∅;
 3  for i from 1 to n do
 4      RC(qi, G) ← ∅;
 5      Processed keywords Kp ← ∅;
 6      foreach keyword k in qi do
 7          if k is processed in previous queries then
 8              Kp ← Kp ∪ {k};
 9          else
10              RC({k}, G) ← Hash vertices in the inverted list of keyword k;
11      Key ← Kp;
        // compute processed keywords
12      repeat
13          Find the maximal set of processed keywords Kmax;
14          RC(qi, G) ← RC(qi, G) ⋈R RC(Kmax, G);
15          Kp ← Kp − Kmax;
16      until Kp is empty;
        // compute unprocessed keywords
17      Rank all remaining RC({k}, G) by size in ascending order (k′1, . . . , k′m);
18      foreach remaining keyword k in qi do
19          RC(Key, G) ← RC(Key, G) ⋈R RC({k}, G);
20          Key ← Key ∪ {k};
21      RC(qi, G) ← RC(Key, G);
22  return R;
Clearly, the algorithm Basic is better than the naive approach, which simply
processes the queries one after another without reusing processed intermediate
results.
5.3.2 A Maximal Overlapping Driven Approach
The algorithm Basic does not make full use of the shared (overlapping) keywords
among the queries in the batch Q. Therefore, in this section we propose a new
approach based on the observation that more keywords often imply more processing
Algorithm 11: Overlap
Data: Q = {q1, . . . , qn}
Result: R = {RC(q1, G), . . . , RC(qn, G)}
 1  Algorithm Overlap()
 2      Calculate sharing factors in Q;
 3      repeat
 4          Calculate frequencies of unprocessed sharing factors;
 5          Choose the precedent sharing factor sf in Q with maximal |sf| · freq(sf) according to Rule 5.3.2;
 6          RC(sf, G) ← Cal(sf);
 7          Remove the subtree rooted at sf;
 8          Insert sf into a heap H;
 9          while H is not empty do
10              Pop the first factor from H into sf;
11              foreach factor s ⊃ sf with |s| − |sf| = 1 do
12                  RC(s, G) ← RC(sf, G) ⋈R RC(s\sf, G);
13                  if s is a query q in Q then
14                      Q ← Q\{q};
15                  else
16                      Insert s into H;
17              Remove sf;
18      until Q is empty;
19      return R;

20  Procedure Cal(sf)
21      RC(sf, G) ← ∅;
22      if sf contains sub-sharing factors SFc then
23          Choose the precedent sharing factor sfc among SFc with maximal |sfc| · freq(sfc);
24          RC(sf, G) ← Cal(sfc);
25      Let k′1, . . . , k′v be the keywords in sf − sfc, with inverted lists ranked in ascending order;
26      foreach keyword k′i do
27          RC(sf, G) ← RC(sf, G) ⋈R RC(k′i, G);
28      return RC(sf, G);
[Figure 5-3(a) depicts a query plan tree whose top row lists the seven queries
k1k2k3k4, k1k3k4k5, k2k3k4k6, k1k3k4k7, k3k4k6k7, k3k4k6k8, and k3k4k7k9, whose
middle row lists the sharing factors k2k3k4, k1k3k4, k3k4k6, and k3k4k7, and whose
bottom node is k3k4.]
(a) A query plan

Iteration 1: freq({k2, k3, k4}) = 2, freq({k1, k3, k4}) = 3, freq({k3, k4, k6}) = 3, freq({k3, k4, k7}) = 3, freq({k3, k4}) = 4
Iteration 2: freq({k3, k4, k6}) = 3, freq({k3, k4, k7}) = 2, freq({k3, k4}) = 2
Iteration 3: ∅
(b) Sharing factors and their frequencies

Figure 5-3: An example of processes in the algorithm Overlap
cost, and as a result, processing more frequently shared keywords first will benefit
more queries. Before continuing, we first define the notion of a sharing factor.
Definition 5.3.1. Sharing factor. Given a batch query Q = {q1, . . . , qn}, for any
two queries qi, qj ∈ Q (i ≠ j), the intersection of qi and qj expresses their
overlapping keywords and is called the sharing factor of qi and qj, denoted SF(qi, qj) =
qi ∩ qj.
Rule 5.3.2. Given a batch Q, let S be the set of sharing factors w.r.t. Q. For any
two sharing factors SFi and SFj in S, RC(SFi, G) takes precedence over RC(SFj, G)
if |SFi| · freq(SFi) > |SFj| · freq(SFj), where freq(SF) is the frequency of SF in
Q.
Algorithm 11 shows the algorithm Overlap, which is based on Rule 5.3.2. Given a batch of
queries Q, Overlap first calculates all sharing factors among the queries in
Q. It chooses a sharing factor sf with maximal |sf| · freq(sf) according to Rule 5.3.2
(line 5). Then, in line 6, it calculates the intermediate result RC(sf, G) by invoking
Cal(sf), which recursively processes sharing factors whose keywords are subsets of
the keywords in sf (lines 20–28). This intermediate result RC(sf, G) can benefit
all factors whose keywords are supersets of the keywords of sf, and this benefit
propagates to the queries in Q. Therefore, Overlap pushes all sharing factors that can
benefit from computing sf into a heap (line 8). Finally, it calculates the r-cliques of
the benefiting queries (lines 9–17) and removes these processed queries (line 14). The
algorithm repeats the above process until all queries have been processed.
Figure 5-3 shows an example illustrating the algorithm Overlap. Given a keyword
query batch Q containing the queries q1 = {k1, k2, k3, k4}, q2 = {k1, k3, k4, k5},
q3 = {k2, k3, k4, k6}, q4 = {k1, k3, k4, k7}, q5 = {k3, k4, k6, k7}, q6 = {k3, k4, k6, k8}, and
q7 = {k3, k4, k7, k9}, the algorithm Overlap first calculates the sharing factors shown
in Figure 5-3(b).

In the first iteration, the sharing factors {k1, k3, k4}, {k3, k4, k6}, and {k3, k4, k7}
all have the largest size-frequency product, 9 (see the first row in Figure 5-3(b)).
Without loss of generality, suppose {k1, k3, k4} is selected to be evaluated first. The
algorithm invokes Cal to calculate RC({k1, k3, k4}, G) as follows: since this calculation
may require intermediate results of other sharing factors whose keywords are subsets
of {k1, k3, k4}, Cal recursively finds the most promising sub-sharing factor of
{k1, k3, k4}, i.e., {k3, k4} in this case (it is also the only sub-sharing factor). After
{k3, k4} is processed, control returns to the previous recursion level, where {k1, k3, k4}
is processed using the result RC({k3, k4}, G). Finally, Cal returns RC({k1, k3, k4}, G).
The algorithm then processes all queries that can benefit from RC({k1, k3, k4}, G)
by pushing the sharing factors that are supersets of {k1, k3, k4} into a heap, i.e.,
{k1, k2, k3, k4}, {k1, k3, k4, k5}, and {k1, k3, k4, k7}. It then processes each sharing
factor s in the heap, calculating RC(s, G) and pushing its supersets into the heap
until a superset is an original query of the batch. After the first iteration, queries
q1, q2, and q4 have been processed. In the second iteration, only four queries are
left: q3, q5, q6, and q7. Their sharing factors and corresponding frequencies are shown
in the second row of Figure 5-3(b). The algorithm chooses {k3, k4, k6}, based on which
q3, q5, and q6 can be answered. Finally, in the last iteration, since only q7 is left, the
algorithm chooses {k3, k4, k7} to support query q7. The blue lines in Figure 5-3(a)
show the final query plan produced by the algorithm Overlap.
5.4 Cost Estimation for Query Plans
The maximal overlapping driven approach tries to maximise the sharing of
intermediate results. However, this does not mean that the overall cost of the query plan is
optimal. Therefore, we provide a cost-based solution to support multiple keyword
queries on graphs, which mainly contains two parts: (i) estimating the cost of a
query plan, and (ii) generating a globally optimal plan based on the estimated cost.

In this section, we propose a cost model to estimate the cost of a query plan. The
cost of a query plan is determined by the costs of the involved r-join operations.
Therefore, in Section 5.4.1 we analyse the cost of an r-join, and in Section 5.4.2 we
estimate the cardinality of the intermediate result of an r-join between two r-clique
sets. Finally, we present our cost estimation model for a query plan.
Algorithm 12: rJoin
Data: Two r-clique sets RC(Ki, G) and RC(Kj, G)
Result: RC(Ki ∪ Kj, G) ← RC(Ki, G) ⋈R RC(Kj, G)
 1  RC(Ki ∪ Kj, G) ← ∅;
 2  foreach rc′ ∈ RC(Ki, G) do
 3      foreach rc′′ ∈ RC(Kj, G) do
 4          FLAG ← true;
 5          foreach v ∈ rc′ do
 6              if dist(v, v′) > r for any v′ ∈ rc′′ then
 7                  FLAG ← false; break;
 8          if FLAG = true then
 9              RC(Ki ∪ Kj, G) ← RC(Ki ∪ Kj, G) ∪ {rc′ ∪ rc′′};
10  return RC(Ki ∪ Kj, G);
5.4.1 Cost of an r-Join
To estimate the size of an r-join between two r-clique sets, we first illustrate our
implementation of the r-join operation ⋈R.

Given two keyword sets Ki and Kj, let RC(Ki, G) and RC(Kj, G) be their
respective r-clique sets. Algorithm 12 shows the implementation of an r-join operation
between RC(Ki, G) and RC(Kj, G). For any r-clique pair ⟨rc′, rc′′⟩ (rc′ ∈ RC(Ki, G)
and rc′′ ∈ RC(Kj, G)), the r-join operation checks the vertex pairs ⟨v, v′⟩ from
⟨rc′, rc′′⟩ and requires dist(v, v′) ≤ r. To calculate dist(v, v′) efficiently, we pre-store
all shortest paths between every two vertices in G in a shortest path set SP(G).

The cost of the r-join operation o = RC(Ki, G) ⋈R RC(Kj, G) is then

    cost(o) = O(ni × nj × |Ki| × |Kj|),    (5.1)

where ni and nj are the numbers of r-cliques in RC(Ki, G) and RC(Kj, G),
respectively. This shows that the cost of an r-join operation is determined by its inputs
RC(Ki, G) and RC(Kj, G).
5.4.2 Estimating Cardinality of an r-Join Result
We observe that, given a query q, RC(q, G) can be derived by the following recursive
pipeline of r-join operations:

    RC(q, G) = { RC({k}, G)                        if q = {k},
               { RC(q\{k}, G) ⋈R RC({k}, G)        if |q\{k}| ≥ 1.    (5.2)

If q contains only one keyword k, the size |RC(q, G)| equals the length of the
inverted list L(k). So we only need to estimate the size of RC(q, G) = RC(q\{k}, G)
⋈R RC({k}, G). RC(q, G) merges vertices from RC(q\{k}, G) and RC({k}, G) such
that for any vertices v ∈ RC(q\{k}, G) and v′ ∈ RC({k}, G), dist(v, v′) ≤ r.

According to the r-join operation, a vertex v ∈ L(k) cannot contribute to the
result RC(q, G) if dist(v, v′) > r for every v′ ∈ RC(q\{k}, G). We call such a v an
invalid vertex w.r.t. the parameter r, and we can construct a valid inverted list
L^r_v(k) in which every vertex is valid w.r.t. r. Given a graph G and the parameter r,
the valid inverted lists of keywords can easily be constructed as follows. For each
vertex v ∈ V(G) being processed, we check every unprocessed vertex v′ whose
keywords v′.KS do not overlap with the keywords of v. If dist(v, v′) > r for all such
unprocessed vertices v′, we say v is invalid, and v will not appear in any keyword list.
A valid shortest path set SP^r_v(G) for all valid vertices in G can then be loaded
before any queries are received. For each v ∈ L^r_v(k), let pr(v) be the probability
that there appears a v′ ∈ RC(q\{k}, G) such that dist(v, v′) ≤ r. Then |RC(q, G)|
can be estimated as pr(v) × |L^r_v(k)| × |RC(q\{k}, G)|.
Estimating the cardinality of an r-join between two keyword inverted
lists. We first consider the simple case where q = {k1, k2}. Given a graph G, let
L^r_v(k1) and L^r_v(k2) be the valid inverted lists of keywords k1 and k2, respectively.
Then |RC(q, G)| can be estimated as

    |RC(q, G)| = pr(v) × |L^r_v(k1)| × |L^r_v(k2)|,    (5.3)

where pr(v) = |SP^r_v(G)| / |V(G)|², |V(G)|² is the number of shortest paths in G, and
|SP^r_v(G)| is the number of shortest paths in SP^r_v(G). As explained above, the
statistic |SP^r_v(G)| can be collected offline for a given graph G.
Estimating the cardinality of an r-join between an r-clique set and the
inverted list of a keyword. We use Equation 5.4 to iteratively estimate the number
of r-cliques in RC(q, G) when q has more than two keywords:

    |RC(q, G)| = (|SP^r_v(G)| / |V(G)|²)^(|q|−1) × |RC(q\{k}, G)| × |L^r_v(k)|,    (5.4)

where |q| > 2 and (|SP^r_v(G)| / |V(G)|²)^(|q|−1) is the probability pr(v).

Let a query plan p(q) w.r.t. a query q contain a list of r-join operations. The final
cost of p(q) is cost(p(q)) = Σ_{o ∈ p(q)} cost(o), where cost(o) is the cost of the r-join
operation o ∈ p(q) (see Equation 5.1).
5.5 Estimation-based Query Plans
Based on the estimated cost of query plans, we can find a globally optimal query
plan by utilising the A* algorithm. Using A*, we assess the generated partial plans
and expand only the most promising partial plan, finding the globally optimal plan
according to the estimated cost. In Section 5.5.1 we show how to construct a search
space for the A* algorithm, and in Section 5.5.2 we propose pruning approaches that
reduce the search space and make the search more efficient.
5.5.1 Finding Optimal Solution based on Estimated Cost
In this section, we adopt the solution in [92], which is based on the A* algorithm, to
model our problem as a state space search problem.

Search space. The search space S(Q) for a query batch Q = {q1, . . . , qn} can be
expressed as S(Q) = P(q1) × . . . × P(qn), where P(qi) (1 ≤ i ≤ n) is the set
of query plans for the single keyword query qi, each of which contains a pipeline of
r-join operations. A global query plan for the batch query Q has the form
⟨p1, . . . , pn⟩, where pi ∈ P(qi).

Therefore, each state si in the search space is an n-tuple ⟨pi1, . . . , pin⟩, where pij is
either NULL or a query plan for the j-th query qj ∈ Q. The search space contains
an initial state s0 = ⟨NULL, . . . , NULL⟩ and several final states SF, where each pij in
a final state sf ∈ SF corresponds to a query plan for qj ∈ Q. The value of a state
si = ⟨p1, . . . , pn⟩ equals the sum of the costs of all query plans in si, i.e.,

    v(si) = Σ_{p ∈ si, p ≠ NULL} cost(p).

The A* algorithm starts from the initial state s0 and finds a final state sf ∈ SF such
that v(sf) is minimal among all paths leading from s0 to any final state. Obviously,
v(sf) is the total cost required for processing all n queries.
For the A* algorithm to converge quickly, a lower bound function lb(si) is
introduced on each state si. This function is used to prune the portion of the search
space that will be explored. When A* starts from si−1 and determines whether it is
worth traversing a state si, it computes the lower bound lb(si) as

    lb(si) = v(si−1) + pre_cost(si),    (5.5)

where pre_cost(si) is the minimal optimistic approximation of the cost of traversing
from si−1 to the next state si. That is, starting from si−1, the A* algorithm needs at
least pre_cost(si) cost to arrive at si, where a new query plan p′ for query qi is to be
traversed. Let p′ contain a set of r-joins; then pre_cost(si) = Σ_{o ∈ p′} cost′(o),
where cost′(o) is the minimal optimistic cost of the r-join o. For each such r-join, if
it is shared with previous query plans in si−1, no extra cost is needed to compute it,
so cost′(o) = 0; otherwise, supposing this r-join operation can be reused at most nl
times by the remaining queries from qi to qn, cost′(o) = cost(o)/nl. That is,

    cost′(o) = { 0                 if o is shared in si,
               { cost(o)/nl        otherwise,    (5.6)

where cost(o) is the estimated cost of o defined in Equation 5.1.
Based on the above analysis, we propose our algorithm EstPlan, built on the A*
algorithm, as follows. If lb(si) < v(sj) for every other state sj (j ≠ i), EstPlan
continues to traverse the states pointed to from si; otherwise it jumps to a state sj with
smaller value and continues traversing the states pointed to from sj, since the best
global plan derivable from si cannot beat that of sj. Since the search space is a tree
structure and the lower bound we use is always less than or equal to the actual cost
(assuming maximal reuse), the first global plan generated by EstPlan that has the
lowest lower bound among all expanded states is the globally optimal plan under the
cost estimation model.
5.5.2 Reducing Search Space
In this section, we analyse how to reduce the search space of query plans. Recall that,
for a particular keyword query q in the batch, we eventually choose only one query
plan to evaluate q. During the evaluation process, some plans in P(q) may be found
not promising enough to be the chosen plan for this query, and such plans can be
safely pruned. We introduce two theorems serving as the pruning conditions.
Theorem 5.5.1. Let pi, pj ∈ P(q) be any two query plans of the single query q. The
plan pi can be pruned if cost(pi) > cost(pj) and pi does not contain a sharing factor
that is not contained in pj, i.e., SF(pi) ⊆ SF(pj), where SF(p) denotes the set of
sharing factors of plan p.
Proof. Theorem 5.5.1 requires both conditions to be met in order to prune pi, i.e.,
(a) cost(pi) > cost(pj) and (b) SF(pi) ⊆ SF(pj). We prove by contradiction.

Case 1: if cost(pi) ≤ cost(pj), then pi is no more expensive than pj. In the case
that pj has no factors shared with other queries in Q, plan pi is always at least as
good as plan pj. As a result, pi cannot be pruned.

Case 2: if SF(pi) ⊄ SF(pj), there exists a sharing factor SF (SF ∈ SF(pi) and
SF ∉ SF(pj)) that is shared with another query q′ ∈ Q, q ≠ q′. Since the computation
of SF is shared between q and q′, the actual cost of pi must be less than expected
and may even be smaller than cost(pj). Consequently, pi cannot be pruned.
Continuing from the proof of Case 2 in Theorem 5.5.1, let SF be a sharing factor
with SF ∈ SF(pi) and SF ∉ SF(pj). When pi is chosen as the query plan of q and
is evaluated, the best case is that the intermediate result RC(SF, G) has already
been computed in a query plan of q′, and plan pi simply reuses the result. In that
case, the actual cost of pi is cost(pi) − cost(SF). Accordingly, we have Theorem 5.5.2
as follows:
Theorem 5.5.2. Let pi, pj ∈ P(q) be any two query plans of a single query q, and let
SF(pi), SF(pj) denote the sharing factors of pi and pj, respectively. The plan pi can
be pruned if cost(pi) − Σ_{SF ∈ SF(pi)\SF(pj)} cost(SF) > cost(pj).
Proof. If cost(pi) − Σ_{SF ∈ SF(pi)\SF(pj)} cost(SF) > cost(pj), then even in the best
case, where pi reuses as much shared computation of sharing factors as possible, pi
is still more expensive than its counterpart pj. The largest possible reusable cost is
Σ_{SF ∈ SF(pi)\SF(pj)} cost(SF), which is the maximum cost that pi can save, on the
condition that the computation of the sharing factors in SF(pi)\SF(pj) has been
done in other queries of the batch Q. The intuition is that if pi's minimal possible
cost is already larger than pj's cost, pi can be safely pruned.
Discussion. All our proposed frameworks can be extended to other search semantics,
except for the semantics-specific estimation technique discussed in Section 5.4.
5.6 Experimental Results
In this section, we evaluate the Shortest List Eager Approach of Algorithm 10,
the Maximal Overlapping Driven Approach of Algorithm 11, and the A*-based
algorithm proposed in this chapter, denoted Basic, Overlap, and EstPlan,
respectively. Their performance is evaluated and compared by running them on different
batches of multiple queries over two real datasets. All tests were conducted on a PC
with a 2.5 GHz CPU and 16 GB of memory running Ubuntu 14.04.3 LTS. All
algorithms were implemented in GNU C++.
5.6.1 Datasets and Tested Queries
Datasets. We evaluated our algorithms on two real datasets.
1. DBLP dataset¹. We generated a graph from the DBLP dataset. The generated
graph contains 37,375,895 vertices and 132,563,689 edges, where each vertex
represents a publication and each edge represents a citation relationship
between papers.

2. IMDB dataset². We generated a graph from a processed IMDB dataset [46]. The
vertices in the generated graph represent users or movies; there are 247,753
users and 34,208 movies. The edges represent relations between users and
movies: users may rate or comment on movies. In the generated graph, the
edges consist of 22,884,377 rating relations and 586,994 commenting relations.

¹http://dblp.uni-trier.de/xml/
²http://grouplens.org/datasets/movielens/

Table 5.1: Keyword sets for DBLP and IMDB

          Number of distinct keywords    Keyword frequency range
    DBLP  100                            0.015-0.075
    IMDB  100                            0.011-0.045
Tested Queries. For each dataset, we randomly selected 100 keywords as the keyword
set used for producing the tested batch queries. Table 5.1 shows the frequency range of
the keywords in each keyword set. We created batches with different ratios of shared
keywords as follows. We randomly produced 5 subsets of keywords picked from the
100 keywords for the DBLP dataset, containing 10, 15, 20, 25, and 30 distinct
keywords, respectively. For each subset of keywords, we randomly picked 3 to 7 keywords
to form an individual keyword query, repeating this on each subset until we had
generated 50 keyword queries as a query batch. We iterated the above process to
generate the tested batch queries for the DBLP and IMDB datasets. We use the
generated query batches as experimental input and report average results. In the
experimental studies, we fixed the number of keyword queries in a batch at 50.
Therefore, varying the number of distinct keywords in the subset from which a batch is
generated varies the amount of shared computation that the batch contains: if the
queries in a batch are generated from a small subset of keywords (e.g., 10 distinct
keywords), the shared computation in the batch is high; otherwise, it is low.
5.6.2 Evaluation of the Efficiency
We report the computational cost under various configurations in terms of the total
running time of the query batches.

Parameters. Parameters that may affect the batch processing efficiency include:
[Figure 5-4 plots running time (in seconds) and speedup against data size. Panels (a)
and (b) compare SERIAL, BASIC, OVERLAP, and EstPlan on DBLP (600 to 1800 MB)
and IMDB (128 to 640 MB), respectively; panels (c) and (d) show the speedups of
BASIC, OVERLAP, and EstPlan on the same datasets. All panels use 20 distinct
keywords and r = 3.]
Figure 5-4: Scalability and speedup studies
[Figure 5-5 plots the running time (in seconds) of BASIC, OVERLAP, and EstPlan:
(a) DBLP, varying r from 2 to 5 with 20 distinct keywords; (b) DBLP, varying the
number of distinct keywords from 10 to 30 with r = 3; (c) IMDB, varying r from 2
to 5 with 20 distinct keywords; (d) IMDB, varying the number of distinct keywords
from 10 to 30 with r = 3.]
Figure 5-5: Efficiency of multiple queries
the size of the dataset, r, and the number of distinct keywords. In the following
experiments, the default dataset size is the full size of DBLP and IMDB, the default
value of r is 3, and the default number of distinct keywords in the subsets from which
the query batches are generated is 20.
Scalability. We report the computational cost of the batch processing algorithms
Basic, Overlap, and EstPlan as we vary the dataset size of DBLP and IMDB.
To demonstrate the benefit of batch processing, we also implemented a serial query
processing algorithm, denoted SERIAL. The DBLP data size ranges from 600 MB
to 1800 MB in intervals of 300 MB, and the IMDB data size from 128 MB to 640 MB
in intervals of 128 MB. The other parameters are kept at their default values (r = 3
and 20 distinct keywords). Figures 5-4(a) and 5-4(b) show that the batch processing
algorithms are one order of magnitude faster than SERIAL on both the DBLP and
IMDB datasets.
Speedup. Figures 5-4(c) and 5-4(d) report the speedups of the batch processing
algorithms w.r.t. SERIAL. Overall, EstPlan outperforms all the other algorithms
on both tested datasets, having the highest average speedup. When the data size is
small, the speedup of EstPlan is relatively close to those of Basic and Overlap,
because evaluating the batch queries on small pieces of data is fast and the
optimisation overhead of EstPlan is non-trivial in this case. As the data size increases, the
speedups of Basic, Overlap, and EstPlan grow, which demonstrates the advantage
of batch processing. Due to the significant speedup of the batch processing
algorithms, in the rest of the experiments we focus on reporting and discussing the results
of the batch processing algorithms only.
Varying r. We show how the running time changes as we vary the value of r while
keeping the default settings for the data sizes and the number of distinct keywords.
Figures 5-5(a) and 5-5(c) show that when r is no less than 3, EstPlan is much faster
than all the other algorithms. When r is small, the r-clique computation cost is small
and the optimisation overhead of EstPlan dominates the overall running time;
because of this, EstPlan is close to the other batch processing algorithms when r = 2
in Figure 5-5(a). As r increases, the running times of all algorithms increase sharply,
because larger values of r yield more results for the keyword queries in a batch. Notice
that the average running time on IMDB is almost 10 times higher than on DBLP;
this is because the average connectivity of the graph generated from IMDB is much
higher than that of the graph from DBLP.
Varying the number of distinct keywords. We show how the running time varies
with the number of distinct keywords contained in the batch queries while r and
the data sizes are set to their default values. As discussed above, varying the number
of distinct keywords approximately changes the ratio of shared computation.
Figures 5-5(b) and 5-5(d) show that the time consumption of all algorithms grows as
the number of distinct keywords increases, because a larger number of distinct
keywords leads to less shared computation that Basic, Overlap, and EstPlan can
take advantage of.
135
The experiments demonstrate that reusing shared computations in a batch improves the efficiency of computation. EstPlan outperforms all the other algorithms in most configurations.
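To make the reuse principle concrete, the following toy Python sketch memoizes keyword-pair sub-query results across a batch. It only illustrates the "reuse if possible" idea, not the thesis' actual algorithms; `eval_pair` and the pair-level decomposition are hypothetical stand-ins for the real sub-query evaluation.

```python
# Toy illustration of batch-level reuse: each query is a set of keywords,
# and a "sub-query" here is a keyword pair whose intermediate result can
# be shared by every query in the batch that contains both keywords.
def evaluate_batch(queries, eval_pair):
    cache = {}                                  # shared sub-query results
    results = {}
    for q in queries:
        kws = sorted(q)
        partial = []
        for i in range(len(kws)):
            for j in range(i + 1, len(kws)):
                pair = (kws[i], kws[j])
                if pair not in cache:           # "reuse if possible"
                    cache[pair] = eval_pair(pair)
                partial.append(cache[pair])
        results[frozenset(q)] = partial         # final r-join step omitted
    return results, len(cache)
```

For two queries {a, b, c} and {b, c, d}, only five distinct pairs are evaluated instead of six, because the pair (b, c) is computed once and reused.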
5.6.3 Evaluation of Effectiveness
In this section, we assess the effectiveness of our proposed cardinality-based computational cost estimation model and the pruning effectiveness of Theorem 5.5.1 and Theorem 5.5.2.
For a given batch of queries, EstPlan first generates a plan for the batch query based on the proposed cost estimation model and then executes the queries or sub-queries according to the generated plan. The total time is taken as the exact computational cost of the batch of queries. To measure the effectiveness of the cost model, we also need a ground truth plan. We run a large number of alternative plans for the batch and select the one consuming the minimal time cost. The selected plan is treated as the ground truth plan in our experiment. The effectiveness of the cost estimation can then be computed as the ratio of the time cost of running EstPlan to the time cost of running the ground truth plan. This ratio is always no less than one, and a smaller ratio means higher effectiveness, since it indicates that the plan generated by our cost model closely approaches the ground truth plan. Note that the running times reported in this section exclude the plan generation time.
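The ratio described above can be written down directly; the following small sketch is an illustration of the measurement, with illustrative function and parameter names (the ground truth is simply the fastest plan observed among all plans tried).

```python
def cost_model_effectiveness(estplan_time, alternative_plan_times):
    """Ratio of EstPlan's execution time to the ground truth plan's time.

    The ground truth is the fastest plan among all plans tried (including
    EstPlan's own), so the ratio is always >= 1, and values close to 1
    mean the cost-model-generated plan is close to optimal. Plan
    generation time is assumed to be excluded from both measurements.
    """
    ground_truth_time = min(alternative_plan_times + [estplan_time])
    return estplan_time / ground_truth_time
```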
For an individual query, two factors affect the effectiveness of its cost estimation: the number of keywords in the query and the value of r. This is because the cost estimation relies on cardinality estimation, which in turn depends on r and the number of keywords (Section 5.4). In the following experimental configuration, the default number of keywords is 4 and the default value of r is 3.
Varying r. We show how the effectiveness of our cost estimation model varies with different values of r when the number of keywords in batch queries is set to the default value. Figure 5-6(a) and Figure 5-6(c) show that, for both DBLP and IMDB, the ratio of the estimated time cost decreases when we vary the value of r from 2
[Figure 5-6 panels: running time ratio (y-axis) for (a) DBLP, average number of query keywords = 4, r varied from 2 to 5; (b) DBLP, r = 3, number of query keywords varied from 3 to 7; (c) IMDB, average number of query keywords = 4; (d) IMDB, r = 3; (e) DBLP, r = 3, batch size varied from 50 to 90; (f) IMDB, r = 3.]
Figure 5-6: Accuracy of cardinality estimation
to 5. As a lower ratio means higher effectiveness, the experimental study indicates that the query plan based on our cost estimation model can approach the ground truth plan for the query evaluation. This is because, when r is high, the results of r-cliques on the graph data generated from DBLP and IMDB tend to be Cartesian products of content vertices, and the proposed cardinality estimation equation follows the same tendency, which leads to more effective cost estimation.
Varying the average number of query keywords in query batches. We show how the effectiveness of the cost estimation model varies when the individual queries in the batch contain more keywords, with r set to the default value. Figure 5-6(b) and Figure 5-6(d) show that, on both the DBLP and IMDB datasets, the ratio increases with the individual query length. As a larger ratio means lower effectiveness of the cost estimation, the experimental study implies that the effectiveness drops as the length of individual queries increases.
Varying the batch size. We also study how the effectiveness of the cost estimation model varies when we change the batch size, with r set to the default value. We vary the batch size from 50 to 90 with an interval of 10. As shown in Figure 5-6(e) and Figure 5-6(f), the ratio does not change much. The effectiveness drops slightly as the size of query batches increases for both DBLP and IMDB. This indicates that our proposed cost estimation model is stable for both large and small query batches.
Pruning effectiveness of theorems. Here we show the effect of the proposed pruning methods discussed in Theorem 5.5.1 and Theorem 5.5.2. We measure the plan generation time of EstPlan for obtaining the optimal plan with two inputs: (1) the pruned search space (denoted as P) and (2) the non-pruned search space (denoted as NP). In Figure 5-7(a) and Figure 5-7(c), we compare their plan generation times for batches with different amounts of shared computation. Figure 5-7(a) and Figure 5-7(c) show that, as the number of distinct keywords in batches increases (which decreases the amount of shared computation contained in a batch), the plan generation time of EstPlan with the pruned search space decreases, while that with the non-pruned search space is independent of the amount of shared computation contained in batches. This is because the pruning
[Figure 5-7 panels: (a) running time comparison on DBLP and (c) on IMDB, plotting plan generation time in seconds for NP and P against the number of distinct keywords (10 to 30); (b) and (d) pruning effectiveness on DBLP and IMDB over the same range.]
Figure 5-7: Pruning effectiveness
effect is associated with shared r-join operations, and less sharing results in higher pruning effectiveness. On the other hand, for the non-pruned global optimal plan search space, plan generation is independent of the keyword set size but depends on the keyword batch size and the keyword query length. It is noticeable that the plan generation time of EstPlan with the pruned search space is on average five times less than that with the non-pruned search space. Figure 5-7(b) and Figure 5-7(d) show the pruning effectiveness in terms of the ratio between the average number of global plans in the pruned space and the average number of global plans in the non-pruned search space. A higher ratio indicates better pruning effectiveness. On both datasets, the ratio increases with more distinct keywords in a batch (which decreases the amount of shared computation contained in a batch), representing better pruning effectiveness. It is noticeable that the pruning effectiveness is over 0.75 on average for both the DBLP and IMDB datasets.
5.7 Conclusion
In this chapter, we have studied a new problem: batch keyword query processing on graph data, with the r-clique used as the keyword query result semantics. We developed two heuristic algorithms to find good query plans, both based on reusing shared computations between multiple queries: the shortest list eager approach (reuse if possible) and the maximal overlapping driven approach (reuse as much as possible). To run the batched queries optimally, we devised an estimation-based cost model to assess the computational cost of possible sub-queries, which is then used to identify the optimal plan for the batch query evaluation. We have conducted extensive experiments to test the performance of the three algorithms on the DBLP and IMDB datasets. The cost estimation based approach has been identified as the best solution.
Chapter 6
Conclusion and Future Work
In this chapter, we summarise the principal contributions made in this thesis and
discuss some interesting future research directions that can be further explored.
6.1 Conclusion
In this thesis, we studied how to search cohesive subgraphs, i.e. the k-truss, the densest subgraph and the clique, in spatially or textually attributed graph data, and proposed efficient algorithms for finding the studied cohesive subgraphs. In particular, three cohesive subgraph models with different objectives were explored: (1) searching spatially attributed k-trusses that are structurally and spatially cohesive, for the purpose of discovering co-located communities; (2) searching textually attributed densest subgraphs that are structurally cohesive and textually correlated, for discovering contextual communities; and (3) efficiently processing multiple textually attributed r-cliques, aiming to answer multiple keyword queries on graph data.
Towards objective (1), we studied the maximum co-located community search problem in large-scale social networks. We proposed a novel community model, the co-located community, considering both social and spatial cohesiveness. We developed efficient exact algorithms to find all maximum co-located communities, and designed an approximation algorithm with guaranteed spatial error ratios. We further improved the performance using the proposed TQ-tree index. We conducted extensive experiments
on large real-world networks, and the results demonstrate the effectiveness and effi-
ciency of the proposed algorithms.
Towards objective (2), we proposed a novel parameter-free community model, namely the contextual community (CC), which only requires a query to provide a set of keywords describing an application/user context. We proposed two network flow based exact algorithms to solve CC search in polynomial time, and an approximation algorithm with an approximation ratio of 1/3. Our empirical studies on real social network datasets demonstrate the superior effectiveness of the CC search methods under different query contexts. Extensive performance evaluations also reveal the superb practical efficiency of the proposed CC search algorithms.
Towards objective (3), we studied a new problem: batch keyword query processing on graph data, with the r-clique used as the keyword query result semantics. We developed two heuristic algorithms to find good query plans, both based on reusing shared computations between multiple queries: the shortest list eager approach (reuse if possible) and the maximal overlapping driven approach (reuse as much as possible). To run the batched queries optimally, we devised an estimation-based cost model to assess the computational cost of possible sub-queries, which is then used to identify the optimal plan for the batch query evaluation. We have conducted extensive experiments to test the performance of the three algorithms on the DBLP and IMDB datasets. The cost estimation based approach has been identified as the best solution.
6.2 Future Work
New models. In a number of real networks, attributed graph data can be represented as multi-dimensional graphs. To fit a certain application scenario, the cohesiveness requirement for each dimension of a multi-dimensional graph could be very different. As such, various combinations of different cohesive subgraph models can be expected, motivating the proposal of new cohesive subgraph models and the design of efficient algorithms for them.
More theoretical studies. The hardness of enumerating maximal cliques in a unit disk graph (UDG) is still an open problem. Although there is a study on the maximal clique enumeration problem in some special graphs [98], the proof ideas and gadgets they proposed are difficult to adapt to prove the hardness of maximal clique enumeration in a UDG. In addition, most existing hardness proofs for UDG problems are reductions from the same problems on planar graphs. However, since UDGs are clearly more complicated than planar graphs, problems that are easy on planar graphs are not necessarily easy on UDGs. Therefore, it is possible that enumerating maximal cliques in a UDG is intractable.
Better algorithms. Improving the time complexity of the maximum density subgraph problem is still a challenging problem. All existing maximum density subgraph algorithms rely on solving a series of min s-t cut problems, and the most efficient one relies in particular on the push-relabel min s-t cut algorithm. However, the push-relabel algorithm has been outperformed by the algorithm proposed in [79]. We observe that, when using the algorithm in [79] to solve one min s-t cut problem in the series required for finding a maximum density subgraph, some computations of the previously solved min s-t cut can be reused, which may lead to a faster algorithm for the maximum density subgraph problem than all existing algorithms.
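To make the "series of min s-t cuts" concrete, here is a self-contained Python sketch of Goldberg's classical binary search reduction [42] for the maximum density subgraph. It uses a plain Edmonds-Karp max-flow rather than push-relabel or the faster algorithm of [79], and does not attempt the computation reuse suggested above; the function names are illustrative, and graph vertices are assumed not to be named 's' or 't'.

```python
from collections import defaultdict, deque

def max_flow_min_cut(cap, s, t):
    """Edmonds-Karp max flow; returns the source side of a min s-t cut.
    `cap` maps directed pairs (u, v) to capacities."""
    flow = defaultdict(float)
    adj = defaultdict(set)
    for (u, v) in cap:
        adj[u].add(v)
        adj[v].add(u)

    def residual(u, v):
        return cap.get((u, v), 0.0) - flow[(u, v)] + flow[(v, u)]

    while True:
        parent, q = {s: None}, deque([s])
        while q and t not in parent:        # BFS for a shortest augmenting path
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and residual(u, v) > 1e-9:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return set(parent)              # nodes still reachable from s
        path, v = [], t
        while parent[v] is not None:        # reconstruct the augmenting path
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual(u, v) for u, v in path)
        for u, v in path:
            flow[(u, v)] += bottleneck

def densest_subgraph(nodes, edges):
    """Goldberg-style exact maximum density subgraph via a series of
    min s-t cuts driven by binary search on the density guess g."""
    m, n = len(edges), len(nodes)
    if m == 0:
        return set(nodes), 0.0
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    lo, hi, best = 0.0, float(m), set(nodes)
    while hi - lo >= 1.0 / (n * (n - 1)):   # distinct densities differ by >= 1/(n(n-1))
        g = (lo + hi) / 2.0
        cap = {}
        for u, v in edges:                  # each undirected edge, both directions
            cap[(u, v)] = cap[(v, u)] = 1.0
        for v in nodes:
            cap[('s', v)] = float(m)
            cap[(v, 't')] = m + 2.0 * g - deg[v]
        S = max_flow_min_cut(cap, 's', 't') - {'s'}
        if S:                               # some subgraph has density > g
            lo, best = g, S
        else:
            hi = g
    e_best = sum(1 for u, v in edges if u in best and v in best)
    return best, e_best / len(best)
```

On a K4 with one pendant vertex attached, this returns the K4 with density 6/4 = 1.5, rather than the whole graph (density 7/5 = 1.4).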
Bibliography
[1] Alok Aggarwal, Hiroshi Imai, Naoki Katoh, and Subhash Suri. Finding k points with minimum diameter and related problems. Journal of Algorithms, 12(1):38–56, 1991.
[2] Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das. DBXplorer: a system for keyword-based search over relational databases. In ICDE, pages 5–16, 2002.
[3] Esra Akbas and Peixiang Zhao. Truss-based community search: a truss-equivalence based indexing approach. PVLDB, 10(11):1298–1309, 2017.
[4] E. A. Akkoyunlu. The enumeration of maximal cliques of large graphs. SIAM Journal on Computing, 2(1):1–6, 1973.
[5] H. Aksu, M. Canim, Y. Chang, I. Korpeoglu, and O. Ulusoy. Distributed k-core view materialization and maintenance for large dynamic graphs. IEEE Transactions on Knowledge and Data Engineering, 26(10):2439–2452, 2014.
[6] R. Andersen. Finding large and small dense subgraphs. arXiv:cs/0702032, 2007.
[7] Reid Andersen and Kumar Chellapilla. Finding dense subgraphs with size bounds. In Algorithms and Models for the Web-Graph, pages 25–37. Springer Berlin Heidelberg, 2009.
[8] Vladimir Batagelj and Matjaz Zaversnik. An O(m) algorithm for cores decomposition of networks. CoRR, cs.DS/0310049, 2002.
[9] Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and Shashank Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, pages 431–440. IEEE, 2002.
[10] Coen Bron and Joep Kerbosch. Algorithm 457: Finding all cliques of an undirected graph. Commun. ACM, pages 575–577, 1973.
[11] Nicolas Bruno, Luis Gravano, Nick Koudas, and Divesh Srivastava. Navigation- vs. index-based XML multi-query processing. In ICDE, pages 139–150. IEEE, 2003.
[12] Guo-Ray Cai and Yu-Geng Sun. The minimum augmentation of any graph to a k-edge-connected graph. Networks, pages 151–172, 1989.
[13] Xin Cao, Gao Cong, Christian S. Jensen, and Beng Chin Ooi. Collective spatial keyword querying. In SIGMOD, pages 373–384. ACM, 2011.
[14] Lijun Chang, Jeffrey Xu Yu, and Lu Qin. Fast maximal cliques enumeration in sparse graphs. Algorithmica, 66(1):173–186, 2013.
[15] Moses Charikar. Greedy approximation algorithms for finding dense components in a graph. In Approximation Algorithms for Combinatorial Optimization, pages 84–95. Springer Berlin Heidelberg, 2000.
[16] P. Chen, C. Chou, and M. Chen. Distributed algorithms for k-truss decomposition. In 2014 IEEE International Conference on Big Data (Big Data), pages 471–480, 2014.
[17] Yu Chen, Jun Xu, and Minzheng Xu. Finding community structure in spatially constrained complex networks. International Journal of Geographical Information Science, 29(6):889–911, 2015.
[18] Chun-Hung Cheng, Ada Waichee Fu, and Yi Zhang. Entropy-based subspace clustering for mining numerical data. In SIGKDD, pages 84–93. ACM, 1999.
[19] Hong Cheng, Yang Zhou, Xin Huang, and Jeffrey Xu Yu. Clustering large attributed information networks: an efficient incremental computing approach. Data Mining and Knowledge Discovery, 25(3):450–477, 2012.
[20] J. Cheng, Y. Ke, S. Chu, and M. T. Ozsu. Efficient core decomposition in massive networks. In 2011 IEEE 27th International Conference on Data Engineering, pages 51–62, 2011.
[21] Boris V. Cherkassky and Andrew V. Goldberg. On implementing the push-relabel method for the maximum flow problem. Algorithmica, 19(4):390–410, 1997.
[22] Vladislav Chesnokov. Overlapping community detection in social networks with node attributes by neighborhood influence. In Models, Algorithms, and Technologies for Network Analysis, pages 187–203. Springer International Publishing, 2017.
[23] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. Friendship and mobility: User movement in location-based social networks. In SIGKDD, pages 1082–1090. ACM, 2011.
[24] Farhana M. Choudhury, J. Shane Culpepper, and Timos Sellis. Batch processing of top-k spatial-textual queries. In 2nd Intl. ACM Workshop on Managing and Mining Enriched Geo-Spatial Data, pages 7–12, 2015.
[25] Brent N. Clark, Charles J. Colbourn, and David S. Johnson. Unit disk graphs. Discrete Mathematics, 86(1):165–177, 1990.
[26] Jonathan Cohen. Trusses: Cohesive subgraphs for social network analysis. National Security Agency Technical Report, 16, 2008.
[27] Wanyun Cui, Yanghua Xiao, Haixun Wang, Yiqi Lu, and Wei Wang. Online search of overlapping communities. In SIGMOD, pages 277–288, 2013.
[28] Wanyun Cui, Yanghua Xiao, Haixun Wang, and Wei Wang. Local search of communities in large graphs. In SIGMOD, pages 991–1002, 2014.
[29] Ian De Felipe, Vagelis Hristidis, and Naphtali Rishe. Keyword search on spatial databases. In ICDE, pages 656–665. IEEE, 2008.
[30] Werner Dinkelbach. On nonlinear fractional programming. Management Science, 13(7):492–498, 1967.
[31] D. Eppstein, M. Loffler, and D. Strash. Listing all maximal cliques in sparse graphs in near-optimal time. ArXiv e-prints, 2010.
[32] P. Erdos and G. Szekeres. A combinatorial problem in geometry, pages 49–56. Birkhäuser Boston, 1987.
[33] Paul Expert, Tim S. Evans, Vincent D. Blondel, and Renaud Lambiotte. Uncovering space-independent communities in spatial networks. Proceedings of the National Academy of Sciences, 108(19):7663–7668, 2011.
[34] Yixiang Fang, Reynold Cheng, Xiaodong Li, Siqiang Luo, and Jiafeng Hu. Effective community search over large spatial graphs. PVLDB, 10(6):709–720, 2017.
[35] Yixiang Fang, Reynold Cheng, Siqiang Luo, and Jiafeng Hu. Effective community search for large attributed graphs. PVLDB, 9(12):1233–1244, 2016.
[36] U. Feige, D. Peleg, and G. Kortsarz. The dense k-subgraph problem. Algorithmica, 29(3):410–421, 2001.
[37] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.
[38] Giorgio Gallo, Michael D. Grigoriadis, and Robert E. Tarjan. A fast parametric maximum flow algorithm and applications. SIAM Journal on Computing, 18(1):30–55, 1989.
[39] Michelle Girvan and Mark E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 99(cond-mat/0112110):8271–8276, 2001.
[40] Andrew V. Goldberg and Robert E. Tarjan. A new approach to the maximum-flow problem. J. ACM, 35(4):921–940, 1988.
[41] Andrew V. Goldberg and Robert E. Tarjan. Efficient maximum flow algorithms. Commun. ACM, 57(8):82–89, 2014.
[42] A. V. Goldberg. Finding a maximum density subgraph. Technical report, 1984.
[43] Goetz Graefe and William J. McKenna. The Volcano optimizer generator: Extensibility and efficient search. In ICDE, pages 209–218. IEEE Computer Society, 1993.
[44] Diansheng Guo. Regionalization with dynamically constrained agglomerative clustering and partitioning (REDCAP). International Journal of Geographical Information Science, 22(7):801–823, 2008.
[45] Rajarshi Gupta, Jean Walrand, and Olivier Goldschmidt. Maximal cliques in unit disk graphs: Polynomial approximation. In INOC, 2005.
[46] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):19:1–19:19, 2015.
[47] Hao He, Haixun Wang, Jun Yang, and Philip S. Yu. BLINKS: ranked keyword searches on graphs. In SIGMOD, pages 305–316. ACM, 2007.
[48] J. E. Hopcroft and R. M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. In 12th Annual Symposium on Switching and Automata Theory (SWAT 1971), pages 122–125, 1971.
[49] Vagelis Hristidis and Yannis Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB, pages 670–681. VLDB Endowment, 2002.
[50] Jiafeng Hu, Xiaowei Wu, Reynold Cheng, Siqiang Luo, and Yixiang Fang. Querying minimal Steiner maximum-connected subgraphs in large graphs. In CIKM, pages 1241–1250. ACM, 2016.
[51] Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, and Jeffrey Xu Yu. Querying k-truss community in large and dynamic graphs. In SIGMOD, pages 1311–1322, 2014.
[52] Xin Huang, Hong Cheng, and Jeffrey Xu Yu. Dense community detection in multi-valued attributed networks. Inf. Sci., 314(C):77–99, 2015.
[53] Xin Huang and Laks V. S. Lakshmanan. Attribute-driven community search. PVLDB, 10(9):949–960, 2017.
[54] Xin Huang, Laks V. S. Lakshmanan, Jeffrey Xu Yu, and Hong Cheng. Approximate closest community search in networks. PVLDB, 9(4):276–287, 2015.
[55] Marie Jacob and Zachary Ives. Sharing work in keyword search over databases. In SIGMOD, pages 577–588. ACM, 2011.
[56] Jean-Claude Picard and Maurice Queyranne. A network flow solution to some nonlinear 0-1 programming problems, with applications to graph theory. Networks, 12(2):141–159, 1982.
[57] Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, and Hrishikesh Karambelkar. Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505–516. VLDB Endowment, 2005.
[58] Mehdi Kargar and Aijun An. Keyword search in graphs: Finding r-cliques. PVLDB, 4(10):681–692, 2011.
[59] Tarun Kathuria and S. Sudarshan. Efficient and provable multi-query optimization. In PODS, pages 53–67. ACM, 2017.
[60] S. Khot. Ruling out PTAS for graph min-bisection, dense k-subgraph, and bipartite clique. SIAM Journal on Computing, 36(4):1025–1071, 2006.
[61] Samir Khuller and Barna Saha. On finding dense subgraphs. In Proceedings of the 36th International Colloquium on Automata, Languages and Programming: Part I, ICALP '09, pages 597–608. Springer-Verlag, 2009.
[62] Andrea Lancichinetti, Santo Fortunato, and Janos Kertesz. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11(3):033015, 2009.
[63] Wangchao Le, Anastasios Kementsietsidis, Songyun Duan, and Feifei Li. Scalable multi-query optimization for SPARQL. In ICDE, pages 666–677. IEEE, 2012.
[64] J. J. Levandoski, M. Sarwat, A. Eldawy, and M. F. Mokbel. LARS: A location-aware recommender system. In ICDE, pages 450–461, 2012.
[65] Guoliang Li, Shuo Chen, Jianhua Feng, Kian-Lee Tan, and Wen-Syan Li. Efficient location-aware influence maximization. In SIGMOD, pages 87–98. ACM, 2014.
[66] Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, and Lizhu Zhou. EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD, pages 903–914. ACM, 2008.
[67] Jianxin Li, Chengfei Liu, and Md Saiful Islam. Keyword-based correlated network computation over large social media. In ICDE, pages 268–279, 2014.
[68] Jianxin Li, Xinjue Wang, Ke Deng, Xiaochun Yang, Timos Sellis, and Jeffrey Xu Yu. Most influential community search over large social networks. In ICDE, pages 871–882, 2017.
[69] Rong-Hua Li, Lu Qin, Fanghua Ye, Jeffrey Xu Yu, Xiaokui Xiao, Nong Xiao, and Zibin Zheng. Skyline community search in multi-valued networks. In SIGMOD, pages 457–472. ACM, 2018.
[70] Rong-Hua Li, Lu Qin, Jeffrey Xu Yu, and Rui Mao. Influential community search in large networks. PVLDB, 8(5):509–520, 2015.
[71] Rong-Hua Li, Jiao Su, Lu Qin, Jeffrey Xu Yu, and Qiangqiang Dai. Persistent community search in temporal networks. In ICDE, pages 797–808, 2018.
[72] Rong-Hua Li, Jeffrey Xu Yu, and Rui Mao. Efficient core maintenance in large dynamic graphs. IEEE Transactions on Knowledge and Data Engineering, 26:2453–2465, 2014.
[73] Yan Liu, Alexandru Niculescu-Mizil, and Wojciech Gryc. Topic-link LDA: Joint models of topic and author community. In ICML, pages 665–672. ACM, 2009.
[74] Ziyang Liu and Yi Chen. Identifying meaningful return information for XML keyword search. In SIGMOD, pages 329–340, 2007.
[75] Can Lu, Jeffrey Xu Yu, Hao Wei, and Yikai Zhang. Finding the maximum clique in massive graphs. PVLDB, 10(11):1538–1549, 2017.
[76] R. Duncan Luce and Albert D. Perry. A method of matrix analysis of group structure. Psychometrika, 14(2):95–116, 1949.
[77] Ramesh M. Nallapati, Amr Ahmed, Eric P. Xing, and William W. Cohen. Joint latent topic models for text and citations. In SIGKDD, pages 542–550. ACM, 2008.
[78] Mark E. J. Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.
[79] James B. Orlin. Max flows in O(nm) time, or better. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, pages 765–774. ACM, 2013.
[80] Lu Qin, Rong-Hua Li, Lijun Chang, and Chengqi Zhang. Locally densest subgraph discovery. In SIGKDD, pages 965–974, 2015.
[81] Lu Qin, J. X. Yu, Lijun Chang, and Yufei Tao. Querying communities in relational databases. In ICDE, pages 724–735, 2009.
[82] Lester R. Ford Jr. and Delbert R. Fulkerson. Maximal flow through a network, pages 243–248. Birkhäuser Boston, 1987.
[83] Mojtaba Rezvani, Weifa Liang, Chengfei Liu, and Jeffrey Xu Yu. Efficient detection of overlapping communities using asymmetric triangle cuts. TKDE, 2018.
[84] Prasan Roy, Srinivasan Seshadri, S. Sudarshan, and Siddhesh Bhobe. Efficient and extensible algorithms for multi query optimization. SIGMOD, 29(2):249–260, 2000.
[85] Yiye Ruan, David Fuhry, and Srinivasan Parthasarathy. Efficient community detection in large networks using content and links. In WWW, pages 1089–1098. ACM, 2013.
[86] Raman Samusevich, Maximilien Danisch, and Mauro Sozio. Local triangle-densest subgraphs. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 33–40. IEEE Press, 2016.
[87] Ahmet Erdem Sarıyuce, Bugra Gedik, Gabriela Jacques-Silva, Kun-Lung Wu, and Umit V. Catalyurek. Streaming algorithms for k-core decomposition. PVLDB, 6(6):433–444, 2013.
[88] M. Sarwat, J. J. Levandoski, A. Eldawy, and M. F. Mokbel. An efficient and scalable location-aware recommender system. TKDE, 26(6):1384–1399, 2014.
[89] Stephen B. Seidman. Network structure and minimum degree. Social Networks, 5(3):269–287, 1983.
[90] Timos Sellis and Subrata Ghosh. On the multiple-query optimization problem. TKDE, 2(2):262–266, 1990.
[91] Timos K. Sellis. Multiple-query optimization. ACM Trans. Database Syst., 13(1):23–52, 1988.
[92] Kyuseok Shim, Timos Sellis, and Dana Nau. Improvements on a heuristic algorithm for multiple-query optimization. Data & Knowledge Engineering, 12(2):197–222, 1994.
[93] Mauro Sozio and Aristides Gionis. The community-search problem and how to plan a successful cocktail party. In SIGKDD, pages 939–948, 2010.
[94] Taisuke Izumi and Daisuke Suzuki. Faster enumeration of all maximal cliques in unit disk graphs using geometric structure. IEICE Transactions on Information and Systems, E98.D(3):490–496, 2015.
[95] Nikolaj Tatti and Aristides Gionis. Density-friendly graph decomposition. In WWW, pages 1089–1099. International World Wide Web Conferences Steering Committee, 2015.
[96] Etsuji Tomita, Akira Tanaka, and Haruhisa Takahashi. The worst-case time complexity for generating all maximal cliques and computational experiments. Theoretical Computer Science, 363(1):28–42, 2006.
[97] Charalampos Tsourakakis. The k-clique densest subgraph problem. In WWW, pages 1122–1132. International World Wide Web Conferences Steering Committee, 2015.
[98] S. Vadhan. The complexity of counting in sparse, regular, and planar graphs. SIAM Journal on Computing, 31(2):398–427, 2001.
[99] Jia Wang and James Cheng. Truss decomposition in massive networks. PVLDB, 5(9):812–823, 2012.
[100] N. Wang, D. Yu, H. Jin, C. Qian, X. Xie, and Q. Hua. Parallel algorithm for core maintenance in dynamic graphs. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 2366–2371, 2017.
[101] Dong Wen, Lu Qin, Ying Zhang, Xuemin Lin, and Jeffrey Xu Yu. I/O efficient core graph decomposition at web scale. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 133–144, 2016.
[102] Peng Wu and Li Pan. Mining application-aware community organization with expanded feature subspaces from concerned attributes in social networks. Knowledge-Based Systems, 139:1–12, 2018.
[103] Yubao Wu, Ruoming Jin, Jing Li, and Xiang Zhang. Robust local community detection: On free rider effect and its elimination. PVLDB, 8(7):798–809, 2015.
[104] Yanghua Xiao, Wentao Wu, Jian Pei, Wei Wang, and Zhenying He. Efficiently indexing shortest paths by exploiting symmetry in graphs. In EDBT, pages 493–504, 2009.
[105] Yu Xu and Yannis Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In SIGMOD, pages 527–538. ACM, 2005.
[106] Zhiqiang Xu, Yiping Ke, Yi Wang, Hong Cheng, and James Cheng. A model-based approach to attributed graph clustering. In SIGMOD, pages 505–516. ACM, 2012.
[107] Liang Yao, Chengfei Liu, Jianxin Li, and Rui Zhou. Efficient computation of multiple XML keyword queries. In WISE, pages 368–381. Springer, 2013.
[108] Long Yuan, Lu Qin, Wenjie Zhang, Lijun Chang, and Jianye Yang. Index-based densest clique percolation community search in networks. TKDE, 2018.
[109] Fan Zhang, Ying Zhang, Lu Qin, Wenjie Zhang, and Xuemin Lin. When engagement meets similarity: efficient (k, r)-core computation on social networks. PVLDB, 10(10):998–1009, 2017.
[110] Yikai Zhang, Jeffrey Xu Yu, Ying Zhang, and Lu Qin. A fast order-based approach for core maintenance. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 337–348, 2017.
[111] Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. Graph clustering based on structural/attribute similarities. PVLDB, 2(1):718–729, 2009.