arXiv:1906.09068v1 [physics.soc-ph] 21 Jun 2019


Simplex2Vec embeddings for community detection in simplicial complexes

Jacob Charles Wright Billings,1 Mirko Hu,2 Giulia Lerda,1 Alexey N. Medvedev,3 Francesco Mottes,4 Adrian Onicas,2 Andrea Santoro,5, 6 and Giovanni Petri1, 7, ∗

1 Institute for Scientific Interchange, Torino, Italy
2 IMT School for Advanced Studies, Lucca, Italy
3 ICTEAM, Université Catholique de Louvain, Louvain-la-Neuve, Belgium
4 Università di Torino, Physics Department and INFN, Torino, Italy
5 School of Mathematical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
6 The Alan Turing Institute, The British Library, NW1 2DB, London, United Kingdom
7 ISI Global Science Foundation, New York, USA

(Dated: June 24, 2019)

Topological representations are rapidly becoming a popular way to capture and encode higher-order interactions in complex systems. They have found applications in disciplines as different as cancer genomics, brain function, and computational social science, in representing both descriptive features of data and inference models. While intense research has focused on the connectivity and homological features of topological representations, surprisingly scarce attention has been given to the investigation of the community structures of simplicial complexes. To this end, we adopt recent advances in symbolic embeddings to compute and visualize the community structures of simplicial complexes. We first investigate the stability of the embeddings obtained for synthetic simplicial complexes with respect to the presence of higher order interactions. We then focus on complexes arising from social and brain functional data and show how higher order interactions can be leveraged to improve cluster detection and to assess the effect of higher order interactions on individual nodes. We conclude by delineating limitations and directions for extension of this work.

Keywords: topological data analysis, simplicial complex, dimensionality reduction, community detection

I. INTRODUCTION

In the last decades, network science has successfully characterized real-world complex systems [1–3] by studying each system's corresponding network representation. Nodes represent the elements of the system, and links represent pairwise interactions [4, 5]. However, many real-world networks, including social systems [6–10], collaboration networks among scientists [7, 11], joint appearances among actors [12], neural activity of the human brain [13–18], etc., often exhibit interactions that cannot be captured when only considering pairwise relationships. As an example, consider a congress that meets as a single body when in session. Representing this relationship as a set of pairwise meetings would miss the fact that more than two congresspersons must meet together to make a legal quorum. Moreover, several real-world networks display "small-world" topologies, where high concentrations of edges exist among special groups of vertices, and where low concentrations of edges exist between these groups.

Owing to the importance of understanding correlated processes in terms of their function(s) as a network, much attention has been devoted to the development of new and scalable methods to extract communities and their boundaries based on the structural roles of interconnected nodes [19–24]. Simplicial complexes were introduced as a way to account for higher-order interactions beyond more traditional pairwise descriptors [25–28]. Indeed, it has been shown that simplicial representations provide new and satisfactory explanations for many complex dynamics in neuroscience [15–18], social systems [29, 30] and dynamical systems [31–33]. Surprisingly, with the exception of a few notable examples [34–36], scarce attention has been devoted to defining the community structure of simplicial complexes.

∗ Correspondence email address: [email protected]

To this end, the present work implements recent advances in symbolic embeddings, i.e. node2vec and word2vec [37, 38], combined with biased and unbiased random walks to map, detect and visualize the community structure of simplicial complexes.

II. METHODS

A. Simplices and simplicial complexes

The simplest definition of a k-dimensional simplex σ is combinatorial: a simplex is a set of k + 1 vertices σ = [x0, x1, ..., xk]. It is easy to understand how simplices can describe both pairwise and group interactions. An edge is a collection of two vertices [x0, x1], while larger sets represent groups, e.g. a 2-simplex is a "filled" triangle [x0, x1, x2], while the set of all its edges is [x0, x1], [x0, x2], [x1, x2]. Notice that a set of simplices constitutes a simplicial complex in much the same way that a set of edges defines a network. More formally, a simplicial complex K on a given set of vertices V, with |V| = N, is a collection of simplices satisfying two conditions: (i) if σ ∈ K, then all the possible subsimplices τ ⊂ σ constructed from any subset of σ are also contained in K, and (ii) the intersection of any two simplices σi, σj ∈ K is either empty or a face of both σi and σj.
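As an illustration of the combinatorial definition, the following minimal Python sketch builds the downward closure of a list of facets and checks condition (i). The function names `closure` and `is_closed` are hypothetical and are not taken from the authors' implementation; simplices are represented as sorted tuples of vertex labels.

```python
from itertools import combinations


def closure(facets):
    """Return every non-empty face of the given facets as sorted tuples."""
    simplices = set()
    for facet in facets:
        vertices = tuple(sorted(set(facet)))
        for size in range(1, len(vertices) + 1):
            simplices.update(combinations(vertices, size))
    return simplices


def is_closed(simplices):
    """Check condition (i): every non-empty face of a simplex is present."""
    simplices = {tuple(sorted(s)) for s in simplices}
    for simplex in simplices:
        for size in range(1, len(simplex)):
            if any(face not in simplices for face in combinations(simplex, size)):
                return False
    return True


# A filled triangle [0, 1, 2] glued to the edge [2, 3].
K = closure([(0, 1, 2), (2, 3)])
```

In this purely combinatorial representation, condition (ii) holds automatically once condition (i) does, since the intersection of two vertex sets is a subset of both.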

There are several ways to build a simplicial complex. A commonly used one is via clique complexes: we start from a network and promote each k-clique C to a (k−1)-simplex defined by the nodes of C. We then focus our attention on maximal cliques, since they are automatically promoted to facets, i.e. simplices that are not a face of another simplex of the complex, and a list of facets uniquely defines a simplicial complex. Moreover, each facet contains all its subsimplices [14]. In a clique complex, however, it is impossible to have an empty triangle ([x0, x1], [x0, x2], [x1, x2], without [x0, x1, x2]), since its structure is defined by its 1-dimensional skeleton, the collection of its edges (1-simplices). For this reason, in section II E we consider alternative ways to build simplices.
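The clique complex construction can be sketched as follows. This is a brute-force illustration suitable only for small graphs; the name `clique_complex` and the `max_dim` cap are assumptions made for the example.

```python
from itertools import combinations


def clique_complex(edges, max_dim):
    """Promote every clique with at most max_dim + 1 nodes to a simplex."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    vertices = sorted(adjacency)
    simplices = {(v,) for v in vertices}
    for size in range(2, max_dim + 2):
        for candidate in combinations(vertices, size):
            # A candidate is a clique when all of its pairs are adjacent.
            if all(b in adjacency[a] for a, b in combinations(candidate, 2)):
                simplices.add(candidate)
    return simplices


# Triangle 0-1-2 with a pendant edge 2-3.
K = clique_complex([(0, 1), (0, 2), (1, 2), (2, 3)], max_dim=2)
```

Because every 3-clique is promoted, the triangle (0, 1, 2) is necessarily filled, which is the limitation of clique complexes noted above.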

B. Hasse diagrams and persistent homology

Computational topology and Topological Data Analysis (TDA) are concerned with how to represent and quantify the structure of simplicial complexes [39]. One of the most common, indeed equivalent, mathematical representations of simplicial complexes is based on the Hasse diagram. This is a directed acyclic graph (DAG), where each vertex represents a k-simplex of the simplicial complex and there exists an edge connecting two vertices v0 and v1 if v0 ⊂ v1 and dim(v0) = dim(v1) − 1, i.e. they correspond to two simplices of consecutive dimensions (see Figure 2b for a graphical representation). The Hasse diagram is effectively the backbone of inclusions between simplices of different dimensions.
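A minimal sketch of this construction is given below, assuming simplices are stored as sorted tuples; the function name `hasse_diagram` is hypothetical. Each directed edge points from a simplex to a simplex of one dimension higher that contains it.

```python
from itertools import combinations


def hasse_diagram(simplices):
    """Return the nodes and the (face, coface) edges of the Hasse diagram."""
    nodes = {tuple(sorted(s)) for s in simplices}
    edges = set()
    for simplex in nodes:
        # Faces of codimension one are obtained by dropping a single vertex.
        for face in combinations(simplex, len(simplex) - 1):
            if face and face in nodes:
                edges.add((face, simplex))
    return nodes, edges


# All faces of a filled triangle.
triangle = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
nodes, edges = hasse_diagram(triangle)
```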

In addition, in many contexts we are required to make quantitative comparisons within and across datasets. Persistent homology [25, 40] was developed for the purpose of quantifying the shape of data by providing a multiscale description of its topological structure. This is achieved by encoding data as a sequence of simplicial complexes, called a filtration. Features of the complexes (connected components, one-dimensional cycles, three-dimensional cavities) persist along this sequence of simplicial complexes for different intervals, and their persistence defines how relevant they are for the shape of the dataset. In such a way, it is possible to identify unique meso-scale structures otherwise invisible to classic analytical tools. These intervals can also be mapped in a persistence diagram [41], and distances between persistence diagrams can then be used for such comparisons. Notably, persistence diagrams come with various notions of metric: in this work, we adopt one of the most flexible ones, the (sliced) Wasserstein distance $d^{\mathrm{SWS}}$ [42], which is often used to compare diagrams.

C. Random Walks and symbolic embedding

Random walks (RWs) are sequences of locations visited by a walker on a substrate. This substrate can be anything from a metric space to a discrete structure, like a network. Because of this flexibility, RWs are often used as exploration processes for different properties of the substrate. In particular, RWs can be unbiased or biased. In the former case, walkers have no preference for the future direction and choose uniformly at random from the possible next locations for a jump, e.g. among the neighbours of the node the walker is sitting on. Biased and correlated models were developed to account for situations in which some directions are more probable than others, and for situations in which the orientations of sequential steps are correlated [43].

In order to generate the symbolic sequences that will be processed by word2vec in the embedding process, we employed both the unbiased and biased approaches to random walks on the Hasse Diagram (for our purpose, we discard the direction of the inclusions in the DAG when performing a jump). In the unbiased model, the walker starts from any node of the graph and has uniform probability to move to any of the neighbours. This is the RW scheme used for the analysis of synthetic simplicial complexes in section III A.
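The unbiased scheme can be sketched in a few lines of Python. The helper names `undirected_hasse` and `unbiased_walk`, the walk length of 10, and the choice of starting one walk from every node are assumptions made for this example only.

```python
import random
from itertools import combinations


def undirected_hasse(simplices):
    """Neighbour lists of the Hasse diagram with edge directions discarded."""
    nodes = {tuple(sorted(s)) for s in simplices}
    neighbours = {node: set() for node in nodes}
    for simplex in nodes:
        for face in combinations(simplex, len(simplex) - 1):
            if face in nodes:
                neighbours[face].add(simplex)
                neighbours[simplex].add(face)
    return {node: sorted(adjacent) for node, adjacent in neighbours.items()}


def unbiased_walk(neighbours, start, length, rng):
    """Visit `length` nodes, choosing uniformly among neighbours at each step."""
    walk = [start]
    while len(walk) < length and neighbours[walk[-1]]:
        walk.append(rng.choice(neighbours[walk[-1]]))
    return walk


triangle = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
neighbours = undirected_hasse(triangle)
rng = random.Random(0)
walks = [unbiased_walk(neighbours, node, 10, rng) for node in sorted(neighbours)]
```

Every step of such a walk moves between simplices whose dimensions differ by exactly one, since those are the only adjacencies in the Hasse diagram.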

Figure 1. Random walk on simplices. At each time step t, a walker located on a k-simplex i of the Hasse diagram can jump to a simplex j with a certain probability $\tau_{ji}$, which is proportional to the weight $\omega_j$ of the simplex, $\tau_{ji} = \omega_j / \sum_r \omega_r$. In our analysis we use two different weighting schemes, namely (a) a bias towards lower order simplices and (b) a bias towards higher order simplices. [Panel labels: 0-simplex, 1-simplex, 2-simplex.]

Figure 2. Mapping simplicial complexes to metric spaces: Simplex2Vec. The collection of simplices, representing d-dimensional group interactions, are glued together in a simplicial complex (a). The simplicial complex is then represented through a Hasse diagram (b), which is a directed acyclic graph on the partially ordered set of simplices. A random walk on the nodes (k-simplices) of the Hasse diagram, graphically represented as a dashed green line in the panel, produces a symbolic sequence which preserves the high-order information on the topological structure of the simplicial complex. The symbolic sequences are then mapped into a metric space using word2vec and, finally, an agglomerative clustering method is used to cluster nodes into different modules. [Panel labels: (a) data structure: simplicial complex; (b) Hasse Diagram, Symbolic Sequences; (c) Word2vec, Metric space embedding.]

In the biased model we attach specific weights to the nodes of the Hasse diagram, and the transition probability at each step is proportional to the weight of the destination node (see Figure 1). Whereas the node weighting scheme can in principle be arbitrary, we implement two specific weighting schemes in the analysis: (a) a bias towards lower order simplices (lower order bias) and (b) a bias towards higher order simplices (higher order bias). The higher order bias scheme is described as follows. Initially all nodes in the Hasse diagram are assigned weight zero, and we sequentially consider the simplices in the data. Each n-simplex in the data adds weight 1 to its respective node in the Hasse diagram, weight 1/(n+1) to the (n−1)-simplices connected to the given n-simplex, weight 1/((n+1)n) to the (n−2)-simplices connected to the respective (n−1)-simplices, and so on. Overall, if the (n−k)-simplex j is directly reachable from the considered n-simplex, it receives the weight

$$\omega_j = \frac{1}{(n+1)\,n \cdots (n-k+2)}.$$

The lower order bias scheme is analogous to the higher order scheme, except that the additional weights ω are inverted:

$$\omega_j \to \frac{1}{\omega_j}.$$

The motivation behind the weighting schemes is to compare two different exploration strategies: the lower order bias tends to explore 0-simplices, whereas the higher order bias pushes the exploration to higher order simplices in order to better capture the higher order structure.
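The weighting scheme can be made concrete with a short sketch. It rests on two assumptions that the text does not settle: every face of a data simplex counts as "directly reachable" and receives its contribution once per data simplex, and the lower order scheme inverts each added contribution. The names `bias_weights` and `transition_probabilities` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def bias_weights(data_simplices, lower_order=False):
    """Accumulate node weights of the Hasse diagram from the data simplices."""
    weights = {}
    for simplex in data_simplices:
        simplex = tuple(sorted(simplex))
        n = len(simplex) - 1
        contribution = Fraction(1)
        for k in range(n + 1):
            if k > 0:
                # After k steps: 1 / ((n + 1) n ... (n - k + 2)).
                contribution /= n + 2 - k
            added = 1 / contribution if lower_order else contribution
            for face in combinations(simplex, n + 1 - k):
                weights[face] = weights.get(face, Fraction(0)) + added
    return weights


def transition_probabilities(weights, neighbours_of_i):
    """Jump probabilities proportional to the weight of each destination."""
    total = sum(weights[r] for r in neighbours_of_i)
    return {j: weights[j] / total for j in neighbours_of_i}


# One filled triangle; a walker sitting on the edge (0, 1) can reach
# its two vertices and the triangle itself.
w = bias_weights([(0, 1, 2)])
probs = transition_probabilities(w, [(0,), (1,), (0, 1, 2)])
```

For a single 2-simplex (n = 2) this gives weight 1 to the triangle, 1/3 to each edge and 1/6 to each vertex, in agreement with the formula for k = 0, 1, 2.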

While the biased random walk we implemented is motivated by the combinatorial structure of the inclusions among simplices in the Hasse Diagram itself, any weighting strategy for the RW can be chosen according to the needs of the user. The length L and the number of walks nw are important parameters that influence the results of the analysis as well, and usually should be chosen such that L nw ∼ |K|, where K is a complex and its cardinality corresponds to the number of simplices in it (or nodes in its Hasse Diagram).

Notice that in our formulation, RWs are essential to navigate the k-simplices of the simplicial complex. As a matter of fact, the sequence of locations visited by a RW can be regarded as a sequence of symbols that encode the local structure of adjacency of the substrate. For example, random walks are at the core of graph embedding techniques, e.g. node2vec [44] and DeepWalk [45]. These rely on word embeddings developed in the context of natural language processing and unsupervised learning, out of which the most commonly used is word2vec [46].

These models usually consist of two-layer neural networks that are trained to reconstruct linguistic contexts of words. Indeed, word2vec takes as its input a large corpus of texts and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share similar contexts in the dataset are positioned close to one another in the space [38]. Word2vec can either predict a word from its context or predict the context from a word. Details on the two main implementations can be found in [47].

DeepWalk leverages random walks to sample node contexts (their quasi-local environment), which are subsequently fed to word2vec in order to calculate the embedding of the nodes. The DeepWalk method computes a graph embedding in two steps. A graph is first sampled with random walks. Then, the random walks are treated as sentences in the word2vec approach and fed to its neural network. Node2vec is a modification of DeepWalk. In the node2vec approach, two additional parameters are employed in order to implement an exploration-exploitation trade-off mechanism. One parameter defines how probable it is that the random walk will discover the undiscovered part of the graph, while the other defines how probable it is that the random walk will go back to the node it visited in the step before.

D. Simplex2Vec

The method we propose here, Simplex2Vec, builds on these ideas and is based on constructing contexts for the nodes of a graph which may then be fed to the word2vec package. More specifically, we use RWs defined on the Hasse diagram associated with a simplicial complex. In such a way, by navigating through the k-simplices, each walk preserves high-order information on the topological structure of the simplicial complex. We sample a large set of random walks on the Hasse Diagram, which can be regarded as symbolic sequences or "words", and subsequently we embed the contexts created in this manner in a metric space by computing a vector representation (see Figure 2 for a graphical representation of the workflow). Note that since our RWs are defined on the Hasse diagram, our method simultaneously returns a vector representation for vertices and all of the higher-dimensional simplices (up to the chosen maximal simplicial dimension). A preliminary implementation of the Simplex2Vec method is available at https://github.com/lordgrilo/doublenegroni.
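The step from walks to an embedding can be sketched as follows. This is not the authors' implementation: the token format, the helper names and all hyperparameter values are assumptions, and the embedding step relies on the third-party gensim library (version 4 or later), which is imported lazily so that the corpus construction can be used on its own.

```python
def simplex_token(simplex):
    """Encode a simplex as a string symbol, e.g. (0, 1) -> '0-1'."""
    return "-".join(str(vertex) for vertex in simplex)


def walks_to_corpus(walks):
    """Turn each walk on the Hasse diagram into a 'sentence' of symbols."""
    return [[simplex_token(simplex) for simplex in walk] for walk in walks]


def embed(corpus, dimensions=16, window=5, seed=0):
    """Fit word2vec (skip-gram) on the corpus; requires gensim >= 4."""
    from gensim.models import Word2Vec  # third-party dependency

    model = Word2Vec(
        sentences=corpus,
        vector_size=dimensions,  # illustrative value, not from the paper
        window=window,           # illustrative value, not from the paper
        min_count=1,             # keep every simplex, even rarely visited ones
        sg=1,
        seed=seed,
        workers=1,
    )
    return {token: model.wv[token] for token in model.wv.index_to_key}


# A single short walk: vertex 0, edge (0, 1), vertex 1, edge (1, 2).
corpus = walks_to_corpus([[(0,), (0, 1), (1,), (1, 2)]])
```

Because every simplex visited by a walk becomes a token, the returned dictionary contains vectors for vertices and for higher-dimensional simplices alike.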

E. Data processing and simplicial construction

1. Costa-Farber Model

To explore the effect of higher order simplices on the embeddings and corresponding partitions, we use the Costa-Farber random simplicial complex model [48]. Costa and Farber proposed a simple and flexible construction for random simplicial complexes with randomness in all dimensions. It starts with a set of N vertices and retains each of them with probability p0 (usually set to 1); it then connects every pair of retained vertices by an edge with probability p1, and then fills in every 3-clique (closed path of edges) of the obtained random graph with a 2-simplex (a full triangle) with probability p2, and so on, constructing simplices of order k with probability pk. At the final step we obtain a random simplicial complex which depends on the set of probability parameters (p0, p1, ..., pk), with 0 ≤ pi ≤ 1 for all i = 0, ..., k.
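A minimal sketch of the two-parameter case (p0 = 1, with only p1 and p2 free) is shown below. The function name `costa_farber` and the return format are assumptions for the example, and the triple enumeration is brute force, so it is meant for small N only.

```python
import random
from itertools import combinations


def costa_farber(n_vertices, p1, p2, rng):
    """Sample vertices, edges and filled triangles (all vertices are kept)."""
    vertices = [(v,) for v in range(n_vertices)]
    edges = {
        pair for pair in combinations(range(n_vertices), 2) if rng.random() < p1
    }
    triangles = set()
    for triple in combinations(range(n_vertices), 3):
        # Only 3-cliques of the sampled graph are candidates for filling.
        is_clique = all(pair in edges for pair in combinations(triple, 2))
        if is_clique and rng.random() < p2:
            triangles.add(triple)
    return vertices, edges, triangles


full = costa_farber(4, 1.0, 1.0, random.Random(0))
graph_only = costa_farber(4, 1.0, 0.0, random.Random(0))
```

Setting p2 = 0 yields a plain random graph, while p2 = 1 yields the clique complex of that graph truncated at dimension two.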

2. Sociopatterns Data

We consider a dataset of face-to-face interactions between children and teachers of a primary school (LyonSchool) [49] and interactions between students of a high school (Thiers13) [50], available from sociopatterns.org. Interactions have a temporal resolution of 20 s, but we aggregated the data using a temporal window of ∆t = 15 min. Within each time window we define a graph by aggregating the pairwise interactions and compute all the maximal cliques that appear in that time window. These maximal cliques can be considered facets. We collect them across all windows and build a simplicial complex by aggregating over all the time windows. We are able to subsequently reconstruct all the lower-order simplices stemming from them, thus recovering the whole structure of the Hasse diagram. Note that, for computational reasons, in the following analysis we limit ourselves to considering simplices up to order 3. While higher-dimensional cliques are not included in the final simplicial complex, their sub-cliques up to size 4 are considered in the counting.
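The windowing and clique extraction can be sketched as follows, assuming contacts are given as (timestamp in seconds, person i, person j) triples; this input format and the function names are assumptions, not the format of the Sociopatterns files. Maximal cliques are found with a basic Bron-Kerbosch recursion.

```python
from collections import defaultdict


def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(tuple(sorted(current)))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques


def facets_by_window(contacts, window=900):
    """Collect the maximal cliques of each window (900 s = 15 min)."""
    graphs = defaultdict(lambda: defaultdict(set))
    for timestamp, i, j in contacts:
        graph = graphs[timestamp // window]
        graph[i].add(j)
        graph[j].add(i)
    facets = set()
    for graph in graphs.values():
        facets.update(maximal_cliques(dict(graph)))
    return facets


contacts = [(0, 1, 2), (20, 2, 3), (40, 1, 3), (1000, 1, 4)]
facets = facets_by_window(contacts)
```

A clique collected in one window may be a face of a larger clique from another window; taking the downward closure of the collected set then yields the aggregated complex.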

3. Functional Connectivity

Furthermore, we consider a dataset composed of group average fMRI time series correlations (n = 819). These correlations are understood to characterize relationships among brain regions, i.e., their degree of "functional connectivity" [51–53]. Data were acquired from the Human Connectome Project [54]. Specifically, we used an adjacency matrix of group-level correlations among 200 brain regions as input to the Hasse graph. The 200 regions are a projection from the raw fMRI voxel space onto 200 independent components (ICA200). The native connectome matrices (netmats) arrive as z-statistics of Pearson correlation r-values. Data were therefore preprocessed to normalize inputs to the range [−1, 1] by calculating the inverse Fisher transform, with all constant terms set to 0.18. We then calculated absolute values on the adjacency matrix because we were interested in describing k-dimensional coordinations (correlation with a 180 degree phase shift) among ICA components rather than distinguishing between strongly positive and strongly negative correlations. This is because coordinated activation and suppression of brain hemodynamics develops unique and important routes for information flow with respect to a given brain state. Finally, the adjacency matrix was masked with a threshold of the upper 5% of correlation values. Facets were constructed from the maximal cliques of the thresholded graphs, and included all sub-cliques. Equal weights were given to the random-walk transition probabilities between facets connected by an edge in the Hasse graph.
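The preprocessing chain can be sketched roughly as below. This is a simplified illustration under stated assumptions: the inverse Fisher transform is taken to be the plain hyperbolic tangent, the constant terms mentioned above are not modelled, and "upper 5%" is read as keeping the top fraction of region pairs ranked by absolute correlation. The function name `threshold_graph` is hypothetical.

```python
from itertools import combinations
from math import tanh


def threshold_graph(z_matrix, keep_fraction=0.05):
    """Keep the strongest region pairs after |tanh(z)| normalization."""
    n = len(z_matrix)
    strength = {
        (i, j): abs(tanh(z_matrix[i][j])) for i, j in combinations(range(n), 2)
    }
    n_keep = max(1, int(keep_fraction * len(strength)))
    ranked = sorted(strength, key=strength.get, reverse=True)
    return set(ranked[:n_keep])


# Toy symmetric matrix of z-values for four regions.
z = [
    [0.0, 2.0, -3.0, 0.1],
    [2.0, 0.0, 0.5, -0.2],
    [-3.0, 0.5, 0.0, 1.0],
    [0.1, -0.2, 1.0, 0.0],
]
edges = threshold_graph(z, keep_fraction=0.5)
```

Note that the strongly negative pair (0, 2) is retained, reflecting the choice to treat strong positive and strong negative correlations alike.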

F. Clustering

When considering a graph structure, clustering of the nodes is usually carried out by directly probing the topological structure of the graph itself. This task goes under the name of community detection, and many algorithms have been developed for this purpose [21, 55–57]. In the case of simplicial complexes, however, it is less clear how one might want to re-define classical community detection algorithms. Against this background, embeddings are very helpful: indeed, one of the main purposes of embeddings is to represent data in a non-trivial metric space, where it becomes easier for clustering algorithms to partition data into sensible groups. In order to test the quality of our embedding procedure, we run the widespread hierarchical clustering algorithm [58] on the embedded representations of the nodes. We then test the quality of the obtained results against the true labels of groups of nodes, by means of the normalized mutual information (NMI) metric [59]. Even though more sophisticated clustering algorithms have been devised over the years, the good results that can already be obtained with this relatively simple procedure show the validity of our embedding approach.
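For reference, one common form of the NMI between two labellings is sketched below, using the normalization 2 I(A; B) / (H(A) + H(B)). Other normalizations exist, and the one used in [59] is not specified here, so this choice is an assumption.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a labelling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labellings of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """Normalized mutual information, 2 I(A; B) / (H(A) + H(B))."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        return 1.0  # both partitions are trivial and therefore identical
    return 2 * mutual_information(a, b) / (h_a + h_b)
```

The score does not depend on how clusters are named: `nmi([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1 up to floating-point rounding, while two statistically independent partitions score 0.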

Figure 3. Effect of higher order interactions. (a) Sliced Wasserstein distance $d^{\mathrm{SWS}}_1(X[p_1, 0], X[p_1, p_2])$ between the one-dimensional persistent homology group H1 of the embedding corresponding to a pure network (p2 = 0) and that of the embedding obtained for simplicial complexes, as a function of p2 (walk length L = 30) and the number of nodes N of the network. (b) $d^{\mathrm{SWS}}_1$ distance for the embeddings of Costa-Farber simplicial complexes as a function of p2 and different walk lengths. We find that for walk lengths L ≪ N the embeddings are stable, and display smaller changes as compared with the effect of p2 (number of nodes N = 200). For visualization purposes, for fixed N in (a) or L in (b), we normalize $d^{\mathrm{SWS}}$ by the maximum distance value $d^{\mathrm{SWS}}_{\max}$ observed. [Axes: 2-simplex probability p2 (horizontal), $d^{\mathrm{SWS}}/d^{\mathrm{SWS}}_{\max}$ (vertical). Legends: (a) N = 100, 200, 300, 400; (b) L = 5, 10, 20, 40, 50.]

III. RESULTS

A. Evolution of embedding structure

The first step toward understanding the role of higher order interactions in shaping the resulting embedding is quantifying the difference between the embeddings obtained with and without the contribution of higher order interactions.

To investigate this, we construct a series of simplicial complexes using the Costa-Farber random simplicial complex model (described in the Methods, section II E 1). In this case, the model has only two parameters: the probability p1 of two nodes being connected (exactly as in the standard Erdős-Rényi model), and the probability p2 that a 3-clique in the underlying 1-skeleton will be filled by a 2-simplex. In particular, when p2 = 0 the simplicial complex is one-dimensional and therefore corresponds to a standard graph. Hence, for a fixed number of nodes N and edge probability p1, we can investigate the effect of introducing higher order simplices by gradually increasing p2.

We quantify the effects of increasing p2 by computing the one-dimensional persistent homology of the embeddings (using the standard Rips-Vietoris filtration) and then calculating the sliced Wasserstein distance between the resulting persistence diagrams (Figure 3a). We find large changes in the embedding topology already for small p2 values; in particular, the variability of the topology across different realizations of the Costa-Farber model for the same p2 value is much smaller than the change induced by increasing p2. We show the same general behaviour for different numbers of vertices N (Figure 3a) and for different walk lengths L, which is one of the main parameters of the embedding construction (Figure 3b).

B. Effects of higher-order interactions on clustering stability

While we showed that the overall topology of the embedding is stable over realizations of the same random model, this does not necessarily imply that the local structure of the embedding is stable too. This is important to understand, however, because the local relations between the embedded points are relevant for the resulting clustering partition. We check this as follows: we construct a graph, that is, an instance X of the Costa-Farber model without triangles (p2 = 0); then, starting from the same graph, we add triangles with different p2 and compute the corresponding embeddings. We find that already when small amounts of higher-order interactions are present, the partitions are very dissimilar from the one obtained considering only the edge structure (Figure 4a). The same is true when we compare across different values of p2. We find in fact that the partition structure consistently shows a low similarity across different triangle densities (i.e. average similarity ∼ 0.4, Figure 4b). Note also that different realizations with the same triangle density (the diagonal in Figure 4b) result in partitions that are much closer than those obtained for different triangle densities, once more highlighting the critical role of higher order interactions.

C. Simplicial communities in social systems

We now test whether using the information contained in high-order interactions is relevant to obtaining information about the community structure of a simplicial system. We investigate this by considering datasets of temporal face-to-face interactions collected by the Sociopatterns collaboration: data from contacts between students in an elementary school and in a high school [60, 61].

Following Iacopini et al. [29] (see section II E 2), we aggregate interactions over small non-overlapping time windows (15 minutes) in sub-networks. We then consider each fully connected sub-graph of interactions created within a window as a simplex and construct the simplicial complex by aggregating all interactions across all windows.

We compute the Simplex2Vec embedding for each dataset and detect clusters in the embeddings using standard clustering techniques. In Figures 5 and 6 we display the results of the cluster detection on the embeddings built using simplices up to dimension k = 4 for the Sociopatterns datasets and compare them with data on which class the students belong to. Both figures contain results for both weighting schemes: biased toward lower order simplices (top row) and biased toward higher order simplices (bottom row). We find a better NMI (∼ 0.5) for the second case, highlighting again the role of higher order simplices in detecting the large and meso-scale structure of the system.

While these results are already encouraging, it is interesting to quantify how much information is encoded in higher-order interactions. This can easily be done by capping the maximal simplex dimension that we allow in the Hasse Diagram used to construct the embedding. In other words, we repeat the construction of the embeddings but consider only simplices up to 1-simplices (edges), then up to 2-simplices (triangles), then up to 3-simplices (tetrahedra), and so on.

This provides us with a natural complexity ladder for the simplicial embeddings. For each maximal simplicial dimension, we repeat the whole construction and compute the corresponding clusters. As expected, we find that, when we bias toward lower-order simplices, increasing the order of the Hasse Diagram has little to no effect on the resulting NMI. In contrast, when considering the bias toward higher order simplices, we find a clear increase in the resulting NMI (Figure 7).
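The ladder itself amounts to truncating the complex at successive dimensions before rebuilding the Hasse diagram. A minimal sketch, with hypothetical helper names and a single tetrahedron as toy data:

```python
from itertools import combinations


def closure(facets):
    """Return every non-empty face of the given facets as sorted tuples."""
    simplices = set()
    for facet in facets:
        vertices = tuple(sorted(set(facet)))
        for size in range(1, len(vertices) + 1):
            simplices.update(combinations(vertices, size))
    return simplices


def truncate(simplices, max_dim):
    """Keep only simplices of dimension at most max_dim."""
    return {s for s in simplices if len(s) - 1 <= max_dim}


# Complexity ladder for a single tetrahedron: edges, triangles, tetrahedron.
complex_ = closure([(0, 1, 2, 3)])
ladder = {d: truncate(complex_, d) for d in (1, 2, 3)}
```

Each rung of `ladder` would then be passed through the same walk, embedding and clustering steps, so that the resulting NMI values can be compared across maximal dimensions.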

D. Communities of fMRI functional connectivity

We compute Simplex2Vec embeddings for the average functional connectivity matrix described in section II E 3, using different maximal simplicial dimensions in the Hasse Diagrams. We then perform clustering over the different embeddings and show the results in Figure 8 (only points corresponding to brain regions are shown). We find a wide variability in the results of both the embedding and clustering. We use this variability to ask which regions display a conserved community participation structure across different maximal

Figure 4. Higher-order interactions disrupt cluster structures. (a) Clustering similarity between the cluster partitions obtained at p2 = 0 and those obtained for positive p2 (N = 200). Interestingly, even with a small increase in p2, we clearly find that the partitions obtained are very dissimilar from the one obtained considering only the edge structure. Analogous results are obtained when comparing partitions across different values of p2 (b). [Panel (a): "CluSim with Node2Vec", clustering similarity versus p2. Panel (b): matrix over pairs of p2 values from 0.0 to 0.2, color scale "CluSim mean".]

[Figure 5: plot omitted. Rows: Lower order bias (a, b) and Higher order bias (c, d). Columns: Real data (a, c) and Predicted labels (b, d). Legend: LyonSchool, Students, Teachers.]

Figure 5. Simplicial communities for face-to-face interaction data in a primary school. We present the Simplex2Vec embedding of group gatherings of students of a primary school (LyonSchool). The real labelling of the node embedding is represented in (a) and (c), where the colors represent attribution to the same class and crosses (+) denote teachers. Random walk bias towards lower order simplices (b) produces a less distinguishable embedding in comparison with the higher order bias (d). Higher order bias also makes it possible to glimpse the teachers' attribution to classes. The coloring order between real data and predicted labels does not match and is presented only to make the partitions distinguishable.


[Figure 6: plot omitted. Rows: Lower order bias (a, b) and Higher order bias (c, d). Columns: Real data (a, c) and Predicted labels (b, d). Legend: Thiers13, Students.]

Figure 6. Simplicial communities for face-to-face interaction data in a high school. We present the Simplex2Vec embedding of group gatherings of students of a high school (Thiers13). The real labelling of the node embedding is represented in (a) and (c), where the colors represent attribution to the same class. This dataset does not have separate labels for teachers. Random walk bias towards lower order simplices (b) produces a less distinguishable embedding in comparison with the higher order bias (d). The coloring order between real data and predicted labels does not match and is presented only to make the partitions distinguishable.

[Figure 7: plot omitted. Panels: (a) Lower order bias, (b) Higher order bias. Axes: Order threshold (horizontal, 1 to 7) and NMI score (vertical, 0.2 to 0.8). Legend: LyonSchool, Thiers13.]

Figure 7. Similarity between simplicial embedding partitions and ground truth grows with maximal simplicial dimension. We plot the NMI values for the partition (obtained via agglomerative clustering) from the Simplex2Vec embeddings using random walks that are biased towards (a) lower order simplices and (b) higher order simplices for two datasets of face-to-face interactions. We clearly find that biasing towards higher order simplices results in an improved NMI between the detected clusters and the class ground truth.


simplicial orders. For each region $i$ and maximum simplicial order $d$, we compute a community participation vector $c_i(d)$. We then compute the cosine similarity between $c_i(1)$, corresponding to considering only edges, and each of the higher order forms, and average it over all $d \in (2, 6)$: $s_i = \langle \cos(c_i(1), c_i(d)) \rangle_d$.
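A minimal sketch of how such a score could be computed is given below. It rests on an assumption that is ours for illustration only: the community participation vector of region i is taken to be the binary vector marking which regions share a cluster with i. The function names and the toy cluster labels are hypothetical.

```python
import numpy as np


def participation_vectors(labels):
    """Row i is a binary vector marking the regions clustered with region i."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def stability_scores(labels_by_order, reference_order=1):
    """Average cosine similarity between c_i(reference) and every other c_i(d)."""
    ref = participation_vectors(labels_by_order[reference_order])
    ref_norm = np.linalg.norm(ref, axis=1)
    sims = []
    for order, labels in labels_by_order.items():
        if order == reference_order:
            continue
        cur = participation_vectors(labels)
        # Row-wise cosine; norms are never zero because each region
        # always shares a cluster with itself.
        cosine = (ref * cur).sum(axis=1) / (ref_norm * np.linalg.norm(cur, axis=1))
        sims.append(cosine)
    return np.mean(sims, axis=0)


# Hypothetical cluster labels for five regions at maximal orders 1, 2 and 3.
labels_by_order = {
    1: [0, 0, 1, 1, 1],
    2: [1, 1, 0, 0, 0],  # same partition with swapped label names
    3: [0, 0, 0, 1, 1],  # region 2 switches community
}
s = stability_scores(labels_by_order)
print(np.round(s, 3))
```

With this choice of participation vector the score does not depend on how clusters are named at each order, only on which regions are grouped together; in the toy example the region that switches community receives the lowest score.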

We use $s_i$ as a coarse indicator of which regions have integration patterns that are prone to be affected by higher order interactions (low $s_i$), and which regions are less prone to be affected by them (high $s_i$). Regions strongly affected across repetitions include the reticular formation, the secondary visual cortex (including early areas of the dorsal and ventral streams), the conjunction of the secondary somatosensory cortex and thalamic nuclei, and single components that highlighted distributed nuclei in the brain stem. A very large plurality of regions in the cerebellum were stable across maximal interaction orders, as were focal brain stem nuclei, large patches of the bilateral dorso-ventral thalamus, and the caudate nucleus.

IV. DISCUSSION

Embeddings are a very general tool to study the structure and, potentially, the dynamics of interacting systems, thanks to their capacity to represent discrete systems in continuous spaces. This motivated intense research to develop informative and scalable algorithms able to construct such embeddings for unstructured and, more recently, network data. However, with increasing frequency and across a range of disciplines, datasets are produced in which interactions are not well described by pairs of agents, but rather involve a larger and at times heterogeneous number of agents, with applications ranging from structure and prediction in temporal networks [32, 62] to contagion dynamics [29]. Simplicial complexes are a promising way of representing such interactions, but the identification of communities remains an important challenge. The current work takes a step towards filling this gap by providing a solution based on developments in symbolic embeddings.

We showed in simple case studies that higher order simplices can strongly alter the community structure of an interacting system and should therefore be considered (or at least their contribution tested). We provide evidence from a social system that simplicial contributions are important to improve the correspondence of the detected clusters to the ground truth. Finally, using a parsimonious example from network neuroscience, we showed how to assess which nodes in a system are more (or less) prone to be affected by higher order interactions.

Naturally, due to time constraints, this work has many limitations, which we plan to address in the near future. In particular, interesting avenues for future work include:

1. extending the Simplex2Vec analysis to other dynamical systems, where the procedure described in section II E 2 is relevant to extract group interactions over small temporal scales;

2. testing the capacity of Simplex2Vec to predict missing interactions in arbitrary dimension: since Simplex2Vec simultaneously embeds all interactions, it is possible to generate predictions for any group size;

3. extending the community analysis to higher order simplices: this has already been done to some degree by considering edge-communities (in neuroscience for example, see [63]), but Simplex2Vec provides a general framework for this;

4. exploring a wider range of RW navigation biasing schemes (for example, maximally entropic RW [64]) and including data-driven weights on simplices;

5. and, finally, relating the results of unconstrained RW with those of RW constrained to move according to the combinatorial Laplacian [65, 66] (walks constrained to simplices in dimension k ± 1).

V. ACKNOWLEDGMENTS

This work was produced by the DoubleNegroni group at the Complexity72h Workshop, held at IMT School in Lucca, 17-21 June 2019. Website: complexity72h.weebly.com

Data were provided [in part] by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University. A.M. acknowledges support of the Fonds de la Recherche Scientifique-FNRS under Grant n. 33722509 (CollectiveFootprints) and the grant 19-01-00682 of the Russian Foundation for Basic Research. G.P. and J.C.W.B. acknowledge partial support from Compagnia San Paolo (ADnD grant). F.M. acknowledges the support of Istituto Nazionale di Fisica Nucleare (INFN), under the BIOPHYS experiment grant.


Figure 8. Functional Connectivity Embeddings. We show the results of the Simplex2Vec construction for increasing maximum simplicial dimension. We find a large variability across brain regions in the embedding and clustering (colors) results.

[1] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Rev. Mod. Phys., 74(1):47–97, 2002.

[2] Mark Newman. Networks: An Introduction. Oxford University Press, Inc., New York, NY, USA, 2010.

[3] Vito Latora, Vincenzo Nicosia, and Giovanni Russo. Complex Networks: Principles, Methods and Applications. Cambridge University Press, 2017.

[4] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D-U. Hwang. Complex networks: Structure and dynamics. Phys. Rep., 424(4-5):175–308, 2006.

[5] Alain Barrat, Marc Barthelemy, Romualdo Pastor-Satorras, and Alessandro Vespignani. The architecture of complex weighted networks. Proceedings of the national academy of sciences, 101(11):3747–3752, 2004.

[6] Paolo Bajardi, Matteo Delfino, André Panisson, Giovanni Petri, and Michele Tizzoni. Unveiling patterns of international communities in a global city using mobile phone data. EPJ Data Science, 4(1):3, 2015.

[7] Corrie J Carstens and Kathy J Horadam. Persistent homology of collaboration networks. Mathematical problems in engineering, 2013, 2013.

[8] Anthony Bonato, Jeannette Janssen, and Paweł Prałat. A geometric model for on-line social networks. In Proceedings of the International Workshop on Modeling Social Media, page 4. ACM, 2010.

[9] Klaus B Schebesch and Ralf W Stecking. Topological data analysis for extracting hidden features of client data. In Operations research proceedings 2015, pages 483–489. Springer, 2017.

[10] Vedran Sekara, Arkadiusz Stopczynski, and Sune Lehmann. Fundamental structures of dynamic social networks. Proceedings of the national academy of sciences, 113(36):9977–9982, 2016.

[11] Alice Patania, Giovanni Petri, and Francesco Vaccarino. The shape of collaborations. EPJ Data Science, 6(1):18, Aug 2017.

[12] José J Ramasco, Sergey N Dorogovtsev, and Romualdo Pastor-Satorras. Self-organization of collaboration networks. Physical review E, 70(3):036106, 2004.

[13] Chad Giusti, Eva Pastalkova, Carina Curto, and Vladimir Itskov. Clique topology reveals intrinsic geometric structure in neural correlations. Proceedings of the National Academy of Sciences, 112(44):13455–13460, 2015.

[14] Giovanni Petri, Paul Expert, Federico Turkheimer, Robin Carhart-Harris, David Nutt, Peter J Hellyer, and Francesco Vaccarino. Homological scaffolds of brain functional networks. Journal of The Royal Society Interface, 11(101):20140873, 2014.

[15] Louis-David Lord, Paul Expert, Henrique M Fernandes, Giovanni Petri, Tim J Van Hartevelt, Francesco Vaccarino, Gustavo Deco, Federico Turkheimer, and Morten L Kringelbach. Insights into brain architectures from the homological scaffolds of functional connectivity networks. Frontiers in systems neuroscience, 10:85, 2016.

[16] Danielle S Bassett and Olaf Sporns. Network neuroscience. Nature neuroscience, 20(3):353, 2017.

[17] Paul Bendich, James S Marron, Ezra Miller, Alex Pieloch, and Sean Skwerer. Persistent homology analysis of brain artery trees. The annals of applied statistics, 10(1):198, 2016.

[18] Jaejun Yoo, Eun Young Kim, Yong Min Ahn, and Jong Chul Ye. Topological persistence vineyard for dynamic functional brain connectivity during resting and gaming stages. Journal of neuroscience methods, 267:1–13, 2016.

[19] Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.

[20] Andrea Lancichinetti, Filippo Radicchi, José J Ramasco, and Santo Fortunato. Finding statistically significant communities in networks. PloS one, 6(4):e18961, 2011.

[21] Mark EJ Newman. Modularity and community structure in networks. Proceedings of the national academy of sciences, 103(23):8577–8582, 2006.

[22] Mark EJ Newman. Spectral methods for community detection and graph partitioning. Physical Review E, 88(4):042822, 2013.

[23] Muhammad Aqib Javed, Muhammad Shahzad Younis, Siddique Latif, Junaid Qadir, and Adeel Baig. Community detection in networks: A multidisciplinary review. Journal of Network and Computer Applications, 108:87–111, 2018.

[24] Alice Patania, Francesco Vaccarino, and Giovanni Petri. Topological analysis of data. EPJ Data Science, 6(1):7, Jun 2017.

[25] Robert Ghrist. Barcodes: the persistent topology of data. Bulletin of the American Mathematical Society, 45(1):61–75, 2008.

[26] Gunnar Carlsson, Afra Zomorodian, Anne Collins, and Leonidas J Guibas. Persistence barcodes for shapes. International Journal of Shape Modeling, 11(02):149–187, 2005.

[27] Frédéric Chazal and Bertrand Michel. An introduction to topological data analysis: fundamental and practical aspects for data scientists. arXiv preprint arXiv:1710.04019, 2017.

[28] Renaud Lambiotte, Martin Rosvall, and Ingo Scholtes. From networks to optimal higher-order models of complex systems. Nature physics, page 1, 2019.

[29] Iacopo Iacopini, Giovanni Petri, Alain Barrat, and Vito Latora. Simplicial models of social contagion. Nature communications, 10(1):2485, 2019.

[30] Per Sebastian Skardal and Alex Arenas. Abrupt desynchronization and extensive multistability in globally coupled oscillator simplexes. Phys. Rev. Lett., 122:248301, Jun 2019.

[31] Ginestra Bianconi and Robert M Ziff. Topological percolation on hyperbolic simplicial complexes. Physical Review E, 98(5):052308, 2018.

[32] Giovanni Petri and Alain Barrat. Simplicial activity driven model. Physical review letters, 121(22):228301, 2018.

[33] Slobodan Maletić, Yi Zhao, and Milan Rajković. Persistent topological features of dynamical systems. Chaos: An Interdisciplinary Journal of Nonlinear Science, 26(5):053105, 2016.

[34] Owen T Courtney and Ginestra Bianconi. Generalized network structures: The configuration model and the canonical ensemble of simplicial complexes. Physical Review E, 93(6):062311, 2016.

[35] Ginestra Bianconi and Christoph Rahmede. Complex quantum network manifolds in dimension d > 2 are scale-free. Scientific reports, 5:13979, 2015.

[36] Ginestra Bianconi and Christoph Rahmede. Emergent hyperbolic network geometry. Scientific reports, 7:41974, 2017.

[37] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.

[38] Yoav Goldberg and Omer Levy. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.

[39] Herbert Edelsbrunner and John Harer. Computational topology: an introduction. American Mathematical Soc., 2010.

[40] Giovanni Petri, Martina Scolamiero, Irene Donato, and Francesco Vaccarino. Topological strata of weighted complex networks. PloS one, 8(6):e66506, 2013.

[41] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.

[42] Yuriy Mileyko, Sayan Mukherjee, and John Harer. Probability measures on the space of persistence diagrams. Inverse Problems, 27(12):124007, 2011.

[43] Clifford S Patlak. Random walk with persistence and external bias. The bulletin of mathematical biophysics, 15(3):311–338, Sep 1953.

[44] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.

[45] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.

[46] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. arXiv e-prints, page arXiv:1301.3781, Jan 2013.

[47] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 3111–3119, USA, 2013. Curran Associates Inc.

[48] Armindo Costa and Michael Farber. Random simplicial complexes. In Configuration spaces, pages 129–153. Springer, 2016.

[49] Valerio Gemmetto, Alain Barrat, and Ciro Cattuto. Mitigation of infectious disease at school: targeted class closure vs school closure. BMC infectious diseases, 14(1):695, 2014.

[50] Mathieu Génois and Alain Barrat. Can co-location be used as a proxy for face-to-face contacts? EPJ Data Science, 7(1):11, May 2018.

[51] Stephen M. Smith, Christian F. Beckmann, Jesper Andersson, Edward J. Auerbach, Janine Bijsterbosch, Gwenaëlle Douaud, Eugene Duff, David A. Feinberg, Ludovica Griffanti, Michael P. Harms, Michael Kelly, Timothy Laumann, Karla L. Miller, Steen Moeller, Steve Petersen, Jonathan Power, Gholamreza Salimi-Khorshidi, Abraham Z. Snyder, An T. Vu, Mark W. Woolrich, Junqian Xu, Essa Yacoub, Kamil Uğurbil, David C. Van Essen, and Matthew F. Glasser. Resting-state fMRI in the human connectome project. NeuroImage, 80:144–168, 2013. Mapping the Connectome.

[52] Jacob C.W. Billings, Alessio Medda, Sadia Shakil, Xiaohong Shen, Amrit Kashyap, Shiyang Chen, Anzar Abbas, Xiaodi Zhang, Maysam Nezafati, Wen-Ju Pan, Gordon J. Berman, and Shella D. Keilholz. Instantaneous brain dynamics mapped to a continuous state space. NeuroImage, 162:344–352, 2017.

[53] Caleb Geniesse, Olaf Sporns, Giovanni Petri, and Manish Saggar. Generating dynamical neuroimaging spatiotemporal representations (dyneusr) using topological data analysis. Network Neuroscience, 0(0):1–16, 2019.

[54] D.C. Van Essen, K. Ugurbil, E. Auerbach, D. Barch, T.E.J. Behrens, R. Bucholz, A. Chang, L. Chen, M. Corbetta, S.W. Curtiss, S. Della Penna, D. Feinberg, M.F. Glasser, N. Harel, A.C. Heath, L. Larson-Prior, D. Marcus, G. Michalareas, S. Moeller, R. Oostenveld, S.E. Petersen, F. Prior, B.L. Schlaggar, S.M. Smith, A.Z. Snyder, J. Xu, and E. Yacoub. The human connectome project: A data acquisition perspective. NeuroImage, 62(4):2222–2231, 2012. Connectivity.

[55] Martin Rosvall and Carl T Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008.

[56] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.

[57] Vincent A Traag, Ludo Waltman, and Nees Jan van Eck. From louvain to leiden: guaranteeing well-connected communities. Scientific reports, 9, 2019.

[58] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001.

[59] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

[60] Rossana Mastrandrea, Julie Fournet, and Alain Barrat. Contact patterns in a high school: A comparison between data collected using wearable sensors, contact diaries and friendship surveys. PLoS ONE, 10(9):e0136497, Sep 2015.

[61] Valerio Gemmetto, Alain Barrat, and Ciro Cattuto. Mitigation of infectious disease at school: targeted class closure vs school closure. BMC infectious diseases, 14(1):695, Dec 2014.

[62] Austin R Benson, Rediet Abebe, Michael T Schaub, Ali Jadbabaie, and Jon Kleinberg. Simplicial closure and higher-order link prediction. Proceedings of the National Academy of Sciences, 115(48):E11221–E11230, 2018.

[63] Demian Battaglia, Boudou Thomas, Enrique CA Hansen, Sabrina Chettouf, Andreas Daffertshofer, Anthony R McIntosh, Joelle Zimmermann, Petra Ritter, and Viktor Jirsa. Functional connectivity dynamics of the resting state across the human adult lifespan. bioRxiv, page 107243, 2017.

[64] Roberta Sinatra, Jesús Gómez-Gardenes, Renaud Lambiotte, Vincenzo Nicosia, and Vito Latora. Maximal-entropy random walks in complex networks with limited information. Physical Review E, 83(3):030103, 2011.

[65] Abubakr Muhammad and Magnus Egerstedt. Control using higher order laplacians in network topologies. In Proc. of 17th International Symposium on Mathematical Theory of Networks and Systems, pages 1024–1038. Citeseer, 2006.

[66] Michael T Schaub, Austin R Benson, Paul Horn, Gabor Lippner, and Ali Jadbabaie. Random walks on simplicial complexes and the normalized hodge laplacian. arXiv preprint arXiv:1807.05044, 2018.