
Wikipedia graph mining: dynamic structure of collective memory

Volodymyr Miz, Kirell Benzi, Benjamin Ricaud, and Pierre Vandergheynst
Ecole Polytechnique Fédérale de Lausanne

Lausanne, Switzerland
first.last@epfl.ch

ABSTRACT
Wikipedia is the biggest encyclopedia ever created and the fifth most visited website in the world. Tens of millions of people surf it every day, seeking answers to various questions. Collective user activity on its pages leaves publicly available footprints of human behavior, making Wikipedia an excellent source for analysis of collective behavior.

In this work, we propose a new method to analyze and retrieve collective memories, the way social groups remember and recall the past. We use the Hopfield network model as an artificial memory abstraction to build a macroscopic collective memory model. To reveal memory patterns, we analyze the dynamics of visitor activity on Wikipedia and its Web network structure. Each pattern in the Hopfield network is a cluster of Wikipedia pages sharing a common topic and describing an event that triggered human curiosity during a finite period of time. We show that these memories can be remembered with good precision using a memory network. The presented approach is scalable and we provide a distributed implementation of the algorithm.

CCS CONCEPTS
• Information systems → Wrappers (data mining); Web log analysis; • Networks → Network dynamics; • Human-centered computing → Wikis;

KEYWORDS
Collective memory, Graph Algorithm, Hopfield Network, Wikipedia, Web Logs Analysis

ACM Reference Format:
Volodymyr Miz, Kirell Benzi, Benjamin Ricaud, and Pierre Vandergheynst. 2018. Wikipedia graph mining: dynamic structure of collective memory. In Proceedings of (Submitted to KDD’18), 9 pages. https://doi.org/10.475/123_4

1 INTRODUCTION
Over recent years, the Web has significantly affected the way people learn, interact in social groups, and store and share information. Apart from being an essential part of modern life, social networks, online

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Submitted to KDD’18, August 19–23, London, United Kingdom
© 2018 Association for Computing Machinery.
ACM ISBN 123-4567-24-567/08/06... $15.00
https://doi.org/10.475/123_4

services, and knowledge bases generate a massive amount of logs, containing traces of global online activity on the Web. A large-scale example of such publicly available information is the Wikipedia knowledge base and its history of visitor activity. This data is a great source for collective human behavior analysis at scale. For this reason, the analysis of Wikipedia in this area has become popular over recent years [27], [14], [21].

Collective memory [17] is an interesting social phenomenon of human behavior. Studying this concept is a way to enhance our understanding of the common view of events in social groups and to identify the events that influence remembering of the past. Early research on collective memory relied on interviews and self-reports, which led to a limited number of subjects and biased results [26]. The availability of Web activity data opened new opportunities toward systematic studies at a much larger scale [10], [16], [14]. Nonetheless, the general nature of collective memory formation and its modeling remain open questions. Can we model collective and individual memory formation similarly? Is it possible to find collective memories and behavior patterns inside a collaborative knowledge base? In this work, we adopt a data-driven approach to shed some light on these questions.

An essential part of the Web visitor activity data is the underlying human-made graph structure that was initially introduced to facilitate navigation. The combination of the activity dynamics and structure of Web graphs suggests a similarity to biological neural networks. A good example of such a network is the human brain. Numerous neurons in the brain constitute a biological dynamic neural network, where dynamics are expressed in terms of neural spikes. This network is in charge of perception, decision making, storing memories, and learning.

During learning, neurons in our brain self-organize and form strongly connected groups called neural assemblies [1]. These groups express similar activation patterns in response to a specific stimulus. When learning is completed and the stimulus is applied once again, the reactions of the assemblies correspond to consistent dynamic activity patterns, i.e. memories. Synaptic plasticity mechanisms govern this self-organization process.

Hebbian learning theory [18] proposes an explanation of this self-organization and describes the basic rules that guide the network design. The theory implies that simultaneous activation of a pair of neurons leads to an increase in the strength of their connection. In general, the Hebbian theory implies that neural activity transforms brain networks. This assumption leads to an interesting question: can temporal dynamics cause a self-organization process in the Wikipedia Web network, similar to the one driven by neural spikes in the brain?

arXiv:1710.00398v5 [cs.IR] 14 Feb 2018

Submitted to KDD’18, August 19–23, London, United Kingdom V. Miz et al.

To answer this question, we introduce a collective memory modeling approach inspired by the theory of learning in the brain, in particular, a content-addressable memory system, the Hopfield network [19]. In our experiments, we use a dynamic network of Wikipedia articles, where the dynamics comes from the visitor activity on each page, i.e. the number of visits per page per hour. We assume that the Wikipedia network can self-organize similarly to a Hopfield network with a modified Hebbian learning rule and learn collective memory patterns under the influence of visitor activity, similarly to neurons in the brain. The results of our experiments demonstrate that memorized patterns correspond to groups of collective memories, containing clusters of linked pages that have a closely related meaning. The topic of a cluster corresponds to a real-world event that triggered the interest of Wikipedia visitors during a finite period of time. In addition, the collective memory gives access to the time lapse when the event occurred. We also show that our collective memory model is able to recall an event recorded in the collective memory, given only a part of the event cluster. Here the term recall means that we recover a missing fraction of the visitor activity signal in a memory cluster.

Contributions. Our contributions are as follows:
• We propose a novel collective memory learning framework, inspired by artificial models of individual memory – the Hebbian learning theory and the Hopfield network model.

• We formalize our findings into a content-addressable model of collective memories. So far, collective memory [17] has been considered just as a concept that lacks a general model of memory formation. In the experiments we demonstrate that, given a modified Hebbian learning rule, the Wikipedia hyperlink network can self-organize and gain properties reminiscent of an associative memory.

• We develop a graph algorithm for collective memory extraction. Computations are local on the graph, which allows an efficient implementation. We provide a distributed implementation of the algorithm to show that it can handle dense graphs and massive amounts of time-series data.

• We present graph visualizations as an interactive tool for collective memory studies.

The rest of this paper is organized as follows. In Section 2, we give an overview of related work on large-scale collective memory research and analysis. Section 3 describes the dataset and data preprocessing. We then present the graph algorithm in Section 4 and the community detection approach in Section 5. We discuss the results of our experiments and evaluate the discovered memory properties in Section 6. Finally, information about the data, the code and tools used for the study, as well as online visualizations, is provided in Section 8.

2 RELATED WORK
The term Collective memory first appeared in the book of Maurice Halbwachs in 1925 [17]. He proposed a concept of a social group that shares a set of collective memories that exist beyond the memory of each member and affect the understanding of the past by this social group. Halbwachs’s hypothesis influenced a range of studies in sociology [2], [4], psychology [9], [12], cognitive sciences [10], and, only recently, in computer science [3], where the authors extract

collective memories using LDA [7], applied to a collection of news articles.

Despite the fact that Wikipedia is the largest encyclopedia of public knowledge ever created and the fifth most visited website in the world, studies on collective memory have considered the Wikipedia visitor activity data only recently. The idea of regarding Wikipedia as a global memory space was first introduced by Pentzold in 2009 [25]. It was followed by a range of collective memory studies based on various Wikipedia data archives.

Analyzing Wikipedia page views, Kanhabua et al. [21] proposed a collective memory model, investigating 5500 events from 11 categories. The authors proposed a remembering score based on a combination of time-series analysis and location information. The focus of the work is on four types of events: aviation accidents, earthquakes, hurricanes, and terrorist attacks. The work presents extensive experimental results; however, it is limited to particular types of collective memories.

Traumatic events such as attacks and bombings have also been investigated in [11], [10], based on the Wikipedia edit activity data. The authors investigate the difference between traumatic and non-traumatic events using natural language processing techniques. The study is limited to a certain type of event.

Another case study [14] focuses on memory-triggering patterns to understand collective memories. The collective memory model is inferred from the Wikipedia visitor activity. The work considers only the case of aircraft incidents reported in the English Wikipedia. The authors try to build a general mathematical model and explain the phenomenon of collective memory, extracted from Wikipedia, based on that single topic.

Popularity and celebrities represent another focus point of public interest. The Wikipedia hourly visits on the pages of celebrities were used to investigate the fame levels of tennis players [29]. The authors, though, did not tackle collective memories and aimed to quantify the relationship between the performance and popularity of the athletes.

To the best of our knowledge, we are the first to apply a content-addressable memory model to the Wikipedia viewership data for collective memory analysis. We build our collective memory model on the assumption that it is similar to the memory of an individual. For the first time, we demonstrate that the Wikipedia Web network and its dynamics can gain properties reminiscent of associative memory. Unlike previous works, our model is not limited to particular classes of events and does not require labeled data. We do not analyze the content of the pages and rely only on the collective viewership dynamics in the network, which makes the presented model tractable.

3 DATASET
We use the dataset described in [6]. This dataset is based on two Wikipedia SQL dumps: English language articles and user visit counts per page per hour. The original datasets are publicly available on the Wikimedia website [13].


Figure 1: a) Data structure built from Wikipedia data, with time-series associated to nodes of the Wikipedia hyperlink graph. b) Focus on the time-series signals (number of visits per hour) residing on the vertices of the graph.

The Wikipedia network of pages is first constructed using the data from article dumps that contain information about the references (edges) between the pages (nodes)¹. Time-series are then associated to each node (Fig. 1), corresponding to the visit history from 02:00, 23 September 2014 to 23:00, 30 April 2015.

The graph contains 116 016 Wikipedia pages (out of a total of 4 856 639 pages) with 6 573 475 hyperlinks. Most of the Wikipedia pages remain unvisited. Therefore, only the pages whose number of visits exceeds 500 at least once in the recording are kept. The time-series associated to the nodes have a length of T = 5278 hours.

Preprocessing. To identify potential core events of collective memories and further reduce the processing time, we select nodes that have bursts of visitor activity, i.e. spikes in the signal. This is done on a monthly basis for two reasons. Firstly, it reduces the processing and memory needs, as each month can be processed independently. Secondly, it allows for a study of the evolution of the collective memories on a monthly basis, by investigating the trained Hopfield networks (see next section).

For each month, we define the burstiness b of a page as the number and length of the peaks of visits over time. We denote by x_i the time-series giving the number of visits per hour for a node labeled i, during month m. To define a burst of visits we compute the mean µ and the standard deviation σ of x_i. We select values that are above nσ + µ, where n is a tunable activity rate parameter (n = 5 here). The burstiness of the page associated to node i is defined by

b = ∑_{t=0}^{T_m − 1} k[t],

where T_m is the length of the time-series for month m and

k[t] = { 1, if x_i[t] > nσ + µ,
       { 0, otherwise.                 (1)

For each month, we discard the pages that have a burstiness smaller than or equal to 5.
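As a sketch, the per-page burst detection of Eq. (1) can be written in a few lines of NumPy; the function name and the toy series below are ours, not from the paper:

```python
import numpy as np

def burstiness(x, n=5):
    """Burstiness b of one page for one month (Eq. 1): the number of
    hours t at which the visit count x[t] exceeds n*sigma + mu.
    n = 5 is the activity rate parameter used in the paper."""
    mu, sigma = x.mean(), x.std()
    k = x > n * sigma + mu        # binary burst indicator k[t]
    return int(k.sum())           # b = sum over t of k[t]

# A mostly flat series with one strong spike: only the spike hour is a burst.
x = np.array([10.0] * 100 + [1000.0])
b = burstiness(x)                 # pages with b <= 5 would be discarded
```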

4 COLLECTIVE MEMORY LEARNING
Our goal is to create a self-organized memory from the Wikipedia data. This memory is not a standard neural network but rather a concept network, a network made of pieces of information.

¹Note that Wikipedia is continuously updated. Some links that existed at the moment we made the dump may have been removed from current versions of the pages. To check consistency with past versions, one can use the dedicated search tool at http://wikipedia.ramselehof.de/wikiblame.php.

Indeed, the Wikipedia pages replace neurons in the network. Hence, it can be seen as a memory connecting high-level concepts. The self-organized structure is shaped as a Hopfield network with additional constraints, relating it to the concept of associative memory. The learning process follows the Hebbian rules, adapted to our particular dataset.

Hopfield network. Our approach is based on a Hopfield model of artificial memory [19]. A Hopfield network is a recurrent neural network serving as an associative memory system. It allows for storing and recalling patterns. Let N be the number of nodes in the Hopfield network. Starting from an initial partial memory pattern P_0 ∈ R^N, the action of recalling a learned pattern is done by the following iterative computation:

P_{j+1} = f_θ(W P_j),   (2)

where W is the weight matrix of the Hopfield network. The function f_θ : R^N → R^N is a nonlinear thresholding function (a step function giving values {−1, 1}) that binarizes the vector entries. The value θ is the threshold (the same for all entries). In our case, we build a network per month, so W is associated to a particular month. For each j ≥ 0, P_j is a matrix of (binarized) time-series where each row is associated to a node of the network and each column corresponds to an hour of the month considered. We stop the iteration when the iterative process has converged to a stable solution (P_{j+1} = P_j). Note that P_0 is a binary matrix, obtained from the time-series using the function defined in Eq. (1).
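The recall iteration of Eq. (2) can be sketched as follows. The toy weights use the classical outer-product storage rule, which is our illustrative choice rather than the paper's learned weights, and ties at the threshold are mapped to +1:

```python
import numpy as np

def recall(W, P0, theta=0.0, max_iter=100):
    """Iterate P_{j+1} = f_theta(W P_j) (Eq. 2) until a fixed point.
    f_theta is a step function mapping each entry to -1 or +1
    (ties at theta are mapped to +1 here)."""
    P = P0
    for _ in range(max_iter):
        P_next = np.where(W @ P >= theta, 1, -1)
        if np.array_equal(P_next, P):
            break
        P = P_next
    return P

# Toy 3-node memory storing one pattern p via the outer-product rule.
p = np.array([1, -1, 1])
W = np.outer(p, p) - np.eye(3)    # zero self-connections
cue = np.array([1, -1, -1])       # corrupted version of p
recovered = recall(W, cue)        # converges back to the stored pattern p
```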

Dimensionality reduction. The learning process has been modified in order to cope with a large amount of data. We recall that the number of pages has already been reduced by keeping only the ones with bursts of activity. However, this still leaves a large number of neurons and weights for a Hopfield network. Therefore, instead of training a fully connected network, we reduce the number of links in the following manner. We consider only hyperlink connections between pages, and we learn their associated weights. In this way, nodes in the network are linked by human-made connections, and the result can be seen as a sort of pre-trained, constrained network. This is a strong assumption, as no link in the Hopfield network can be created between pages that are not related by a hyperlink.

Hebbian learning. We use a synaptic-plasticity-inspired computational model [18] in the proposed graph learning algorithm to compute weights. The main idea is that a co-activation of two neurons results in the reinforcement of a connection (synapse) between them. We do not take the causality of activations into account. For two neurons i and j, their respective activities (numbers of visits) over time, x_i and x_j, are compared to determine whether they are co-active at each time step. We introduce the following similarity measure Sim{i, j, t} between nodes i and j at time t, where Sim{i, j, t} = 0 if x_i[t] = x_j[t] = 0 and otherwise:

Sim{i, j, t} = min(x_i[t], x_j[t]) / max(x_i[t], x_j[t]) ∈ [0, 1].   (3)

This function compares the ratio of the numbers of visits, putting more emphasis on pages receiving a similar amount of visits.


Figure 2: Weighted degree distribution in log-log scale. Linearity in log-log scale corresponds to power-law behavior P(k) ∼ k^−γ. Power-law exponent γ = 3.81 for the initial graph (blue), and γ = 2.85 for the learned graph (red). The decrease of the exponent in the learned graph indicates that the weight distribution becomes smoother and the number of greedy hubs drops with the learning.

For each time step t, the edge weight w_ij between i and j is updated by the following amount:

Δw_ij = { Sim{i, j, t}, if Sim{i, j, t} > λ,
        { 0, otherwise,                       (4)

where λ = 0.5 is a threshold parameter.

The computations are tractable because 1) they are local on the graph, i.e. weight updates depend on a node and its first-order neighborhood, 2) weight updates are iterative, and 3) a weight update occurs only between connected nodes and not among all possible combinations of nodes. These three facts allow us to build a distributed model to speed up computations. For this purpose, we use a graph-parallel Pregel-like abstraction, implemented in the GraphX framework [15], [28].
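A minimal sequential sketch of Eqs. (3)–(4) for a single edge (the distributed GraphX version applies the same update in parallel; the function names here are ours):

```python
def sim(xi_t, xj_t):
    """Similarity of two pages at one time step (Eq. 3);
    defined as 0 when both pages have no visits."""
    if xi_t == 0 and xj_t == 0:
        return 0.0
    return min(xi_t, xj_t) / max(xi_t, xj_t)

def learn_weight(xi, xj, lam=0.5):
    """Hebbian weight of one hyperlink edge (Eq. 4): each hour where the
    visit counts are similar enough (Sim > lambda) reinforces the edge."""
    return sum(s for s in (sim(a, b) for a, b in zip(xi, xj)) if s > lam)

xi = [100, 80, 0, 10]      # hourly visits of page i
xj = [90, 20, 0, 10]       # hourly visits of page j
w = learn_weight(xi, xj)   # 90/100 + 0 + 0 + 10/10 = 1.9
```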

5 GRAPH VISUALIZATION AND COMMUNITY DETECTION

The structure of each learned network shows patterns corresponding to communities of correlated nodes. We analyze all the monthly Hopfield networks, extract communities, and visualize them in the following sections. We use a heuristic method based on modularity optimization (the Louvain method) [8] for community detection. Colors in the graph depict and highlight the detected communities. A resolution parameter controls the size of the communities. To represent the graph in 2D space for visualization, we use a force-directed layout [20]. The resulting spatial layout reflects the organization of the network. An interactive version of all visualizations presented in this paper is available online [24].
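The paper uses the Louvain method [8] through its own tooling; as a hedged sketch, the same modularity-based detection is available in NetworkX (`louvain_communities`). The toy weighted graph below stands in for a learned monthly network, and the page names are purely illustrative:

```python
import networkx as nx

# Two tightly connected topical groups with learned edge weights (Eq. 4).
G = nx.Graph()
G.add_weighted_edges_from([
    ("Super Bowl", "Seattle Seahawks", 3.0),
    ("Super Bowl", "New England Patriots", 2.5),
    ("Seattle Seahawks", "New England Patriots", 2.0),
    ("Charlie Hebdo", "Paris", 4.0),
    ("Charlie Hebdo", "Je suis Charlie", 3.5),
    ("Paris", "Je suis Charlie", 1.5),
])
# Modularity-based (Louvain) community detection on the edge weights.
communities = nx.community.louvain_communities(G, weight="weight", seed=0)
```

On this toy graph the two topical groups come out as two communities; on the learned Wikipedia graphs the paper reports 172 communities in total.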

Figure 3: Edge weight distribution in the learned graph in log scale. The learned graph has a skewed heavy-tailed weight distribution.

6 EXPERIMENTS AND RESULTS
Using the approach presented in Section 4, we intend to learn collective memories and describe their properties. After the learning, we inspect the resulting networks on different time scales. We extract strongly connected clusters of nodes from these networks and discuss their features. We also investigate the temporal behavior of the memories.

A 7-month network. We first learn the graph using the 7-month Wikipedia activity dataset. We start from the initial graph of Wikipedia pages connected by hyperlinks, described in Section 3. The Hopfield network, the result of the learning process, is a network G = (V, E, W), where V is the set of Wikipedia pages, E is the set of references between the pages, and W is the set of weights, reflecting the similarity of activity patterns between articles.

After the learning, the majority of weights are 0 (6 297 977, or 95.8%). Only 275 498 edges (4.2%) have strictly positive weights. For a better visualization, we prune the edges that have zero weight. We also delete the disconnected nodes that remain after edge pruning and remove small disconnected components of the graph. The number of remaining nodes is 35 839 (31% of the initial number). Figure 4 shows snapshots of the initial Wikipedia graph before learning (a) and the learned graph after weight update and pruning (b).
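The pruning step described above can be sketched with a plain edge-weight dictionary. This is a simplified sketch of our own: it drops zero-weight edges and the nodes they leave disconnected, but omits the removal of small components:

```python
def prune(edges, nodes):
    """Keep only edges with strictly positive learned weight, then drop
    nodes that are left without any incident edge."""
    kept_edges = {e: w for e, w in edges.items() if w > 0}
    kept_nodes = {u for e in kept_edges for u in e} & set(nodes)
    return kept_edges, kept_nodes

# Toy learned weights on hyperlink edges; "E" has no edges at all.
edges = {("A", "B"): 1.9, ("B", "C"): 0.0, ("C", "D"): 0.7}
nodes = ["A", "B", "C", "D", "E"]
kept_edges, kept_nodes = prune(edges, nodes)
# ("B", "C") is pruned; "E" is deleted as a disconnected node
```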

The initial and learned Wikipedia graphs have statistically heterogeneous connectivity (Fig. 2) and weight patterns (Fig. 3) that correspond to skewed heavy-tailed distributions.

The degree distribution of the initial graph has a larger power-law exponent, γ = 3.81 (Fig. 2, blue), than the learned one (γ = 2.85). This shows that the initial Wikipedia graph is dominated by large hubs that attract most of the connections to numerous low-degree neighbors. These hubs correspond to general topics in the Wikipedia network. They often link broad topics such as, for instance, the “United States” page, with a large number of hyperlinks pointing to it from highly diverse subjects.


(a) Initial (b) Learned

Figure 4: Wikipedia graph of hyperlinks (left) and learned Hopfield network (right). Colors correspond to the detected communities. The learned graph is much more modular than the initial one, with a larger number of smaller communities. The layout is force-directed.

We have applied the community detection algorithm described in Section 5 to both graphs. The initial Wikipedia graph (Fig. 4) is dense and cluttered with a significant number of unused references, while the learned graph reveals smaller and more separated communities. This is confirmed by the community size distribution of the graphs (Fig. 5). The number of communities and their size change after learning. Initially, a small number of large communities dominates the graph (blue), while after the learning (red) we see a fivefold increase in the number of communities. Moreover, as a result of the learning, the size of the communities decreases by one order of magnitude. The modularity of the learned graph is 25% higher, giving the first evidence of the creation of associative structures.

The analysis of each community of nodes in the learned graph gives a glimpse of the events that occurred during the 7-month period. Each cluster is a group of pages related to a common topic such as a championship, a tournament, an awards ceremony, a world-level contest, an attack, an incident, or a popular festive event such as Halloween or Christmas. The graph contains the summary of events that occurred during the period of interest, hence its name of collective memory.

Before going deeper into the cluster analysis and the information the clusters contain, we investigate the evolution of the graph structure over time.

Monthly networks. To obtain a better, fine-grained view of the collective memories, we focus on a smaller time scale. Indeed, events attracting the attention of Wikipedia users for a period longer than a week or two are rare. Therefore, we split our dataset into months. Monthly graphs are smaller than the 7-month graph and contain 10 000 nodes on average. However, the properties and distributions of the monthly graphs are similar to those of the 7-month one, described above.

Short-term (monthly) graphs allow us to understand and visualize the dynamics of the memory formation in the long-term (7-month) graph. To give an example of the memory evolution, we discuss the cluster of the USA National Football League championship and the collective memories that are mostly related to the previous NFL

Figure 5: Community size distribution of the initial Wikipedia graph of hyperlinks (blue) and the learned Hopfield network (red). The total number of communities: 32 for the initial graph, 172 for the learned one.

seasons (Fig. 6). The NFL is one of the most popular sports leagues in the USA, and it triggers a lot of interest on Wikipedia. Thanks to the high number of visits on this topic, we were able to spot a cluster related to the NFL in each of the monthly graphs. Figure 6 shows information about the NFL clusters. The top part of the figure contains the learned graphs, for each month, with the NFL cluster highlighted in red. The final game of the 2014 season, Super Bowl XLIX, was played on February 1, 2015. This explains the increase in the cluster's size until February, when it reaches its maximum. The activity collapses after this event and the cluster disappears. For the sake of interpretability, we extracted 30 NFL team pages from the original cluster (485 pages) to show the details of the evolution in time in the table in Fig. 6. This fraction of the nodes reflects the overall dynamics of the entire cluster. Each row describes the hourly activity of a page, while the columns split the plot into months. The sum of visits for the selected pages is plotted as a red line at the bottom. The dynamics of the detected cluster reflects the real timeline of the NFL championship. The spiking nature of the overall activity corresponds to weekends, when most of the games were played. Closer to the end of the championship, the peaks become stronger, following the increasing interest of the fans. We see the highest spike of activity on 1 February, when the final game was played.

We want to emphasize that this cluster, as well as all the others, was obtained in a completely unsupervised manner. Football team pages were automatically connected together in a cluster sharing “NFL” as a common topic. Moreover, the cluster is not formed by one Wikipedia page and its direct neighbors; it involves many pages separated by several hops on the graph.

The NFL championship case is an example of a periodic (yearly) collective memory. Interest increases over the months until the expected final event. Accidents and incidents are a different type of event, as they appear suddenly, without prior activity. Our proposed learning method allows for the detection of these kinds of

Submitted to KDD’18, August 19–23, London, United Kingdom V. Miz et al.


Figure 6: Evolution of the National Football League 2014–2015 championship cluster. We show 30 NFL teams from the main cluster. Top: the monthly learned graph with the NFL cluster highlighted in red. Middle table: visitor activity per hour on the NFL teams' Wikipedia pages in greyscale (the more visits, the darker). Bottom: timeline, in red, of the overall visitor activity in the cluster.

collective memories as well. We provide examples of three accidents to demonstrate the behavior of collective memory in the case of an unexpected event.

We pick three core events among the 172 detected and discuss them to show the details of memory detection by our approach. Figure 7 shows the extracted clusters from the learned graph (top) and the overall timeline of the clusters' activity (bottom).

Charlie Hebdo shooting. 7 January 2015. This terrorist attack is an example of an unexpected event. The cluster emerged over a period of 72 hours following the attack. All pages in the cluster are related to the core event. Strikingly, a look at the titles of the pages is sufficient to get a precise summary of what the event is about. There is a sharp peak of activity on the first day of the attack, slowly decaying over the following week.

Germanwings Flight 9525 crash. 24 March 2015. This cluster not only involves pages describing the crash or providing more information about it, but also the pages of similar events that happened in the past. It includes, for example, a page enumerating airline crashes and the page of a previous crash that happened in December 2014, the Indonesia AirAsia Flight 8501 crash. As a result, the memory of the event is connected to the memory of the Flight 8501 crash, which is why we can see an increase of visits in December. This is an example where our associative memory approach connects not only pages but also events.

Ferguson unrest. Second wave. November 24, 2014 – December 2, 2014. This is an example of an event that has official beginning and end dates. A sharp increase in activity at the beginning of the protests highlights the main event. This moment triggers the emergence of the core cluster. We also see that the cluster becomes active once again at the end of the unrest.

Finally, we conclude our exploration of the clusters of the learned graphs by providing in Table 1 a list of handpicked page titles inside each cluster that refer to previous events and related subjects. The connected events occurred outside of the 7-month period we analyze. This illustrates the associative features of our method. Firstly, pages are grouped into events with the help of visitor activity. Secondly, the events themselves are connected together by this activity. Memory is made of groups of tightly connected concepts, and these groups are in turn connected together through the concepts they share.

Recalling memories. In this section, our goal is to test the hypothesis that the proposed method, as a memory, allows recalling events from partial information. We emulate recall processes using the Hopfield network approach described in Section 4. We show that the learned graph structure can recover a memory (a cluster of pages and its activations) from an incomplete input.

We create incomplete patterns for the Hopfield network by randomly selecting a few pages contained in a chosen cluster. We



(a) Germanwings 9525 crash (b) Ferguson unrest (c) Charlie Hebdo attack

Figure 7: Collective memory clusters (top) and overall activity of visitors in the clusters (bottom).

Table 1: Collective memories triggered by core events.

Charlie Hebdo attack              | Germanwings 9525 crash                 | Ferguson unrest
----------------------------------|----------------------------------------|----------------------------
Porte de Vincennes hostage crisis | Inex-Adria Aviopromet Flight 1308      | Shooting of Tamir Rice
Al-Qaeda                          | Pacific Southwest Airlines Flight 1771 | Shooting of Amadou Diallo
Islamic terrorism                 | SilkAir Flight 185                     | Sean Bell shooting incident
Hezbollah                         | Suicide by pilot                       | Shooting of Oscar Grant
2005 London bombings              | Aviation safety                        | 1992 Los Angeles riots
Anders Behring Breivik            | Air France Flight 296                  | O. J. Simpson murder case
Jihadism                          | Air France Flight 447                  | Shooting of Trayvon Martin
2015 Baga massacre                | Airbus                                 | Attack on Reginald Denny

built the input matrix P0 by setting all time series to −1 (the inactive state), except those of the few selected pages. We then apply Eq. (2) iteratively.
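The recall step can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's implementation: it uses the standard synchronous binary Hopfield update s ← sign(Ws) on ±1 states, with a Hebbian weight matrix standing in for the learned graph; `hopfield_recall` and the toy pattern are hypothetical.

```python
def hopfield_recall(W, pattern, n_iter=10):
    """Iterate the Hopfield update on a +/-1 pattern until it reaches
    a fixed point (a stored memory) or the iteration budget runs out."""
    s = list(pattern)
    for _ in range(n_iter):
        new_s = [1 if sum(W[i][j] * s[j] for j in range(len(s))) >= 0 else -1
                 for i in range(len(s))]
        if new_s == s:  # converged: the pattern is a fixed point
            return new_s
        s = new_s
    return s

# Toy memory: one stored pattern, weights from the Hebbian rule
# W[i][j] = p_i * p_j (zero diagonal).
stored = [1, 1, -1, -1]
n = len(stored)
W = [[0 if i == j else stored[i] * stored[j] for j in range(n)]
     for i in range(n)]

# Partial cue: the second page is muted (set to the inactive state -1),
# as in the experiment above; the network recovers the full memory.
print(hopfield_recall(W, [1, -1, -1, -1]))  # -> [1, 1, -1, -1]
```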

The results of the experiment are illustrated with an example in Figure 8. We selected the cluster associated with the Charlie Hebdo attack and kept a subset of its pages, here 80%. We applied the graph learned for the month of January, when the memory was detected. After the recall (Fig. 8, right), most of the cluster is recovered at the correct position in time. Note that the model forgets a part of the activity, plotted in light red. This missing part is made of pages that are active outside of the time of the event, evidence that they are not directly related (or only weakly related) to the event.

To evaluate the performance of the remembering, we mute different fractions of nodes in memory clusters and compute recall errors. We define the recall accuracy A as the ratio of correctly recalled activations over the number of initial activations. The error is given by 1 − A. Figure 10 shows the results of the evaluation. We consider three types of errors. First, we measure the accuracy of a recalled pattern over 7 months (green). Second, the accuracy of a recalled pattern over a period of 72 hours after the start of the event

(blue). And third, a relaxed accuracy of the recall during the same period of time, counting a recovery as correct if an initially active node is active at least once in the recalled pattern (red). As expected, the quality of the recalled memory increases with the quality of the input pattern. The better performance of the recall in the 72-hour zone is due to the memory effect, which focuses on the most active part of the cluster and forgets the small isolated activity spikes of individual pages scattered over the 7 months.
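The strict and relaxed measures can be made concrete with a small sketch (hypothetical helper names; patterns are rows of ±1 values, one row per page, one column per hour):

```python
def strict_accuracy(original, recalled):
    """Fraction of initially active (page, hour) entries that are active
    at the same position in the recalled pattern (A; the error is 1 - A)."""
    active = [(i, t) for i, row in enumerate(original)
              for t, v in enumerate(row) if v == 1]
    hits = sum(1 for i, t in active if recalled[i][t] == 1)
    return hits / len(active)

def relaxed_accuracy(original, recalled):
    """Relaxed variant: an initially active page counts as recovered
    if it is active at least once in the recalled pattern."""
    active_pages = [i for i, row in enumerate(original) if 1 in row]
    hits = sum(1 for i in active_pages if 1 in recalled[i])
    return hits / len(active_pages)

# Two pages over four hours; the recall shifts one activation in time.
original = [[1, 1, -1, -1], [-1, 1, 1, -1]]
recalled = [[1, 1, -1, -1], [-1, -1, 1, 1]]
print(strict_accuracy(original, recalled))   # -> 0.75 (3 of 4 in place)
print(relaxed_accuracy(original, recalled))  # -> 1.0 (both pages reappear)
```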

In Figure 9, we show the results of the recall process for each monthly learned graph when the entire time series are given as input. The output of the graph is summed up to create a global curve of activity over the 7 months. The initial global activity is subtracted from the output curve to obtain the final curve plotted in the figure. Red areas on the plot correspond to low activity in the output, hence an absence of memory in the zone. Green areas represent positive recalls. The memory of each monthly graph corresponds to its learning period. This demonstrates the global effectiveness of the learning and recall process.
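The construction of these curves can be sketched as follows (hypothetical names, 0/1 activity counts for simplicity): recalled activity is summed over pages per time step, and the initial global activity is subtracted, so positive values mark recalled memory and negative values forgotten activity.

```python
def recall_curve(initial, recalled):
    """Per-time-step difference between summed recalled activity and
    summed initial activity (positive = memory recalled in that zone)."""
    sum_init = [sum(col) for col in zip(*initial)]
    sum_rec = [sum(col) for col in zip(*recalled)]
    return [r - i for r, i in zip(sum_rec, sum_init)]

# Two pages over three time steps: the recall adds one activation
# at step 1, so the curve is positive there.
initial = [[0, 1, 1], [0, 0, 1]]
recalled = [[0, 1, 1], [0, 1, 1]]
print(recall_curve(initial, recalled))  # -> [0, 1, 0]
```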



Figure 8: Recall of an event from a partial pattern (Charlie Hebdo attack). The red vertical lines mark the start of the event and its most active part, ending 72 hours after the start. Left: full activity over time of the pages in the cluster. Middle: pattern with 20% of it set inactive. Right: result of the recall using the Hopfield network model of associative memory. The difference with the original pattern (the forgotten activity) is shown in light red.


Figure 9: Recalled activity over time for each monthly learned graph.

Figure 10: Recall error rate as a function of the fraction of muted nodes (%), for three different error measures.

7 CONCLUSIONS AND FUTURE WORK

In this paper, we propose a new method that allows learning and remembering collective memories in an unsupervised manner. To extract memories, the method analyzes the Wikipedia Web network and the hourly viewership history of its articles. The collective memories are summaries of the events that raised the interest of Wikipedia visitors. The approach performs efficient knowledge retrieval from the Wikipedia encyclopedia, highlighting the changing interests of visitors over time and shedding some light on their collective behavior.

This approach is able to handle large-scale datasets. We have observed experimentally that the method is highly robust to parameter tuning. This will be investigated further in future work.

This work opens new avenues for dynamic graph-structured data analysis. For example, the proposed approach could be used

in a framework for automated event detection, monitoring, and filtering in network-structured visitor activity streams. The resulting framework is also of interest for understanding human memory and simulating artificial memory.

8 TOOLS, IMPLEMENTATION, CODE AND ONLINE VISUALIZATIONS

All learned graphs (overall September–April, monthly activity, and localized events) are available online [24] to foster further exploration and analysis. For graph visualization we used the open-source software package Gephi [5] and the ForceAtlas2 layout [20]. We used Apache Spark GraphX [28], [15] for the graph learning implementation and the graph analysis. The presented results can be reproduced using the code for Wikipedia dynamic graph learning, written in Scala [23]. The dataset is available on Zenodo [22].

ACKNOWLEDGMENTS

We would like to thank Michaël Defferrard and Andreas Loukas for fruitful discussions and useful suggestions.

The research leading to these results has received funding from the European Union's H2020 Framework Programme (H2020-MSCA-ITN-2014) under grant agreement no. 642685 MacSeNet.

REFERENCES

[1] DA Allport. 1985. Distributed memory, modular subsystems and dysphasia. Current Perspectives in Dysphasia (1985), 32–60.

[2] Jan Assmann and John Czaplicka. 1995. Collective memory and cultural identity. New German Critique 65 (1995), 125–133.

[3] Ching-man Au Yeung and Adam Jatowt. 2011. Studying how the past is remembered: towards computational history through large scale text mining. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 1231–1240.

[4] Jeffrey Andrew Barash. 2016. Collective Memory and the Historical Past. University of Chicago Press.

[5] Mathieu Bastian, Sebastien Heymann, Mathieu Jacomy, et al. 2009. Gephi: an open source software for exploring and manipulating networks. ICWSM 8 (2009), 361–362.

[6] Kirell Maël Benzi. 2017. From recommender systems to spatio-temporal dynamics with network science. EPFL, Chapter 5, 97–98.

[7] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.

[8] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, 10 (2008), P10008.

[9] Alin Coman, Adam D Brown, Jonathan Koppel, and William Hirst. 2009. Collective memory from a psychological perspective. International Journal of Politics, Culture, and Society 22, 2 (2009), 125–141.

[10] Michela Ferron. 2012. Collective memories in Wikipedia. Ph.D. Dissertation. University of Trento.

[11] Michela Ferron and Paolo Massa. 2011. Studying collective memories in Wikipedia. Journal of Social Theory 3, 4 (2011), 449–466.

[12] Michela Ferron and Paolo Massa. 2012. Psychological processes underlying Wikipedia representations of natural and manmade disasters. In Proceedings of the Eighth Annual International Symposium on Wikis and Open Collaboration. ACM, 2.

[13] Wikimedia Foundation. 2016. Wikimedia Downloads: Database tables as sql.gz and content as XML files. (May 2016). https://dumps.wikimedia.org/other/ Accessed: 2016-May-16.

[14] Ruth García-Gavilanes, Anders Mollgaard, Milena Tsvetkova, and Taha Yasseri. 2017. The memory remains: Understanding collective memory in the digital age. Science Advances 3, 4 (2017), e1602368.

[15] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J Franklin, and Ion Stoica. 2014. GraphX: Graph Processing in a Distributed Dataflow Framework. In OSDI, Vol. 14. 599–613.

[16] David Graus, Daan Odijk, and Maarten de Rijke. 2017. The birth of collective memories: Analyzing emerging entities in text streams. arXiv preprint arXiv:1701.04039 (2017).

[17] Maurice Halbwachs. 2013. Les cadres sociaux de la mémoire. Albin Michel.

[18] Donald Olding Hebb. 2005. The Organization of Behavior: A Neuropsychological Theory. Psychology Press.

[19] John J Hopfield. 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79, 8 (1982), 2554–2558.

[20] Mathieu Jacomy, Tommaso Venturini, Sebastien Heymann, and Mathieu Bastian. 2014. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PLoS ONE 9, 6 (2014), e98679.

[21] Nattiya Kanhabua, Tu Ngoc Nguyen, and Claudia Niederée. 2014. What triggers human remembering of events? A large-scale analysis of catalysts for collective memory in Wikipedia. In Digital Libraries (JCDL), 2014 IEEE/ACM Joint Conference on. IEEE, 341–350.

[22] Benzi Kirell, Miz Volodymyr, Ricaud Benjamin, and Vandergheynst Pierre. 2017. Wikipedia time-series graph. (Sept. 2017). https://doi.org/10.5281/zenodo.886484

[23] V. Miz. 2017. Wikipedia graph mining. GraphX implementation. (2017). https://github.com/mizvol/WikiBrain

[24] V. Miz. 2017. Wikipedia graph mining. Visualization. (2017). https://goo.gl/Xgb7QG

[25] Christian Pentzold. 2009. Fixing the floating gap: The online encyclopaedia Wikipedia as a global memory place. Memory Studies 2, 2 (2009), 255–272.

[26] Arthur A Stone, Christine A Bachrach, Jared B Jobe, Howard S Kurtzman, and Virginia S Cain. 1999. The Science of Self-report: Implications for Research and Practice. Psychology Press.

[27] Ramine Tinati, Markus Luczak-Roesch, and Wendy Hall. 2016. Finding Structure in Wikipedia Edit Activity: An Information Cascade Approach. In Proceedings of the 25th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 1007–1012.

[28] Reynold S Xin, Joseph E Gonzalez, Michael J Franklin, and Ion Stoica. 2013. GraphX: A resilient distributed graph system on Spark. In First International Workshop on Graph Data Management Experiences and Systems. ACM, 2.

[29] Burcu Yucesoy and Albert-László Barabási. 2016. Untangling performance from success. EPJ Data Science 5, 1 (2016), 17.