11
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 252–262 Florence, Italy, July 28 - August 2, 2019. c 2019 Association for Computational Linguistics 252 Spatial Aggregation Facilitates Discovery of Spatial Topics Aniruddha Maiti Temple University Philadelphia, PA-19122, USA [email protected] Slobodan Vucetic Temple University Philadelphia, PA-19122, USA [email protected] Abstract Spatial aggregation refers to merging of doc- uments created at the same spatial location. We show that by spatial aggregation of a large collection of documents and applying a tra- ditional topic discovery algorithm on the ag- gregated data we can efficiently discover spa- tially distinct topics. By looking at topic dis- covery through matrix factorization lenses we show that spatial aggregation allows low rank approximation of the original document-word matrix, in which spatially distinct topics are preserved and non-spatial topics are aggre- gated into a single topic. Our experiments on synthetic data confirm this observation. Our experiments on 4.7 million tweets collected during the Sandy Hurricane in 2012 show that spatial and temporal aggregation allows rapid discovery of relevant spatial and temporal top- ics during that period. Our work indicates that different forms of document aggregation might be effective in rapid discovery of vari- ous types of distinct topics from large collec- tions of documents. 1 Introduction Social microblogging sites such as Twitter gener- ate large volumes of short documents through the activity of hundreds of millions of users around the world. This provides an unprecedented ac- cess to the pulse of the global society. Due to the sheer volume and diversity of the generated con- tent, topic discovery has been an invaluable tool in an effort to make sense of this data. Regardless of a precise definition of a topic and a particular topic model, topics discovery is used to describe per- tinent themes in a document corpus and serve to identify events, trends, and interests at the global, local, or a social group level. Among the most popular topic modeling tech- niques are Latent Dirichlet Allocation (LDA), La- tent Semantic Analysis (LSA), and Non-negative Matrix factorization (NMF). When applying those techniques for topic discovery from microblogs, there are three main challenges: (1) how to im- prove computational speed, (2) how to extract use- ful topics, and (3) how to deal with short texts. Many papers were published that address one or more of these challenges and most of them pro- pose to modify the original topic models. In this paper, we are focusing on aggregation (also referred to as pooling)(Alvarez-Melis and Saveski, 2016)(Hong and Davison, 2010)(Weng et al., 2010)(Steinskog et al., 2017), a particular document preprocessing technique that has been empirically shown to be useful for topic discov- ery from microblogs. The main idea of aggrega- tion is to combine multiple documents into a sin- gle document according to some external criterion and to apply a topic discovery algorithm on the ag- gregated documents. The earliest mentions of ag- gregation (Mehrotra et al., 2013)(Hong and Davi- son, 2010)(Weng et al., 2010) are motivated by the difficulty when applying NMF and LDA to very short text documents (Hong and Davison, 2010). This difficulty in finding useful topics is often at- tributed to the sparseness of the document-word matrix (Yan et al., 2013)(Cheng et al., 2014), which fails to provide confident counts of word co- occurrence and information about the shared con- text (Phan et al., 2008). Microblogs often come with metadata such as hashtags, author name, time stamp, or location. By aggregating the mi- croblogs according to such metadata, the intuition is that the resulting aggregated documents contain a sufficient number of words for topic modeling schemes to identify meaningful topics. In addi- tion, the authors of those early papers observe that aggregating microblogs that are similar in some sense (semantically, temporally) enriches the con- tent present in a single document and results in better topics (Mehrotra et al., 2013). Finally, due

Spatial Aggregation Facilitates Discovery of Spatial Topicsvucetic/documents/maiti2019acl.pdf · Spatial aggregation refers to merging of doc-uments created at the same spatial location

  • Upload
    others

  • View
    15

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Spatial Aggregation Facilitates Discovery of Spatial Topicsvucetic/documents/maiti2019acl.pdf · Spatial aggregation refers to merging of doc-uments created at the same spatial location

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 252–262Florence, Italy, July 28 - August 2, 2019. c©2019 Association for Computational Linguistics

252

Spatial Aggregation Facilitates Discovery of Spatial Topics

Aniruddha MaitiTemple University

Philadelphia, PA-19122, [email protected]

Slobodan VuceticTemple University

Philadelphia, PA-19122, [email protected]

Abstract

Spatial aggregation refers to merging of doc-uments created at the same spatial location.We show that by spatial aggregation of a largecollection of documents and applying a tra-ditional topic discovery algorithm on the ag-gregated data we can efficiently discover spa-tially distinct topics. By looking at topic dis-covery through matrix factorization lenses weshow that spatial aggregation allows low rankapproximation of the original document-wordmatrix, in which spatially distinct topics arepreserved and non-spatial topics are aggre-gated into a single topic. Our experiments onsynthetic data confirm this observation. Ourexperiments on 4.7 million tweets collectedduring the Sandy Hurricane in 2012 show thatspatial and temporal aggregation allows rapiddiscovery of relevant spatial and temporal top-ics during that period. Our work indicatesthat different forms of document aggregationmight be effective in rapid discovery of vari-ous types of distinct topics from large collec-tions of documents.

1 Introduction

Social microblogging sites such as Twitter gener-ate large volumes of short documents through theactivity of hundreds of millions of users aroundthe world. This provides an unprecedented ac-cess to the pulse of the global society. Due to thesheer volume and diversity of the generated con-tent, topic discovery has been an invaluable tool inan effort to make sense of this data. Regardless ofa precise definition of a topic and a particular topicmodel, topics discovery is used to describe per-tinent themes in a document corpus and serve toidentify events, trends, and interests at the global,local, or a social group level.

Among the most popular topic modeling tech-niques are Latent Dirichlet Allocation (LDA), La-tent Semantic Analysis (LSA), and Non-negative

Matrix factorization (NMF). When applying thosetechniques for topic discovery from microblogs,there are three main challenges: (1) how to im-prove computational speed, (2) how to extract use-ful topics, and (3) how to deal with short texts.Many papers were published that address one ormore of these challenges and most of them pro-pose to modify the original topic models.

In this paper, we are focusing on aggregation(also referred to as pooling) (Alvarez-Melis andSaveski, 2016) (Hong and Davison, 2010) (Wenget al., 2010) (Steinskog et al., 2017), a particulardocument preprocessing technique that has beenempirically shown to be useful for topic discov-ery from microblogs. The main idea of aggrega-tion is to combine multiple documents into a sin-gle document according to some external criterionand to apply a topic discovery algorithm on the ag-gregated documents. The earliest mentions of ag-gregation (Mehrotra et al., 2013) (Hong and Davi-son, 2010)(Weng et al., 2010) are motivated by thedifficulty when applying NMF and LDA to veryshort text documents (Hong and Davison, 2010).This difficulty in finding useful topics is often at-tributed to the sparseness of the document-wordmatrix (Yan et al., 2013) (Cheng et al., 2014),which fails to provide confident counts of word co-occurrence and information about the shared con-text (Phan et al., 2008). Microblogs often comewith metadata such as hashtags, author name,time stamp, or location. By aggregating the mi-croblogs according to such metadata, the intuitionis that the resulting aggregated documents containa sufficient number of words for topic modelingschemes to identify meaningful topics. In addi-tion, the authors of those early papers observe thataggregating microblogs that are similar in somesense (semantically, temporally) enriches the con-tent present in a single document and results inbetter topics (Mehrotra et al., 2013). Finally, due

Page 2: Spatial Aggregation Facilitates Discovery of Spatial Topicsvucetic/documents/maiti2019acl.pdf · Spatial aggregation refers to merging of doc-uments created at the same spatial location

253

(a) Common Topic (b) Distinct Topic

Figure 1: Examples of common and distinct topics. a)Common topic: a work-related topic. b) Distinct tem-poral topic: presidential debate

to reduction in a number of documents, aggrega-tion also leads to computational savings.

While aggregation has received interest in theresearch community and there are several empir-ical studies illustrating its benefits, we are notaware of a study that manages to provide, beyondbrief intuitive arguments, an insight into why ag-gregation works and what are its advantages andlimitations. In this paper we attempt to providesuch an insight from the perspective of discover-ing spatially specific topics. As will be evident,our insights extend to other means of aggregation.

Our argument will be given in the context ofmatrix factorization, where a document-word ma-trix X is represented as a product W · H , wherej-th row of matrix H represents word distributionin j-th topic and i-th row of matrixW represents adistribution of topics in i-th document. We adoptthe terminology from (Kim et al., 2015), whichdistinguishes between common and distinct topics(see Figure 1), where distribution of common top-ics within the corpus is not impacted by the ag-gregation metadata such as location, time, or au-thor of a microblog, and distribution of distincttopics is correlated with the metadata. We showthat factorization of the aggregated matrixXa, ob-tained by merging documents based on metadata(e.g., location), allows its low rank approximationas Wa · Ha, where the resulting topic matrix Ha

retains the distinct topics fromH (e.g., spatial top-ics) and where the common topics from H aremerged into a single topic in Ha. We will showempirical results confirming this observation bothon synthetic and real-life data. In particular, wewill demonstrate this behavior in case of spatialand temporal aggregation.

The main contribution of this paper is in demon-strating that applying standard topic discovery al-gorithms such as NMF and LDA on aggregateddocuments results in discovery of topics relatedto the aggregation method. Moreover, since the

aggregated matrix Xa can be orders of magnitudesmaller than the original matrix X , the computa-tional cost can also be reduced by orders of mag-nitude. Finally, as observed in the previous work,aggregation also alleviates the problem of sparsitywhen discovering topics in microblogs.

2 Related Work

Topic modeling from microblogs has a vastamount of literature (Steiger et al., 2015). Earlywork includes using NMF on term correlation ma-trix (Yan et al., 2013) and ncut-weighted NMF(Yan et al., 2012). Recent work includes NMijF(Nugroho et al., 2017), which takes into accounttweet-to-tweet interactions. Location recommen-dation model based on topic modeling was pro-posed in (Hu et al., 2013). NMF is used in Dis-cNMF (Kim et al., 2015) and STExNMF (Shinet al., 2017) to identify spatio-temporal topics.Pairfac (Wen et al., 2016) employs tensor decom-position accounting for location, time, and venue.In TopicOnTiles (Choi et al., 2018), the entirespace-time is divided into small tiles and NMFis performed on each tile separately. LDA (Bleiet al., 2003) has also been used for topic detec-tion. In (Zhao et al., 2011), LDA is used to catego-rize and summarize tweets. In (Weng et al., 2010),LDA is used to find influential users in Twitter.

Traditional topic modeling techniques such asLDA, LSA, and NMF are sensitive to sparsity(Hong and Davison, 2010). Different types ofdocument aggregation schemes have been sug-gested to overcome this issue (Alvarez-Melis andSaveski, 2016). One example of an aggregationscheme is the author-topic model (Weng et al.,2010), in which multiple tweets from the sameuser are aggregated to construct documents rep-resentative of the user. In (Hong and Davison,2010), it was observed that document aggrega-tion endows the resulting dataset with interestingproperties, where aggregation based on authorshas been reported to produce topics which are dif-ferent from topics discovered on non-aggregateddataset. User level aggregation was also found tobe useful in related papers (Giorgi et al., 2018).Similar results were also observed for aggregationbased on hashtags (Steinskog et al., 2017). Thesepapers did not attempt to explain the mechanismbehind changes in the discovered topics and this iswhere our current paper makes a contribution.

Page 3: Spatial Aggregation Facilitates Discovery of Spatial Topicsvucetic/documents/maiti2019acl.pdf · Spatial aggregation refers to merging of doc-uments created at the same spatial location

254

3 Methodology

3.1 Problem SetupLet us assume we are given a corpus of documentsD = {d1, d2, d3, .., dN}, where N is the totalnumber of documents. Let V be the vocabularyof unique words in the corpus. By using the bagof words representation, the corpus can be repre-sented by a document-word matrix X of dimen-sion N × V , where element Xi,j is the count ofj-th word in i-th document. We will also assumethat each document di is associated with a timestamp t(di) ∈ 1, ...T , where T is the number oftime steps, and location l(di) ∈ 1, ...L, where L isthe number of locations.

We will make an assumption that there are Ktopics t1, ...tK , where topic tk defines probabil-ity that word wj will be generated by the topicas p(wj |tk), and that each document in a corpusis represented by a single topic. Our simplifyingassumption that each document is generated by asingle topic is acceptable when dealing with shortdocuments such as microblogs. In addition, it willmake it easier to describe the main effect of docu-ment aggregation.

Among the K topics, we will assume that thefirst Kd are spatially distinct topics and the sec-ond Kc are common topics. For common topics,the probability or their occurrence does not de-pend on location of the document. In other words,p(tk|l) = p(tk), where l is location. Conversely,for spatially distinct topics, the probability of theiroccurrence is dependent on the document location.We illustrate such a setup in Figure 2, where thereare 4 spatially distinct topics generated within 4different circular regions and 2 common topics oc-curring equally likely over the whole square re-gion. In this example, the probability that a dis-tinct topic is generated within its assigned circle isconstant and is zero outside.

GivenD, the objective is to find the distinct top-ics. In the following we will argue that documentaggregation enables computationally efficient dis-covery of the distinct topics.

3.2 Effect of Spatial Aggregation on RankIn this section we will explain why spatial aggre-gation of documents facilitates discovery of spa-tially distinct topics. If we select a subset Xk ofall documents from X generated by topic tk, thebest rank-1 approximation of Xk is proportionalto nk · hk, where nk is a column vector of length

N whose i-th element is the sum of all words ini-th document and hk is a row vector of lengthV whose j-th element hkj equals p(wj |tk). Letus denote this rank-1 approximation as Xk

1. Ifwe sort the document-word matrix X by topics,we can approximate it by vertically concatenatingrank-1 matrices Xk

1. The rank of the resultingmatrix X1 will be less than or equal to J .

We observe that the rank of matrix X can be ashigh as V >> J and that matrix factorization ofX into product W · H cannot guarantee success-ful topic discovery. On the other hand, we observethat factorization of X1 can easily result in dis-covery of the underlying J topics. Unfortunately,generating matrix X1 is as difficult as the topicdiscovery problem itself. We argue in the follow-ing that aggregation based on location results ingeneration of a matrix closely related to X1. Assuch, we demonstrate that spatial aggregation isvery useful for discovery of spatially distinct top-ics.

Let us define binary matrix Q with L rows andN columns as spatial aggregation matrix whichmerges the N original documents into L aggre-gated documents, where Ql,i = 1 if document xibelongs to l-th location and Ql,i = 0 otherwise.We construct the aggregated document-word ma-trix of size L × V as Xa = Q ·X . The expectedvalue of l-th row of matrix Xa equals:

E(Xal ) =

∑k(nlk · hk), (1)

where, nlk is a scalar equal to the number of wordsgenerated from topic tk in documents from l-th lo-cation and hk is a row vector defined in the firstparagraph of this subsection. If the number of doc-uments at l-th location is large, the observed Xa

l

will be close to E(Xal ). Since based on equation

(1) each row ofXa can be approximated as the lin-ear combination of K topic vectors hk, it followsthat matrixXa is approximately of rankK or less.We can thus closely approximate Xa as productW a ·Ha, where k-th row of matrix Ha equals hkand (l, k)-th element of matrix W a equals nlk.

We will now show that W a ·Ha has rank lowerthan K. Since the Kc common topics are assumedto be location independent, the number of docu-ments generated by k-th common topic is approx-imately the same at every location. Thus, we canapproximate nlk = nk for each of theKc commontopics. Therefore, the last Kc columns of matrixW a are constant. As a result, the rank of matrix

Page 4: Spatial Aggregation Facilitates Discovery of Spatial Topicsvucetic/documents/maiti2019acl.pdf · Spatial aggregation refers to merging of doc-uments created at the same spatial location

255

W a · Ha is Kd + 1 or less, where the Kc com-mon topics increase the rank by only one. As aresult, we can replace the last Kc columns of W a

with a single column equal to the sum of the lastKc columns of W a and replace the last Kc rowsof Ha with a single row equal to the sum of thelast Kc rows of Ha. The resulting topic matrixHa is of dimension (Kd + 1)× V , where the lastrow is a sum of word probabilities over all com-mon topics, while the first Kd rows are reservedfor each of the Kd spatially distinct topics. Thisis a significant result showing that spatial aggrega-tion facilitates discovery of spatially distinct topicswhile it collapses all documents generated by thecommon topics into a matrix that can be closelyapproximated by a rank-1 matrix.

3.3 NMF and LDA on Aggregated DataIn the previous section we did not specify aparticular algorithm for matrix factorization andtopic discovery. NMF is a popular matrix fac-torization algorithm for nonnegative matrices suchas document-word matrices. NMF finds non-negative and sparse matrices W and H whoseproduct approximates the original matrix. It solvesthe following optimization problem:

F (W,H) =1

2||X−W ·H||2Fro+α ·ρ · ||W ||1+

α · ρ · ||H||1 +1

2α(1− ρ) · ||W ||2Fro+

1

2α(1− ρ) · ||H||2Fro. (2)

Here, the Frobenius norm of a matrixA is denotedby ||A||Fro and α and ρ are regularization parame-ters. Rows ofW of sizeN×K represent the topicmixture within a particular document where K isthe number of topics. Rows of H of size K × Vrepresent the word distribution within a particu-lar topic. The NMF optimization problem is typi-cally solved iteratively and the algorithm becomesexpensive for large data sets. NMF is also sensi-tive on collections of short documents such as mi-croblogs. NMF favors commonly occurring top-ics and commonly ocurring words, which makesfinding rare spatially distinct topics very difficult.Document aggregation based on metadata suchas location directly addresses the aforementionedNMF issues.

The arguments in the previous sections demon-strate the benefit of aggregation through matrix

Figure 2: Spatially distinct topics on simulated data

factorization. However, our assumptions made in3.1 closely resemble the generating process usedin LDA, where each document is a mixture overlatent topics, and each topic is characterized by adistribution over words. From the corpus, LDAlearns the topic distribution over documents andword distribution over topics. While, in theory,LDA should be able to discover topics directlyfrom the original matrix X , it suffers from thesame shortcomings as NMF: it is slow, fragile, andsensitive to sparse documents. As will be demon-strated in the experiments, document aggregationhas very similar effects on both NMF and LDA.

To summarize, the resulting distinct topicdiscovery procedure has the following steps:

1. Construct document-word matrix X .2. Construct spatial aggregation matrix Q frommetadata.3. Perform NMF on aggregated matrix Q ·X tofind spatially distinct topics.If we wish to identify spatial-temporal topics,

we may additionally aggregate the data based ontime. First, the entire time span can be dividedinto smaller intervals. Then, all documents in eachspace-time cell are aggregated into a single docu-ment. Although we do not show it in our experi-ments, our major insight about the effect of docu-ment aggregation extends to other forms of aggre-gation such as author- or hashtag-based.

4 Experiments on Simulated Dataset

In this section, we use synthetic data to study theeffect of document aggregation on topic discovery.

Following the setup provided in Section 3.1,we created a dataset using a simplistic generativemodel. Words in each document in the dataset aregenerated from two common topics (C1 and C2)

Page 5: Spatial Aggregation Facilitates Discovery of Spatial Topicsvucetic/documents/maiti2019acl.pdf · Spatial aggregation refers to merging of doc-uments created at the same spatial location

256

1 2 3 4 5NMF on original dataset

0

20

40

60

80

100

120

Coun

t of w

ords in

NMF

Top

ics

C1 C2 D1 D2 D3 D4

1 2 3 4 5NMF on aggregated dataset

0

20

40

60

80

100

120

Coun

t of w

ords in

NMF

Top

ics

C1 C2 D1 D2 D3 D4

Figure 3: Five topics discovered by NMF on non-aggregated and aggregated documents

and four spatially distinct topics (D1, D2, D3 andD4). Each common and distinct topic uses a vo-cabulary with 100 words. Each document is as-sociated with a single topic. To generate a doc-ument, a topic is chosen first, then 10 words aresampled randomly from the 100 words associatedwith that particular topic. Documents generatedfrom the common topics are distributed randomlywithin the square. For each distinct topic, a cir-cular region is defined within the square and thedocuments associated with that topic are placed byuniformly sampling within the circle. The place-ment of the circular regions is shown in Figure-2. A total of 10, 000 documents are generated foreach common topic and 1, 000 documents for eachspatially distinct topic. We call this dataset thenon-aggregated dataset. To demonstrate how ag-gregation affects the topic discovery, we dividedthe entire region in 4 × 4 small squares. Then wemerged all the documents from each small squareinto a single aggregated document. In this way,we constructed 16 aggregated documents. We callthis dataset the aggregated dataset.

NMF set to find 5 topics was applied to the non-aggregated and the aggregated datasets. In Figure3, we show the distribution of words in each of the5 identified topic. For example, the first bin in theleft subplot shows that discovered topic 1 has 91unique words, all belonging to common topic C1.On the other hand, the first bin in the right sub-plot shows that discovered topic 1 has 100 uniquewords, 38 belonging to common topic C1 and 58to common topic C2. We can see that none of thespatially distinct topics are discovered when weapply NMF on the non-aggregated data. All fiveidentified topics contain words from the 2 com-mon topics. On the other hand, in the aggregateddataset, the first identified topic contains a mixtureof words from the 2 common topics, while the re-maining 4 are almost entirely comprised of wordsfrom the 4 spatially distinct topics. This result ex-

1 2 3 4 5 6 7 8 9 10NMF on original dataset

0

20

40

60

80

100

Count of words in NMF Topics C1 C2 D1 D2 D3 D4

1 2 3 4 5 6 7 8 9 10NMF on aggregated dataset

0

20

40

60

80

100

120

Count of words in NMF Topics C1 C2 D1 D2 D3 D4

Figure 4: Ten topics identified by NMF on non-aggregated and aggregated documents

1 2 3 4 5NMF on original dataset

0

20

40

60

80

100

120

Coun

t of w

ords in

NMF To

pics C1 C2 D1 D2 D3 D4

1 2 3 4 5NMF on aggregated dataset

0

20

40

60

80

100

120

Coun

t of w

ords in

NMF To

pics C1 C2 D1 D2 D3 D4

Figure 5: Five topics identified by NMF on original andaggregated data using a smaller set of documents

perimentally supports our insight about the impactof spatial aggregation presented in section 3.2.

4.1 Effect of Number of Topics in NMF

We repeated the NMF experiment, but this time weset the number of NMF topics to 10. We can seefrom Figure 4 that all 10 topics found on the non-aggregated data are still one of the two commontopics. On the other hand, after applying NMF onthe aggregated data, 4 of the discovered topics di-rectly correspond to the 4 spatially distinct topics,while the remaining 6 discovered topics are a mix-ture of the 2 common topics.

4.2 Effect of Number of Documents

We repeated the experiments on a smaller corpusto see its effect on topic discovery. We generated1, 000 documents for each common topic and 150documents for each distinct topic. The result issummarized in Figure 5. As compared to Figure3, we can see a slight deterioration of the qualityof discovered spatially distinct topics from the ag-gregated data. In particular, all of the 4 discoveredspatial topics are corrupted with more words fromthe common topics, which is particularly visiblefrom the rightmost bin containing and an almostequal mixture of words from topics D1, C1, andC2. We observe that topic D1 corresponds to thelargest circle.

4.3 Effect of Grid Density

We repeated the previous experiment on thesmaller dataset with 1, 000 documents for each

Page 6: Spatial Aggregation Facilitates Discovery of Spatial Topicsvucetic/documents/maiti2019acl.pdf · Spatial aggregation refers to merging of doc-uments created at the same spatial location

257

1 2 3 4 5NMF on original dataset

0

20

40

60

80

100

120

Coun

t of w

ords in

NMF To

pics C1 C2 D1 D2 D3 D4

1 2 3 4 5NMF on aggregated dataset

0

20

40

60

80

100

120

Coun

t of w

ords in

NMF To

pics C1 C2 D1 D2 D3 D4

Figure 6: Five topics identified by NMF using densespatial grid 64× 64

common topic and 150 documents for each spa-tially distinct topic, but this time with gradually in-creasing aggregation density. In Figure 6 we showresults of applying NMF set to discover 5 top-ics for the spatial aggregation scheme with a gridsize 64 × 64. As expected, the results look moresimilar to topic discovery from the non-aggregateddataset. Interestingly, despite the vary coarse ag-gregation (many spatial blocks were empty or witha single document), we still discovered topics D3and D4, which correspond to the smaller circles.

5 Experiments on Real Life Data

Identifying spatially distinct topics in a real lifedataset is a challenging task. As we will demon-strate, we found that the aggregation scheme isquite successful in identifying distinct topics. Weperformed our experiments on Hurricane SandyTwitter corpus downloaded through Twitter searchAPI1 using the tweet IDs released in (Wang et al.,2015). The downloaded corpus contains 4.7 mil-lion tweets that temporally span 12 days surround-ing the Hurricane Sandy and a few other distin-guishable events between October 22nd, 2012 andNovember 2nd, 2012. Every tweet in the datasetis also geotagged to one of 13 states along theEast Coast of the U.S. During preprocessing wetransformed all characters to lowercase and re-moved stopwords and special characters. We alsoexcluded repetitive letters that convey enthusiasm(e.g., birthdayy, birthdayyy, birthdayyyy). Finally,TF-IDF document-word matrix is constructed us-ing the 20, 000 most frequent words in the corpus.

Since the spatial distribution of tweets is highlyimbalanced, we decided not to use a regular spa-tial grid. Instead, we employed k-means cluster-ing on the latitude and longitude information foreach tweet to identify 200 cluster centers in space.Each tweet is assigned to its nearest cluster cen-ter for spatial aggregation. Figure 7 shows differ-ent clusters on 50, 000 tweets randomly sampled

1https://developer.twitter.com/

Figure 8: State specific distinct topics

from the corpus. We can observe that the densityof clusters is much larger within heavily populatedurban areas along the East Coast.

Figure 7: K-means cluster for spatial aggregation

NMF was employed to find 500 topics withα = 0.1 and ρ = 0.5. Only 107 rows of H werefound to have at least one nonzero entry. Appli-cation of NMF on the 200 aggregated documentsidentifies some spatially distinct topics coveringregions of varying size. Figure 8, shows wordclouds for two large state-specific distinct topics.We also found that large metropolitan areas suchas New York City, Philadelphia, and Pittsburghare represented as separate spatially distinct top-ics. One such example is shown in Figure 9. Al-most all the words in this topic are related to NewYork City airports.

In addition to spatial aggregation, we also per-formed experiments by aggregating data in spaceand time. In addition to the k = 200 spatial clus-ters we divided the time interval into 12 days, re-sulting in a total of 2, 400 spatio-temporally aggre-gated documents. As expected, this aggregationreveals distinct spatio-temporal topics.

We identified several purely temporal topics inthis way, including the Halloween topic shown inFigure 10. It is interesting to observe that this topicalso contains words related to the season opening

Page 7: Spatial Aggregation Facilitates Discovery of Spatial Topicsvucetic/documents/maiti2019acl.pdf · Spatial aggregation refers to merging of doc-uments created at the same spatial location

258

Figure 9: NYC airport-specific distinct topic

Figure 10: Halloween and CMA temporal topics

NBA game between L.A. Lakers and Miami Heatthat occurred on the same day. Figure 10 also con-tains another temporally distinct topic associatedwith the 2012 Country Music Association (CMA)Award event that happened on the same day.

To better illustrate this CMA-related topic, inTable 1 we show several representative tweets.These tweets were randomly selected from tweetscontaining at least one of the most frequent 10words in the CMA-related topic.

5.1 Evaluation: Space-Time Scan Statistics

Looking at word clouds is a descriptive way toevaluate the quality of discovered topics. In thissubsection we will present experimental results at-tempting to quantitatively evaluate the quality ofthe discovered topics. To achieve this we usethe space-time scan statistics implemented in theSaTScan software (Kulldorff, 2010). We selectedthe 10 most frequent words in each discoveredtopic and labeled each tweet from the corpus basedon the presence of these words. If a tweet con-tains any of the 10 words it is assigned to the cor-responding topic. We call all tweets assigned tothe given topic the positive tweets. If the topic

Table 1: Tweets related to CMA awards

Anyone know what channel the cma is on?Can’t wait for the cma awardsEveryone get prepared for a bunch of cma awardstweetsTomorrow is 46 cma awards so watching that!!carrie underwood is amazingHunter hayes is perfectNot sure why Taylor Swift is taking over the countrycharts...her music is more of a mix now between coun-try and popLuke bryan on the CMAS omg omg !!!

is strongly spatial, we would expect the assignedtweets to be strongly spatially clustered. If thetopic is strongly spatio-temporal, we would expectthe assigned tweets to cluster within a particularspatio-temporal area. The space-time scan statis-tic is employed to measure enrichment by posi-tive tweets of cylindrical windows covering a cir-cular spatial region and a temporal interval. Thecylindrical window is moved in space and time tosearch for the statistically strongest clusters (Kull-dorff, 2010). The cylinder with the strongest en-richment of positive tweets (e.g., based on the ratiobetween positive tweets and all tweets within thecylinder) is a potential candidate for the significantspatio-temporal cluster.

Distributional properties of scan statistics canbe used to evaluate the statistical significance ofthe strongest cylinder (Dwass, 1957). This isdone by permuting the labels of tweets multipletimes (999 times in this study) and calculating thescore of the strongest cylinder in each permutation(Block, 2007). The p-value is then calculated bycounting the fraction of the permuted scores largerthan the score on the actual data. The p-value re-ported in this experiment can be thought of as ameasure of the spatio-temporal distinctiveness ofthe identified topic.

Characterization of distinct topics using p-valuehas some limitations. We observed that manydistinct topics discovered through aggregation re-ceive p-value equal to zero, making it impossibleto identify the strongest distinct topic. For thisreason, we used deviation (∆), which measureshow many standard deviations apart is the score ofthe best cylinder observed on the actual data com-pared to the scores of the best cylinders observedon the permuted data.

Table 2: Evaluation of the topic quality using SaTScan

Topic General Theme Deviation (∆) Topic TypePower 26504.53 TemporalNYC 25282.17 SpatialNFL 12275.18 Temporal

Presidential Debate* 11089.34 TemporalSnow 8624.95 Temporal

New Jersey* 8355.10 SpatialHalloween* 7679.58 Temporal

Pennsylvania* 6728.94 SpatialNYC Airport* 6424.54 Spatial

Weather 2220.64 Temporal

In Table 2, we show the strongest topics basedon the deviation (∆). In each case, the p-valuewas 0. For topics labeled with stars in Table 2,

Page 8: Spatial Aggregation Facilitates Discovery of Spatial Topicsvucetic/documents/maiti2019acl.pdf · Spatial aggregation refers to merging of doc-uments created at the same spatial location

259

Figure 11: Positive (red) and negative (blue) examplesand the position of the cluster identified by SaTScanfor topic : Power Outage

the corresponding word clouds were shown in Fig-ures 1, 8, 9, 10. For the remaining topics, thetop ten words are presented in Table 3. It maybe noted that New York City, being a very largemetropolitan area, has multiple identified topics.One such topic, called NYC airport, was previ-ously presented in Figure 9. Another such topic,called NYC, is presented in Table 2. The spatio-temporal region called Power outage is shown inFigure 11. 10, 000 tweets in this figure are labeledas positive or negative based on the presence orabsence of the keywords of this topic. This topiccorresponds to multiple power outages in the af-termath of the Sandy Hurricane.

Table 3: General theme of topics and related words

Topics WordsPower power sandy generator trees electricity tree open

lights safe hurricaneNYC york brooklyn nyc park manhattan city square

mta island halloweenNFL cowboys steelers romo giants harden church red-

skins touchdown eagles partySnow snow snowing cold weather delay boone wind

blizzard snowed outsideWeather barometer humidity temperature mph wind rain

blacksburg steady wnw rising† Offensive words are removed

5.2 Comparison between LDA and LSAPrevious studies indicate that NMF on Twitter dataworks better than other available topic modelingtechniques (Klinczak and Kaestner, 2015), (God-frey et al., 2014). This may be attributed to aslightly better robustness of NMF to the short doc-ument lengths. This problem is ameliorated in this

0 20 40 60 80 100Topics

0

5000

10000

15000

20000

25000

30000

35000

Distanc

e from

mea

n (in std) Comparison of NMF, LDA and LSA

NMFLSALDA

Figure 12: Comparison of NMF, LDA and LSA

study through aggregation. In view of this, it isexpected that other topic modeling approaches arealso able to identify distinct topics in the aggre-gated data. To verify this, we tried two other popu-lar algorithms, LSA and LDA.2 LSA is a truncatedsingular value decomposition technique. LDA isa generative probabilistic model. For LDA andLSA, the number of topics are taken to be 100 tobe comparable to the number of topics identifiedby NMF. Document topic prior and topic word pri-ors in LDA were set to 0.01.

We found that LDA and LSA identify distincttopics comparable to NMF when applied to thespatio-temporally aggregated data. Some of thesimilar topics are selected manually from the LDAand NMF topic lists and shown in Table 4 for com-parison.

Table 4: Identified LDA topics similar to NMF

Topics WordsNYC york park brooklyn city pic nyc hal-

loween st th center square street barPower power sandy hurricane storm safe

phone wind stay rain closed openWeather wind mb mph rain humidity cb

barometer temp slowly cam mid-night falling relative

Presidential Debate romney obama debate class mittpresident world vote talking weekpolicy

† Offensive words are removed

It is difficult to draw one-to-one correspondenceamong all the topics identified by the three meth-ods. We see from Table 4 that some topics are verysimilar in both NMF and LDA. However, whileNMF discovers a topic related to the CMA, LDAand LSA do not. For this reason, instead of com-paring the corresponding topics one at a time, thefollowing strategy is applied. The topics in the

2python scikit-learn package is used for all three methods

Page 9: Spatial Aggregation Facilitates Discovery of Spatial Topicsvucetic/documents/maiti2019acl.pdf · Spatial aggregation refers to merging of doc-uments created at the same spatial location

260

three methods are first sorted based on the devi-ation (∆) scores and plotted in Figure 12. Themost significant topics identified by all three algo-rithms exhibit similar scores. For the top 20 top-ics, performance of NMF is only slightly better.The average score of the top 20 topics for NMF is3,823, while the average scores for LSA and LDAare 3,638 and 3,390 respectively.

5.2.1 Common Topics in LDA and NMF

In Section 5.2, we mentioned that topic discoveryalgorithms such as LDA, NMF, and LSA are capa-ble of finding distinct topics from aggregated doc-uments. When non-aggregated data is used, thesealgorithms find common topics associated withday to day conversations. In Table 5, words as-sociated with several common topics identified byLDA and NMF on a sample of the non-aggregatedtweets are shown. It can be seen that the words inthe identified topics do not correspond to a specificspace or time.

Table 5: Common topics from non-aggregated data

LDA NMFcold shot dry blessedsmoking wonderful

cold weather hot room hun-gry feet world

making sounds coffeerunning

fun sounds making lot timessafe games looks

talking saw anymorewest facebook

twitter goodmorning jailfacebook instagram

guy past means throwstart

guys girl safe play awesomestay

† Offensive words, informal words and internet short formof the words are removed

5.2.2 Influence of Aggregation Strategies andRandomization

Our experiments with the simulated data in Sec-tion 4.3 revealed that topic discovery is impactedby the aggregation grid density. To see if thebehavior transfers to Twitter data, we varied thenumber of clusters from 100 and 1, 000. As thenumber of clusters increased, we observed thatsome of the distinct topics discovered by NMF fork = 200 disappeared when k was increased to500 or 1, 000. For example, the CMA topic disap-peared with those larger numbers of clusters. Wealso observed relatively small changes in discov-ered topics for different runs of the clustering forthe same value of k. We conclude that clusteringused for aggregation has a modest impact on topicdiscovery.

Figure 13: Visualization of temporal trends of topics

5.3 Temporal Trends in Topics

SaTScan reports the significant space-time cylin-ders for each topic. It is possible to categorizethose cylinders as spatial or temporal by inspect-ing the their size. As an alternative, we can usematrix W obtained by NMF to identify tempo-ral clusters. Let W ∗ be the matrix which is con-structed from W by summing all the rows cor-responding to the same time interval. W ∗ thenrepresents a purely temporal description of topicdistribution. By inspecting the columns of W ∗,shown in Figure 13, we can obtain an additionalinsight into the nature of temporal topics. We canobserve that only a small fraction of the identifiedtopics are strongly temporal in nature.

6 Conclusion

In this work, we showed that spatial aggregationof documents leads to discovery of spatially dis-tinct topics. We performed an extensive study onsynthetic and real data and demonstrated that spa-tial and spatio-temporal aggregation indeed leadsto discovery of spatial and spatio-temporal distincttopics. To evaluate the quality of the discoveredtopics we proposed a metric based on space-timescan statistics. Our results show that aggregationis a very powerful and computationally efficientmethod for discovery of distinct topics. While ourstudy focused on spatial aggregation, aggregationon other types of metadata such as authors, hash-tags, or communities is expected to work equallywell and discover other types of distinct topicsfrom large collections of documents.

Page 10: Spatial Aggregation Facilitates Discovery of Spatial Topicsvucetic/documents/maiti2019acl.pdf · Spatial aggregation refers to merging of doc-uments created at the same spatial location

261

ReferencesDavid Alvarez-Melis and Martin Saveski. 2016. Topic

modeling in twitter: Aggregating tweets by conver-sations. ICWSM, 2016:519–522.

David M Blei, Andrew Y Ng, and Michael I Jordan.2003. Latent dirichlet allocation. Journal of ma-chine Learning research, 3(Jan):993–1022.

Richard Block. 2007. Scanning for clusters in spaceand time:: A tutorial review of satscan. Social Sci-ence Computer Review.

Xueqi Cheng, Xiaohui Yan, Yanyan Lan, and JiafengGuo. 2014. Btm: Topic modeling over short texts.IEEE Transactions on Knowledge and Data Engi-neering, 26(12):2928–2941.

Minsuk Choi, Sungbok Shin, Jinho Choi, ScottLangevin, Christopher Bethune, Philippe Horne,Nathan Kronenfeld, Ramakrishnan Kannan, BarryDrake, Haesun Park, et al. 2018. Topicontiles: Tile-based spatio-temporal event analytics via exclusivetopic modeling on social media. In Proceedingsof the 2018 CHI Conference on Human Factors inComputing Systems, page 583. ACM.

Meyer Dwass. 1957. Modified randomization tests fornonparametric hypotheses. The Annals of Mathe-matical Statistics, pages 181–187.

Salvatore Giorgi, Daniel Preotiuc-Pietro, Anneke Buf-fone, Daniel Rieman, Lyle H Ungar, and H AndrewSchwartz. 2018. The remarkable benefit of user-level aggregation for lexical-based population-levelpredictions. arXiv preprint arXiv:1808.09600.

Daniel Godfrey, Caley Johns, Carl Meyer, Shaina Race,and Carol Sadek. 2014. A case study in text min-ing: Interpreting twitter data from world cup tweets.arXiv preprint arXiv:1408.5427.

Liangjie Hong and Brian D Davison. 2010. Empiricalstudy of topic modeling in twitter. In Proceedings ofthe first workshop on social media analytics, pages80–88. ACM.

Bo Hu, Mohsen Jamali, and Martin Ester. 2013.Spatio-temporal topic modeling in mobile social me-dia for location recommendation. In Data Mining(ICDM), 2013 IEEE 13th International Conferenceon, pages 1073–1078. IEEE.

Hannah Kim, Jaegul Choo, Jingu Kim, Chandan KReddy, and Haesun Park. 2015. Simultaneous dis-covery of common and discriminative topics viajoint nonnegative matrix factorization. In Proceed-ings of the 21th ACM SIGKDD International Con-ference on Knowledge Discovery and Data Mining,pages 567–576. ACM.

Marjori NM Klinczak and Celso AA Kaestner. 2015. Astudy on topics identification on twitter using clus-tering algorithms. In Computational Intelligence(LA-CCI), 2015 Latin America Congress on, pages1–6. IEEE.

M Kulldorff. 2010. Satscan user guide for version9.0. Department of Ambulatory Care and Preven-tion, Harvard Medical School, Boston, MA.

Rishabh Mehrotra, Scott Sanner, Wray Buntine, andLexing Xie. 2013. Improving lda topic models formicroblogs via tweet pooling and automatic label-ing. In Proceedings of the 36th international ACMSIGIR conference on Research and development ininformation retrieval, pages 889–892. ACM.

Robertus Nugroho, Weiliang Zhao, Jian Yang, CecileParis, and Surya Nepal. 2017. Using time-sensitiveinteractions to improve topic derivation in twitter.World Wide Web, 20(1):61–87.

Xuan-Hieu Phan, Le-Minh Nguyen, and SusumuHoriguchi. 2008. Learning to classify short andsparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17thinternational conference on World Wide Web, pages91–100. ACM.

Dear Sungbok Shin, Minsuk Choi, Jinho Choi, ScottLangevin, Christopher Bethune, Philippe Horne,Nathan Kronenfeld, Ramakrishnan Kannan, BarryDrake, Haesun Park, et al. 2017. Stexnmf: Spatio-temporally exclusive topic discovery for anomalousevent detection. In Data Mining (ICDM), 2017IEEE International Conference on, pages 435–444.IEEE.

Enrico Steiger, Joao Porto De Albuquerque, andAlexander Zipf. 2015. An advanced systematic lit-erature review on spatiotemporal analyses of twitterdata. Transactions in GIS, 19(6):809–834.

Asbjørn Steinskog, Jonas Therkelsen, and BjornGamback. 2017. Twitter topic modeling by tweetaggregation. In Proceedings of the 21st Nordic Con-ference on Computational Linguistics, pages 77–86.

Haoyu Wang, Eduard H Hovy, and Mark Dredze. 2015.The hurricane sandy twitter corpus. In AAAI Work-shop: WWW and Public Health Intelligence.

Xidao Wen, Yu-Ru Lin, and Konstantinos Pelechrinis.2016. Pairfac: Event analytics through discrim-inant tensor factorization. In Proceedings of the25th ACM International on Conference on Informa-tion and Knowledge Management, pages 519–528.ACM.

Jianshu Weng, Ee-Peng Lim, Jing Jiang, and Qi He.2010. Twitterrank: finding topic-sensitive influen-tial twitterers. In Proceedings of the third ACM in-ternational conference on Web search and data min-ing, pages 261–270. ACM.

Xiaohui Yan, Jiafeng Guo, Shenghua Liu, Xue-qiCheng, and Yanfeng Wang. 2012. Clustering shorttext using ncut-weighted non-negative matrix fac-torization. In Proceedings of the 21st ACM inter-national conference on Information and knowledgemanagement, pages 2259–2262. ACM.

Page 11: Spatial Aggregation Facilitates Discovery of Spatial Topicsvucetic/documents/maiti2019acl.pdf · Spatial aggregation refers to merging of doc-uments created at the same spatial location

262

Xiaohui Yan, Jiafeng Guo, Shenghua Liu, XueqiCheng, and Yanfeng Wang. 2013. Learning topicsin short texts by non-negative matrix factorizationon term correlation matrix. In Proceedings of the2013 SIAM International Conference on Data Min-ing, pages 749–757. SIAM.

Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He,Ee-Peng Lim, Hongfei Yan, and Xiaoming Li. 2011.Comparing twitter and traditional media using topicmodels. In European Conference on InformationRetrieval, pages 338–349. Springer.