On-line Clustering for Real-Time Topic Detection in Social Media Streaming …ceur-ws.org/Vol-1150/popovici.pdf · 2014-04-24 · On-line Clustering for Real-Time Topic Detection

On-line Clustering for Real-Time Topic Detection

in Social Media Streaming Data

Robert Popovici, Andreas Weiler, and Michael GrossniklausDatabase and Information Systems Group, University of Konstanz

P.O. Box 188, 78457 Konstanz, [email protected]

Abstract

The continuous growth of social networks andthe active use of social media services resultin massive amounts of user-generated data.Worldwide, more and more people report anddistribute up-to-date information about al-most any topic. At the same time, there isan increasing interest in information that canbe gathered from this data. The popular-ity of new services and technologies that pro-duce and consume data streams imposes newchallenges on the analysis, namely, in termsof handling high volumes of noisy data inreal-time. Since social media analysis is con-cerned with investigating current topics andactual events around the world, there is a pro-nounced need to detect topics in the data andto directly display their occurrence to analystsor other users. In this paper, we present anon-line clustering approach, which builds ontraditional data mining methods to addressthe new requirements of data stream mining:(a) fast incremental processing of incomingstream objects, (b) compactness of data rep-resentation, and (c) efficient identification ofchanges in evolving clustering models.

1 Introduction and Motivation

The social network platform Twitter is a main pro-ducer of large volumes of data as a continuous stream.Over 140 million registered users and about 340 million

Copyright c⃝ by the paper’s authors. Copying permitted onlyfor private and academic purposes.

In: S. Papadopoulos, D. Corney, L. Aiello (eds.): Proceedingsof the SNOW 2014 Data Challenge, Seoul, Korea, 08-04-2014,published at http://ceur-ws.org

short messages, called “tweets”, per day make Twitterthe undisputed market leader in social microbloggingtoday. In its initial stages, Twitter was intended tobe a service where people could update their statusby posting short messages. Twitter prompted usersto answer a simple question “What are you doing?”and thus the users reported their actual activities, feel-ings, and experiences of their everyday life. As Twit-ter gained significance and users started exchangingmatters reaching beyond one’s personal status, it wasdecided in November 2009 to change the question to amore general one “What’s happening?”1. The inten-tion of the new question is to engage users in report-ing and publishing current news and events happeningin the world. The consequence of this change is thatTwitter has developed into a vast source of informa-tion that contains a mixture of all kinds of data.

Due to the diversity of the information provided,Twitter even plays an increasingly important role asa source for news agencies. In fact, news agencies useTwitter for two important functionalities in their dailyactivity. First, it is used as a publication and distri-bution platform for current news articles with a highthroughput rate. For example, any reproduction of atweet (“retweet”) reaches an average of about 1,000users [Kwa10]. Second, news agencies, such as BBC2,are constantly increasing the usage of Twitter as a ref-erence in their daily news reports [Ton12].

A further characteristic of Twitter is its vibrantuser community with a wide range of different per-sonalities from all over the world. It has been shownthat this whole spectrum can be sub-divided into afew categories of Twitter usage patterns, such as dailychatter, information and URL sharing, or news report-ing [Jav07].

Further research undertaken has discovered that the

1http://blog.twitter.com/2009/11/whats-happening.

html2http://www.bbc.com/

majority of users publish messages focusing on theirpersonal concerns and matters, whereas a smaller setof users publish for information sharing [Naa10]. Thisvariety of content in the information flow leads tothe primary task of detecting significant messages inthe clutter of tweets. Because of the fast broadcast-ing manner of Twitter, important news spread rapidlythrough the social network.

2 Topic Detection

Most traditional data mining methods such as K-means, DBSCAN, or OPTICS are not designed to beapplied directly to data streams because of their infi-nite nature and the requirement for single pass evolu-tionary processing. In this paper we focus on a newstream mining method based on traditional data min-ing methods to address the new requirements of datastream mining: (a) fast incremental processing of in-coming stream objects, (b) compactness of data rep-resentation, and (c) efficient identification of changesin evolving clustering models.

The proposed algorithmic idea relies on an extendedconcept of density-based clustering over an evolvingdata stream with noise (DenStream [Cao06]) with en-hanced applicability for categorical data. We designedthe on-line component of the extended DenStream al-gorithm to include the major ideas of the classical Den-Stream algorithm and added some new features andfunctionalities.

Similar to the ideas of the classical DenStream algo-rithm, a set of core and outlier micro-clusters is main-tained incrementally with the role of outlier and coremicro-clusters being often exchanged as a consequenceof outdated micro-clusters fading into outliers and newmicro-clusters being formed. To speed up processing,an outlier buffer is used to separate the processing ofcore micro-clusters and the outliers (micro-clusters at-tracting very few data objects for extensive time inter-vals). We also extended the general macro-clusteringapproach with a lightweight variant of the DBSCANalgorithm, which is applied on the micro-clusters asvirtual points.

With a view toward achieving a both efficient andaccurate estimate of the centroid of the clustering wepropose a new approach that uses cluster feature vec-tors with sufficient summary statistics as components.We use POS tagging to extract a number of relevantfeatures per cluster, the set of selected features consist-ing mainly of common and proper noun structures.Inorder to be able to detect new trends in a steadilyevolving stream, an incoming data object is assignedto the nearest cluster based on the average of the clos-est similarity values to the cluster summaries attainedby previous objects in the stream. Since the tweet

��

�.)3%$ �3!3%2 %70%,2 �%.%94%,!. $)0,/-!32 3)3�&/1�

3!3

��

�!3#( %.)3 �/1422)! �/13-4.$ �/!,2 �)'(,)'(32 �� /4�4"%

��

�%.*!-). �%3!.8!(4 #!232 4.&/134.!3% 2(!$/6 .'%,!

�%1+%,

��

�.5/8 �),,%$ �81)! �)5!, �%"%, �1/40

��

!.3)��!.4+/58#( 01/3%23%1 0!22%2 "!11)#!$% �+1!).%

01/3%232 0)#341%2

��

-/23�6!.3%$ $14' +).'0). �(!0/ #!0341%$ �%7)#/ ��8%!1

-!.(4.3

��

�43). �1$%12 �/-"!3 �1//02 ,%13 �%!1 �+1!).%

��

�%!$ -%1)#!. -%$)! �81)! 0/04,!1 1%5/,43)/. &/1#).'

��

�)#(!%, $%"/,!*/ 2%.3%.#% -41$%1 �)'"8 $%"/6!,% -).)-4- *!), 3%1- 8%!12

��

�!22)5% #1/6$ �!)$!. !6!)3).' !../4.#%-%.3 '/5%1.-%.3

Figure 1: Timeline of 10 sample topics

objects can be very small in size, vector componentsconsist of the inverse cluster frequency of a selectedfeature combined with the cluster frequency of thatsame feature. Effectively, the frequency of a selectedfeature in the cluster is offset by the frequency of thatsame feature across all documents in the cluster.Werefer to these vectors as CF-ICF vectors.

The number of selected vector components to bemonitored in a given cluster turns out to be exponen-tial to the number of selected unique features, andtherefore only a small subset which represents the fre-quent features needs to be kept. Infrequent featuresare removed from the vector representation by meansof dimensionality reduction to speed up the process-ing. This also avoids excessive storage and, at thesame time, simplifies and summarizes the incomingdata, achieving a convergence effect that contributesto reaching a steady distribution of topics. Compu-tation of similarity is done using the cosine similaritymetric.

Micro-clusters are maintained incrementally. Effec-tively, the number of points and the linear sum of termfrequencies of the micro-clusters are continuously up-dated.

We consider the problem of clustering a data streamin the damped window model, in which the weightof each stream object decreases exponentially withtime t via an exponential fading function f(t) = 2−α·t,where α is a constant called the decay factor andα > 0. The fading function controls the importance ofthe historical data compared to the most recent databy taking into account the timestamp of the last up-date to the clustering. The higher the value of α, thegreater emphasis is placed on the more recent data.

The overall weight W of all stream objects is nearlyconstant, verified by applying a geometric series to it

W = v ·tn∑t=0

t→∞

−−−→ v

1− 2−α,

where v is the speed of the stream.During the on-line part we distinguish between po-

tential core-micro-clusters, if w ≥ βµ and outliermicro-clusters, with w ≤ βµ, where β is the outlierthreshold and µ the minimum overall weight for a coremicro-cluster.

Effectively, for time interval t, if no points aremerged into a micro-cluster the weight decreases

MC = ( 2−α·tw,LS, tc),

where LS is the linear sum of the term frequencies andtc is the creation timestamp of the micro-cluster.

If a data point p is merged the updated micro-cluster is defined as

MC = ( w + 1, LS + 1, tc).

In order to be able to keep track of the evolutionof interesting sub-topics as part of a major topic, weintroduce the notion of sub-clusters that are incremen-tally maintained within core micro-clusters in a waysimilar to which micro-clusters are maintained. Specif-ically, incoming stream objects are reassigned to theclosest sub-cluster by comparing them to vector rep-resentations of the sub-cluster summaries.

The sub-cluster summaries consist of the numberof data points contained, the linear sums of a feature(LS), the linear sums of occurrences of a feature perwindow (LSW ) and the linear sums of co-occurrencesper feature (LSC).

We distinguish between potential core-sub-clusters(p-sub-cluster), if w ≥ βµ and N ≥ min and outliersub-clusters (o-sub-cluster), if w ≤ βµ, where N is thenumber of data objects in the sub-cluster, min theminimum number of objects required for a core sub-cluster,β is the outlier threshold and µ the minimumoverall weight for a core sub-cluster.

For time interval t, if no points are merged into asub-cluster, the weight decreases

SC = ( 2−α·tw,LS,LSW,LSC, tc).

If a data point p is merged the updated sub-clusteris defined as

SC = ( w + 1, LS + 1, LSW + 1, LSC + 1, tc).

If an outlier micro-cluster has attracted sufficientdata to be converted into a core micro-cluster, dataobjects that have been assigned to the latter are redis-tributed to underlying sub-clusters. This effectivelymeans that an incoming stream object that has beenassigned to a nearest micro-cluster is reassigned to itsnearest sub-cluster, unless the closest similarity valueis considerably lower than the values attained by pre-vious stream objects. In order to be able to determinewhether the closest similarity value is considerably be-low the one previously attained, the mean of the lastthree closest similarity values to the sub-cluster sum-maries is maintained. The similarity values are addi-tively maintained, to increase efficiency. Algorithm 1defines the extended merging procedure, which is alsovisualized in Figure 3.

The potentially unbounded nature and uncertainarriving speed of data streams along with the require-ment of single pass scanning imposes a limited space(memory) and a strict time constraint to the imple-mentation of the data stream processing. Therefore,a checking strategy is performed every Tp time steps,where Tp is defined as the minimal timespan for a clus-ter fading into an outlier. This ensures that outdatedclusters that have either received few data or have hadtheir weight reduced by the decay factor α are pruned.

Algorithm 1 Extended DenStream: Merging tech-nique

Require: ϵ1 ≥ 0 ≤ 1, ϵ2 ≥ 0 ≤ 1;{1}: Try to merge p into its nearest p-micro-clustercp;if dp (the closest similarity value) ¡ ϵ1 thenMerge p into cp;{2}: Try to merge p into the nearest p-sub-clustercsp of p-micro-cluster cp;if dsp (the closest similarity value) ¡ ϵ2 then

Merge p into csp;else

{3}:Try to merge p into the nearest o-sub-cluster cso of p-micro-cluster cp;if dso (the closest similarity value) ¡ ϵ2 thenMerge p into cso;

elseCreate a new o-sub-cluster containing p

end ifend if

else{4}:Try to merge p into its nearest o-micro-clusterco;if do (the closest similarity value) ¡ ϵ1 then

Merge p into co;if w (the new weight of co) ¿ βµ thenConvert co into a p-micro-cluster and createa new o-sub-cluster with all stream objects ofthe converted o-micro-cluster;

elseCreate a new o-micro-cluster containing p

end ifend if

end if

Otherwise, they will take up a lot of memory space,and either the clustering result may contain outdateddata with the immediate effect of lessening the evolv-ing character of the data stream or clusters consistingof outliers will combine data that should not be inthe same cluster into a same cluster in subsequentlymerging micro-clusters, thus decreasing clustering ef-ficiency.

The weight of outlier micro-clusters and outlier sub-clusters is compared against

θ(t) =2−α(t−tc+Tp) − 1

1− 2−α,

where tc is the creation timestamp of the outlier micro-cluster and Tp is the minimal timespan for a micro-cluster/sub-cluster fading into an outlier. Outlier sub-clusters that have been turned into core sub-clusterswill have a lifespan at least as long as the core micro-cluster to which they belong.

The longer an outlier micro-cluster or outlier sub-cluster exists, the higher its expected weight

limtc→∞

θ(tc) =1

1− 2−α·Tp= βµ.

Based on this assumption, the cumulated maximalnumber of micro-clusters and sub-clusters in mem-ory is W

β·µ , where W is the overall weight of the datastreams and β·µ acts as the filtering parameter. There-fore, the runtime complexity of the extended Den-stream algorithm is O( W

β·µ+x), where x is the length of

the stream and Wβ·µ is the maximal cumulated number

of core micro-clusters and core sub-clusters in mem-ory. As a consequence of the pruning strategy and thedimensionality reduction, memory increases only loga-rithmically with stream length. The pruning techniqueused by our algorithm is shown in Figure 2, where N isthe number of objects and min the minimum numberof objects.

To handle the case where micro-clusters created in-dependently might at some point during the clusteringturn out to contain topics that are semantically relatedwe implemented a modified lightweight variant of theDBSCAN algorithm that runs periodically (every Nstream objects, with N typically set to 10,000) on thelive set of micro-clusters as virtual points. The intu-ition behind this is based on the symmetric propertyof density-connectedness of the DBSCAN algorithm.

Apart from the immediate effect of semantic com-paction of the topic distribution, the macro-clusteringphase (see Figure 2) also effectively reduces the num-ber of micro-clusters to be processed. While the coresub-clusters of the merged micro-clusters are added tothe set of the existing core sub-clusters of the mergingmicro-cluster, the set of outlier sub-clusters effectively

Figure 2: Extended DenStream: pruning technique, DBSCAN macroclustering phasecontribute only their summary statistics to the merg-ing micro-cluster. The notion of distance translatesto the number of common top keywords defining thestable distribution of the micro-clusters at some pointduring the clustering, thus effectively replacing the ep-silon parameter required for the DBSCAN algorithm.

3 Evaluation

The SNOW challenge [Pap14] data set was collectedfor a 24-hour time frame from Tue 25 Feb 2014, 18:00GMT to Wed 26 Feb 2014, 18:00 GMT. The collectioncontains a total of 1,041,227 tweets with an average ofabout 723 tweets per minute and about 10,846 tweetsper 15-min window.

The evaluation was performed on a computer withIntel i7 CPU and 6 GB main memory running the 64-bit Eclipse Indigo Platform on 64-bit Windows 8. Weimplemented our algorithm3 in Java. We used a well-known language detector [Shu10], and a tokenizer andpart-of-speech tagger for Twitter [Gim11], with train-ing data of manually labeled annotated tweets and hi-erarchical word clusters from unlabeled tweets. Fur-thermore, we used a standard English stop word list toremove repeating terms and and simple plural stem-ming to match the different forms of terms with eachother. The n-gram signature of a topic consisted of the

3Available here: http://bit.ly/1qeGNry

top keywords within the range of at least 1.5 standarddeviations away from the mean derived from the listof term frequencies per cluster.

The tweet with the highest degree of semantic rel-evance within a cluster (that is, having the maximalsimilarity value to the cluster summaries) was selectedas the topic headline. The selected tweet was parsedusing a less restrictive configuration of the POS tag-ger, with the extracted tokens reassembled in the samesyntactic order in which they were originally processedto ensure minimum semantic coherence.

Micro-clusters pruned away by the exponential fad-ing function were written to a separate file and thenjoined with the list of active micro-clusters to producethe final result.

The final results consist of a total number of 210topics (see Figure 1 for a sample) over the whole 24-hour time frame with a significance factor of at least200 tweets per cluster. A detected topic had to be atleast 150–200 tweets in length or span over a 15 minuteinterval. The top keywords were derived from the termfrequency lists maintained per window interval. Thetop tweets were selected based on the closest similarityvalues between incoming tweets to the cluster vectorswithin the first window interval until convergence wasattained. The same procedure was applied for findingthe pictures, which are associated to a topic.

In particular, topics referring to the political up-

Figure 3: Extended DenStream: merging technique

heaval in Ukraine, the Bitcoin exchange shutdownsdue to alleged hacker theft, the clashes between rebelsand the Syrian government forces in Syria and theChampions League results were most prominent in thetopic distribution, containing more than 12,000 tweets.Since the major topics (mostly macro-clusters foundby the DBSCAN algorithm) spanned over large timeintervals yet contained a large diversity of sub-topicsthat were sufficiently different from each other, onlythe contained sub-topics were written to the result fileto meet the interval requirement. For each sub-cluster,15-minute intervals were output for which there was asignificant difference in the n-gram signature betweentwo successive windows of the respective sub-cluster(to avoid duplicates).

An example of a major topic with component sub-topics are the events revolving around the Syrian con-flict in general, e.g., the major ambush involving rebelsin Damascus, Germany monitoring jihadis in battle-hardened Syria, the photos of the Yarmouk refugeecamp in Syria, Syrian al Qaeda giving rival rebel groupan ultimatum. This kind of approach might prove use-ful in helping journalists gain more insight into ongoingevents and perhaps acquire a better understanding ofthe significance of more complex events by assessingtheir impact on a more global scale while at the sametime allowing them to maintain the focus on the moredetailed aspects of those events.

The official evaluation results of our method inthe Data Challenge are included in Papadopoulos etal. [Pap14].

References

[Cao06] Cao F., Ester M., Qian W., Zhou A.:Densitybased Clustering over an EvolvingData Stream with Noise. In Proc. Intl.SIAM Conf. on Data Mining (SDM) (2006),pp. 328–339.

[Gim11] Gimpel K., Schneider N., O’ConnorB., Das D., Mills D., Eisenstein J.,Heilman M., Yogatama D., FlaniganJ., Smith N. A.: Part-of-Speech Taggingfor Twitter: Annotation, Features, and Ex-periments. In Proc. Annual Meeting of theAssociation for Computational Linguistics:Human Language Technologies: Short Pa-pers (HLT) (2011), pp. 42–47.

[Jav07] Java A., Song X., Finin T., TsengB.: Why We Twitter: Understanding Mi-croblogging Usage and Communities. InProc. Intl. Workshop on Web Mining andSocial Network Analysis (2007), pp. 56–65.

[Kwa10] Kwak H., Lee C., Park H., Moon S.:What is Twitter, a Social Network or aNews Media? In Proc. Intl. Conf. on WorldWide Web (2010), pp. 591–600.

[Naa10] Naaman M., Boase J., Lai C.-H.: Is ItReally About Me?: Message Content in So-cial Awareness Streams. In Proc. Intl. Conf.on Computer Supported Cooperative Work(CSCW) (2010), pp. 189–192.

[Pap14] Papadopoulos S., Corney D., AielloL. M.: SNOW 2014 Data Challenge: As-sessing the Performance of News Topic De-tection Methods in Social Media. In Proc.SNOW 2014 Data Challenge (2014).

[Shu10] Shuyo N.: Language Detection Libraryfor Java. http://code.google.com/p/

language-detection/, 2010.

[Ton12] Tonkin E., Pfeiffer H. D., Tourte G.:Twitter, Information Sharing and the Lon-don Riots? Bulletin of the American Societyfor Information Science and Technology 38,2 (2012), 49–57.

Documents

On-line Clustering for Real-Time Topic Detection in Social Media Streaming …ceur-ws.org/Vol-1150/popovici.pdf · 2014-04-24 · On-line Clustering for Real-Time Topic Detection