
SeqCluSum - Combining Sequential Clustering and Contextual … · 2016-02-19 · SeqCluSum: Combining Sequential Clustering and Contextual Importance Measuring to Summarize Developing



SeqCluSum: Combining Sequential Clustering and Contextual Importance Measuring to Summarize Developing Events over Time

Markus Zopf

Research Training Group AIPHES / Knowledge Engineering Group
Department of Computer Science, Technische Universität Darmstadt

Hochschulstraße 10, 64293 Darmstadt, Germany
[email protected]

Abstract

Unexpected events such as accidents, natural disasters and terrorist attacks represent an information situation where it is essential to give users access to important and non-redundant information as fast as possible. In this paper, we introduce SeqCluSum, a temporal summarization system which combines sequential clustering to cluster sentences with a contextual importance measurement to weight the created clusters and thereby identify important sentences. We participated with this system in the TREC Temporal Summarization track, where systems have to generate extractive summaries for developing events by publishing sentence-length updates extracted from web documents. The results show that our approach is very well suited for this task, achieving the best results. We furthermore point out several possibilities to further enhance the system.

1 Introduction

Events like accidents, natural disasters and terrorist attacks create a demanding information situation. Shortly after the event has occurred, the information situation is usually unclear. There might be some first vague information available, for example that an earthquake has occurred, but details like the magnitude, the epicenter, and whether a tsunami has to be expected are not known at this early point in time. More information becomes available later, when details about the event are published by the media.

Traditional summarization approaches fail due to this unique characteristic of these information access problems. But especially during such crisis events, the affected people urgently need information. The fastest information channel is the internet, where information is distributed via news sites, blogs, and microblogging services. Unfortunately, the internet is not only the fastest, but also a highly redundant, bulky, and unreliable information source. This makes it impossible for an end-user to monitor this stream of news and extract important information in a timely manner.

A system which aims to aid people in retrieving information from the web document stream has to deal with several difficulties. Since there are vast amounts of documents on the web, a system has to be highly efficient in order to process them. A system has to filter out unrelated documents, since the majority of the documents appearing on the web in the investigated timespan will not be related to the event and can be considered noise. But noise on the document level is not the only problem. Even relevant documents contain a lot of unimportant information like unrelated news articles, advertisements, and navigation elements. This within-document noise has to be removed as well. Since information about big events is usually repeated on different web sites several times, the system has to deal with a highly redundant information situation. Another critical aspect is the trustworthiness of the sources. Several news pages publish uncertain information, and this tendency is even higher in weblogs and on microblogging sites. The last, but also one of the most crucial challenges during such events, is the timely detection of new information. Since the users urgently need information, new information has to be detected as fast as possible.

In this paper we present SeqCluSum, a system which is able to help users satisfy their information needs in the previously described situations. We mainly address the problems of within-document noise, redundancy, and timely detection. Efficiency is only partially considered, since we assume a situation where our system is applied


Figure 1: System overview. The timely-ordered documents d1, d2, . . . run through a boilerplate removal and stemming system in step 1. Only relevant sentences should pass this step. The documents are marked with timestamps u, u + 1, . . .. In step 2, a sentence s is either added to the existing clusters c1 or c2 or creates a new cluster, depending on the similarity of s to c1 and c2. Step 3 illustrates the publishing of a sentence. The sentence is added to the summary with time marker u and the cluster is set to published.

after an information retrieval step where irrelevant documents have already been filtered out. Our system is, as in most summarization scenarios, not concerned with trustworthiness. Therefore, we assume that all documents which are passed to our system can be considered trustworthy.

An overview of the system is given in Figure 1. First, we use two preprocessing steps (Figure 1, step 1) which remove boilerplate content and stem all tokens. These steps are further described in Section 3. Second, we use sequential clustering, described in Section 4.1, to cluster similar sentences together (Figure 1, step 2). Last, we weight the clusters after processing a document according to their contextual importance and publish updates (Figure 1, step 3). This step is described in detail in Section 4.2.

With this system, we participated in the Temporal Summarization track of the Twenty-Fourth Text REtrieval Conference (TREC). We describe the shared task in detail in Section 5 and our results in Section 5.4.

Due to time limitations, we did not push all components of our system to the limit. Therefore, we describe some improvement possibilities in Section 7 from which we expect a further improvement of performance.

2 Related Work

Early work on document summarization started with Luhn (1958), Baxendale (1958), and Edmundson (1969), who used word frequencies, position features, and key phrases to detect important sentences.

McKeown and Radev (1995) introduced the problem of multi-document summarization. Temporal summarization in the TREC-TS challenge is strongly related to extractive multi-document summarization (Nenkova and McKeown, 2011), since the systems are only allowed to extract sentences from the source documents verbatim. Radev et al. (2004) presented an open-source multi-document summarization system, MEAD, which is able to summarize large news topics by clustering sentences and finding centroids. Mihalcea and Tarau (2004) used a graph-based method to extract text for summarization. Carbonell and Goldstein (1998) introduced the greedy approach MMR to extract topic-related and non-redundant sentences jointly.


For the particular task of temporal summarization, we review some of the participating systems from last year's TREC-TS challenge. The best performing run in the challenge in 2014 was the “2APSal” run by team “cunlp” (Kedzie et al., 2014), who used affinity propagation clustering. Zhao et al. (2014) used a query expansion and information retrieval step with the Lemur toolkit1 and a k-means clustering (Zhang et al., 1996) and sentence selection step. McCreadie et al. (2014) used a pipeline to filter out irrelevant documents, to classify sentences according to their relevance, and to filter out redundant sentences to build a real-time summarization system.

The approach described in this paper combines the timely detection of a pipelining approach like McCreadie et al. (2014) with the redundancy avoidance strategy of a clustering approach like Zhao et al. (2014) by using sequential clustering.

3 Preprocessing

In this section, we describe two important preprocessing steps which are used to prepare the input documents for the temporal summarization system. Both preprocessing steps, boilerplate removal and word stemming, are well-known and crucial for the performance of the subsequent summarization. Since they are self-contained systems and independent from the subsequent summarization system, we did not investigate them in detail and expect further potential for improvement of the whole system if these steps are improved.

Documents retrieved from the web are usually very noisy since they do not only contain one article, but also advertisements, navigation elements, text snippets from other websites, author information, templates, pictures, links to related articles, etc. This noise was described in the introduction as within-document noise and is commonly called boilerplate text. Content which is unrelated to the target topic should never be included in the output of a summarization system. Therefore, the first preprocessing step removes the within-document noise. To tackle this problem, we utilize the software Boilerpipe (Kohlschütter et al., 2010), which uses shallow text features to remove unrelated text fragments. This preprocessing step is crucial for a good performance of the summarization system described below, since the system publishes frequent information without any knowledge of the actual query.

1 http://www.lemurproject.org

Since navigation elements and advertisements are very frequent on web sites, the system would otherwise include a lot of this unimportant content in the summary.

In a second step, we stem all words in the source documents with the well-known Porter stemmer (Porter, 1980). This step is assumed to help the system detect semantically similar but syntactically different words. Stemming will become important in Section 4.1 and Section 4.2, when we cluster similar sentences to measure their importance and to detect redundancy. More abstract semantic methods like word embeddings (Mikolov et al., 2013) are expected to model the semantic similarity of words better than stemming, since they can, for example, detect not only a similarity between the words crash and crashed, but also between crash and accident. Therefore, we assume that a better representation could further improve the performance of similarity and redundancy detection in the following summarization system.

4 Temporal Summarization with SeqCluSum

After explaining the preprocessing of the documents, we now describe the combination of the two building blocks, a sequential clustering algorithm and a contextual importance measurement.

The first building block of the system, a sequential clustering algorithm, is responsible for detecting redundant information by clustering similar sentences together. Since the second block will select at most one sentence per cluster, this prevents the system from publishing similar sentences (and therefore similar information) multiple times. The sequential clustering is described in Section 4.1.

To decide which sentences should be included in the summary, we apply in a second block a contextual importance measurement, which is further described in Section 4.2. This building block publishes sentences depending on the weight of the sentences, which itself depends on the content of the sentences as well as on already published sentences (hence contextual), and the weight of the previously generated clusters.

Both blocks are applied for each document, one after another, and therefore work closely together to jointly solve the problem of timely detection of important information and reduction of redundancy. For convenience, we give pointers to the pseudo-code of the system in Algorithm 1.


4.1 Sequential Clustering

The first building block of the temporal summarizer is a sequential clustering algorithm. We apply the Basic Sequential Algorithmic Scheme (BSAS) (Theodoridis and Koutroumbas, 2009), which is used to iterate over all (unpruned) sentences in the currently processed document (Algorithm 1, line 5). For all sentences which were not pruned in the boilerplate removal step, the algorithm searches for the nearest existing cluster by using a similarity measure (line 7). We describe the similarity measure below in more detail. If the similarity to all existing clusters is lower than a fixed threshold Θ and the maximum number of clusters I is not already reached, a new cluster is created and the sentence is added to the new cluster (line 9). Otherwise, the sentence is added to the nearest existing cluster (line 11).
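The clustering step described above can be sketched as follows. This is a minimal illustration of the BSAS scheme, not the authors' implementation; the function and parameter names (`similarity`, `theta`, `max_clusters`) are our own placeholders.

```python
def bsas_cluster(sentences, similarity, theta=0.5, max_clusters=1000):
    """Basic Sequential Algorithmic Scheme: assign each sentence to the
    nearest cluster, or open a new one if no cluster is similar enough
    and the cluster limit has not been reached."""
    clusters = []  # each cluster is a list; clusters[i][0] is its representative
    for s in sentences:
        best, best_sim = None, -1.0
        for c in clusters:
            # compare only with the first (representative) sentence of the cluster
            sim = similarity(s, c[0])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None or (best_sim < theta and len(clusters) < max_clusters):
            clusters.append([s])  # create a new cluster
        else:
            best.append(s)  # add to the nearest existing cluster
    return clusters
```

A simple word-overlap similarity is enough to see the behavior: near-duplicate sentences land in one cluster, unrelated ones open new clusters.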

The similarity measure calculates the similarity between a sentence si and a cluster Cj by calculating the similarity between the sentence si and the first sentence of the cluster, Cj1. By doing so, the sentence-to-cluster similarity measurement is reduced to a sentence-to-sentence similarity measurement. This has several advantages in comparison to considering all sentences in the cluster.

First, it emphasizes the notion of a cluster as a set of sentences which contains one distinct piece of information. In a scenario where the similarity function works perfectly (i.e. equals 1 if the sentence matches the information represented by the cluster and equals 0 otherwise), there is no need to compare it to the other sentences. The result would be the very same for each sentence. We choose the first sentence of each cluster to be the representative of the cluster since each first sentence has a special role. This special role derives from the fact that the first sentence of each cluster was the reason why the cluster had been created. The sentence did not fit into another cluster and therefore created its own, new cluster.

Second, the center of the cluster is fixed when we compare only with the first sentence of the cluster. This prevents the cluster from topic drift, which was a serious issue when we compared with more than one sentence. Adding more and more sentences and also using the newly added sentences for further distance calculations could instead change the initial notion of the cluster significantly, since the center of the cluster would move.

Third, the approach is computationally efficient in comparison to an approach where we would compute the similarity by incorporating all sentences in a particular cluster.

The actual similarity metric (line 7) is a linear combination of various well-known similarity measures. We use five different n-gram Jaccard measures, for n ∈ {1, 2, 3, 4, 5}. The n-gram Jaccard measure is defined in Equation 1, where Π_n(s) is the set of all n-grams in sentence s.

jaccard_n(s1, s2) = |Π_n(s1) ∩ Π_n(s2)| / |Π_n(s1) ∪ Π_n(s2)|   (1)

Furthermore, we use a longest common subsequence comparator. In addition to the string comparison methods, we apply different variations of cosine similarities based on inverse term frequency (ITF) vectors. To calculate the ITF values, we use a small sample of 2,182 documents as background corpus B. The ITF value of a token t

equals the logarithm of the quotient of the number of all tokens in the background corpus B, |B|, and the number of appearances of the token t in the background corpus B, |t ∈ B| (Equation 2). This value measures how common a token is in the background corpus and therefore how much information a token provides in general.

ITF(t) = log(|B| / |t ∈ B|)   (2)

All documents in the background corpus were retrieved from documents which were created before the event started, to ensure that we do not use information which was not available at the time when the event happened.
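Precomputing the ITF values of Equation 2 over a background corpus is a one-pass count; a minimal sketch, assuming the corpus is given as a flat token list (the function name is ours):

```python
import math
from collections import Counter

def itf_table(background_tokens):
    """ITF(t) = log(|B| / |t in B|) for every token in the background
    corpus (Equation 2). Tokens unseen in the corpus are simply absent
    from the returned table and must be handled by the caller."""
    total = len(background_tokens)          # |B|
    counts = Counter(background_tokens)     # |t in B| per token
    return {t: math.log(total / c) for t, c in counts.items()}
```

Frequent tokens ("the", "a") get ITF values near zero, while rare, content-bearing tokens get large values.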

We combine these different similarity measures into one similarity measure to cover different aspects of similarity. The overall similarity measure for two sentences s1 and s2 is a linear combination, as shown in Equation 3, where S is a set containing all previously described similarity measures sim_i.

sim(s1, s2) = (1 / |S|) · Σ_{sim_i ∈ S} sim_i(s1, s2)   (3)
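Since all weights in the linear combination are equal, Equation 3 is simply the mean of the individual measures; a sketch with the measure set passed in as plain functions:

```python
def combined_similarity(s1, s2, measures):
    """Equation 3: unweighted mean of the individual similarity
    measures. `measures` is a list of functions sim_i(s1, s2) -> [0, 1];
    which measures go into the list is up to the caller."""
    return sum(m(s1, s2) for m in measures) / len(measures)
```

This keeps the combination trivially extensible: adding a new measure only means appending another function to the list.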

Some values for different similarity measures are listed in Table 1. We show values of the 2-gram and 3-gram Jaccard measure, two different cosine similarity values, and the value of the longest common subsequence comparator. All measures are

Page 5: SeqCluSum - Combining Sequential Clustering and Contextual … · 2016-02-19 · SeqCluSum: Combining Sequential Clustering and Contextual Importance Measuring to Summarize Developing

Measure     s(s1, s2)   s(s1, s3)   s(s1, s4)
2-gram J    0.097       0.400       0.118
3-gram J    0.032       0.269       0.059
cosine 1    1.000       0.078       0.024
cosine 2    1.000       0.814       0.290
LCSC        0.613       0.821       0.490
mean        0.696       0.397       0.148

Table 1: Different similarity measures

implemented in the software DKPro Similarity2, described in Bär et al. (2013). In the example, sentence s2 is a permutation of sentence s1, sentence s3 was clustered into the same cluster as sentence s1, and sentence s4 was assigned to a different cluster than sentence s1. We see that, in contrast to the n-gram Jaccard measures, the cosine similarity is not affected by the ordering of the words. Therefore, the cosine similarity properly detects when two sentences use the same words, which signals that their content is similar. Nevertheless, incorporating the Jaccard measures gives additional insight into the similarity of two sentences, since the ordering of the words can be considered an even stronger signal for similarity.

4.2 Measuring Contextual Importance

After all unpruned sentences in a document have been clustered with the previously described sequential clustering algorithm, the system evaluates the current cluster landscape to detect important information in the document stream. To achieve this, all clusters are weighted using a weight measure based on TF*ITF values. The score of a cluster Ci equals the sum of the weights of all sentences Cij in the cluster (Equation 4).

weight(C_i) = Σ_{C_ij ∈ C_i} weight(C_ij)   (4)

Summing over all sentences in a cluster addresses the assumed property of the document stream that more important information is repeated more frequently in the source documents, since bigger clusters get higher weight scores.

The weight of a sentence si is defined as the sum of the weights of the tokens sij contained in sentence si (Equation 5).

2 https://dkpro.github.io/dkpro-similarity

weight(s_i) = Σ_{s_ij ∈ s_i} weight(s_ij)   (5)

In the following, we define one context-free and two contextual metrics to measure the weight of a single token. The context-free metric measures how important a token is. But since we actually want to know whether we should publish a sentence or not, we have to consider already published sentences as well as the importance of the sentence, because we do not want to publish an important sentence when a similar sentence has already been published. We assume that certain tokens are responsible for covering a particular information nugget. Therefore, the contextual token weights incorporate the already published tokens in the computation.

The context-free weight of a token, weight_cf, is computed based on the temporal TF*ITF-based measures of the token (Equation 6).

weight_cf(t, D_τ) = TF_cf(t, D_τ) · ITF(t)   (6)

The computation of the ITF value is described in Section 4.1. We use the ITF value not to measure whether a document is relevant to a topic (like the commonly used IDF value does), but to calculate whether a token is important in a stream of documents. Since we have no access to all documents in the stream, but only to the documents which were published until a given timestamp τ, we can only compute the temporal TF values for the set D_τ. We therefore measure how salient a token t is in the document stream D_τ up to the timestamp τ. To do so, we count the number of appearances of token t in the documents D_τ and divide this quantity by the total number of tokens in the documents, |D_τ|. This gives us an impression of how salient a token is in the document collection D_τ. Equation 7 formalizes the computation of the context-free TF values.

TF_cf(t, D_τ) = |t ∈ D_τ| / |D_τ|   (7)

Since the document collection D_τ changes after each time step τ, the temporal TF values are constantly updated when new documents are processed.
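Equations 6 and 7 can be maintained incrementally as documents arrive; a sketch in which the class and method names are ours, with a fixed ITF table from the background corpus and running counts over the stream seen so far:

```python
from collections import Counter

class TemporalWeights:
    """Context-free token weights (Equations 6 and 7): the TF part is
    computed over the stream D_tau seen so far and updated per document;
    ITF comes from a fixed background corpus."""
    def __init__(self, itf):
        self.itf = itf            # token -> ITF(t), precomputed
        self.counts = Counter()   # |t in D_tau| per token
        self.total = 0            # |D_tau|, total tokens seen so far

    def add_document(self, tokens):
        """Update the temporal counts with a newly processed document."""
        self.counts.update(tokens)
        self.total += len(tokens)

    def weight_cf(self, token):
        """weight_cf(t, D_tau) = TF_cf(t, D_tau) * ITF(t)."""
        tf = self.counts[token] / self.total if self.total else 0.0
        return tf * self.itf.get(token, 0.0)
```

Keeping only running counts respects the track rule that no information from the future may be used: a token's weight at time τ depends only on documents processed up to τ.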

The contextual weight of a token models theweight of a token with respect to already publishedtokens. We reduce the context-free TF values for


Algorithm 1 SeqCluSum
documents = ordered list of documents, I = maximum number of clusters

 1: clusters ← {}, updates ← {}
 2: for each document ∈ documents do
 3:     document ← pruneBoilerplate(document)
 4:     document ← stemWords(document)
 5:     for each sentence ∈ document do                  ▷ add new sentences to clusters
 6:         if sentence is not pruned then
 7:             nearestCluster ← argmax_{c ∈ clusters} similarity(sentence, c)
 8:             if i = 0 or (similarity(sentence, nearestCluster) < Θ and i < I) then
 9:                 clusters ← clusters ∪ {{sentence}}; i ← i + 1    ▷ create new cluster
10:             else
11:                 nearestCluster ← nearestCluster ∪ {sentence}     ▷ add to nearest cluster
12:             end if
13:         end if
14:     end for
15:     for each cluster ∈ clusters do                   ▷ evaluate clusters and generate updates
16:         if cluster is not published and weight(cluster) > µ then
17:             bestSentence ← argmax_{s ∈ cluster} weight(s)        ▷ find best sentence in cluster
18:             updates ← updates ∪ {(document.timestamp, bestSentence)}
19:             set cluster to published
20:         end if
21:     end for
22: end for
23: return updates

tokens which have already been published. By implementing such a reduction for already published tokens, we achieve a contextual importance grading of the tokens, which leads to an avoidance of redundancy.

We use two different strategies for the reduction of the TF_cf values. In the first contextual strategy, cs1, the values of all already published tokens are set to 0, which leads to a very strict and extensive avoidance of redundancy. In the second contextual strategy, cs2, we reduce the values by dividing the context-free value TF_cf(t, D_τ) by the number of appearances of token t in the already published updates. This leads to a more lax filtering, which is expected to increase recall at the cost of precision. The computation of the second contextual importance measure is described in Equation 8, where |t ∈ U| denotes the number of appearances of the token t in the current set of updates U.

TF_cs2(t, D_τ) = TF_cf(t, D_τ) / |t ∈ U|   (8)
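The two reduction strategies can be sketched in a single function; the names (`tf_contextual`, `published_counts`) are illustrative rather than taken from the system:

```python
def tf_contextual(token, tf_cf, published_counts, strategy="cs2"):
    """Contextual TF reduction for already published tokens.
    cs1 zeroes the value of any published token (strict);
    cs2 divides the context-free value by |t in U| (Equation 8, lax).
    `published_counts` maps token -> number of appearances in the
    already published updates U."""
    n = published_counts.get(token, 0)
    if n == 0:
        return tf_cf          # token not yet published: no reduction
    if strategy == "cs1":
        return 0.0            # strict: published tokens contribute nothing
    return tf_cf / n          # cs2: damped by publication count
```

Under cs2, a token's contribution shrinks each time it reappears in a published update, so clusters repeating old information gradually fall below the publication threshold instead of being cut off outright.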

A cluster is considered sufficiently important if a fixed threshold µ is exceeded (line 16). This threshold can easily be varied in a production environment to produce more or less verbose summaries. For the experiments, we chose two values by hand, without any sophisticated evaluation or optimization method. If a cluster exceeds the threshold, the best sentence according to the sentence score described above is published. The cluster is marked as published in this case. This means that the cluster will not be selected in the future to publish another sentence. Nevertheless, it is still possible to add new sentences to a published cluster. This is important because we will likely see more sentences which are similar to already published sentences.
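The cluster-evaluation step (Algorithm 1, lines 15 to 21) can be sketched as follows. The cluster representation as a dict and the function name are our own simplification, not the system's actual data structure:

```python
def publish_updates(clusters, weight, timestamp, mu=1.0):
    """Publish the best sentence of each sufficiently heavy,
    not-yet-published cluster. `clusters` is a list of dicts
    {"sentences": [...], "published": bool}; `weight` scores a single
    sentence; the cluster weight is the sum of its sentence weights."""
    updates = []
    for cluster in clusters:
        cluster_weight = sum(weight(s) for s in cluster["sentences"])
        if not cluster["published"] and cluster_weight > mu:
            best = max(cluster["sentences"], key=weight)  # argmax weight(s)
            updates.append((timestamp, best))
            cluster["published"] = True  # cluster stays open for new sentences
    return updates
```

Marking the cluster as published, rather than deleting it, is what lets later near-duplicate sentences still be absorbed silently instead of triggering a redundant update.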

5 TREC Temporal Summarization Track

We participated with the previously described system, SeqCluSum, in the TREC 2015 Temporal Summarization track. In the following section, we give detailed information about the challenge, describe the evaluation methodology, and discuss our performance in the task.


5.1 Introduction

In Section 1, we described the task of temporal summarization and outlined its importance and special properties in comparison to classical multi-document summarization. In this section, we describe the Temporal Summarization track (Guo et al., 2013) of the Text REtrieval Conference 2015 (TREC), or TREC-TS 2015 for short. The goal of this track is to develop systems which can detect useful, new, and timely sentence-length updates about a developing event.

The first major difference to classical summarization challenges like DUC3 and TAC4 is the notion of time during summarization. In TREC-TS, it is not only crucial to detect and extract important information, but to detect the important information as early as possible during the development of an event.

A second important difference compared to DUC is the evaluation methodology. In DUC, gold standard summaries were written by human experts, and the automatically generated summaries are compared to these summaries with the ROUGE (Lin, 2004) score. In contrast, summaries produced by the TREC-TS participant systems have to cover as many fixed information nuggets as possible. Whether a sentence contains an information nugget is evaluated by hand for each sentence. Furthermore, the track is, compared to DUC and TAC, solely focused on information retrieval and not on writing quality.

5.2 Setup

The setup of the TREC-TS track is as follows. There are 3 subtasks in the track. Each subtask consists of 21 events. The first two subtasks, Filtering and Summarization, deal with high-volume streams of news article documents and blog posts which do not have to be related to the given event. In this scenario, the participants have to filter out irrelevant documents to be able to summarize the rest of the stream. In the third subtask, Summarization only, the participants face low-volume streams of pre-filtered documents. The corresponding corpus for this subtask is called TREC-TS-2015F-RelOnly5. In this scenario, the information retrieval problem is considered to be solved on the document level. This means that all

3 http://duc.nist.gov
4 http://www.nist.gov/tac
5 http://dcs.gla.ac.uk/~richardm/TREC-TS-2015RelOnly.aws.list

documents in the stream contain relevant information. We participated with our system in the latter subtask.

All documents in a stream have an associated timestamp which equals the crawling time of the document. The documents have to be processed in temporal order. The result of the system is a list of updates. These updates are considered to be sentence-length update messages which could be published via a microblogging service like Twitter6 to keep an end-user up-to-date about the development of an event. When a system decides that a sentence from the stream should be published, this particular sentence is marked with the timestamp of the currently processed document. The systems are not allowed to use information from the future to make this decision. This means that incorporating information that becomes available only after the current timestamp is not allowed. One example of a disallowed usage would be the computation of TF*ITF values over the whole corpus, since this would provide information about which words will become frequent in the future.

5.3 Evaluation

The evaluation of the submissions was based on the Wikipedia article revision histories of the corresponding events. In a first step, information nuggets were extracted from the revision histories by the track organizers. The nuggets were marked with a timestamp and classified with an importance score of 1, 2, or 3, where 1 is the least and 3 the most important category. This timestamp equals the time when the information was included in the Wikipedia article for the first time. Therefore, it represents the time when the information could be considered to be publicly available. An example of such a nugget is VM26.002-1358344449-3-'On 16 January 2013,at around 0800 GMT', where VM26.002 is the nugget id, 1358344449 is the timestamp7 when the information was published in Wikipedia, 3 is the importance of the nugget, and 'On 16 January 2013,at around 0800 GMT' is the actual information encoded by this nugget.
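The nugget format above (id-timestamp-importance-'text') can be split mechanically; the following parser is a sketch inferred from the single example in the text, so the exact serialization may differ in the real track data.

```python
import re
from datetime import datetime, timezone

def parse_nugget(line):
    """Parse a nugget string of the form id-timestamp-importance-'text'.
    The format is inferred from the example VM26.002-1358344449-3-'...';
    treat field order and delimiters as an assumption."""
    m = re.match(r"^(?P<id>[^-]+)-(?P<ts>\d+)-(?P<imp>[123])-'(?P<text>.*)'$", line)
    if m is None:
        raise ValueError("unrecognized nugget format: " + line)
    return {
        "id": m.group("id"),
        # Unix epoch seconds -> aware UTC datetime
        "time": datetime.fromtimestamp(int(m.group("ts")), tz=timezone.utc),
        "importance": int(m.group("imp")),
        "text": m.group("text"),
    }
```

Converting the epoch timestamp confirms the example: 1358344449 falls on 16 January 2013 (UTC), consistent with the nugget's own text.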

In a second step, sentences were pooled from each submission. For each submission, at most 60 sentences were pooled due to the huge annotation effort. These sentences were manually

6 https://twitter.com
7 The timestamps equal the number of seconds since January 1, 1970.


matched against the extracted information nuggets, since there is no system available which is able to find these matchings precisely. A sentence could thereby match multiple nuggets as well as no nugget. A sentence which matches relevant information nuggets that were not matched by sentences published earlier by a submission was considered to be a valuable sentence. If a sentence matches only already matched information nuggets or no nuggets at all, it was considered to be worthless. Since such sentences do not improve the information content of the summary, the system got penalty points for being too verbose.

For the evaluation, the precision of the generated summaries was evaluated with the (normalized) Expected Gain metric nEG(S). This metric measures to which degree the sentences in the summary are on-topic and novel. The recall was measured by the Comprehensiveness metric C(S), which measures how many of the information nuggets that could have been retrieved were covered by the summary. A third measure, the Expected Latency E[Latency], measures to which degree the information in the summary was outdated. A combined F-measure H, based on precision and recall and also incorporating the latency, was used as the official target measure.
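The official track definitions involve latency discounting and normalization by a verbosity-aware upper bound, which are omitted here. The following deliberately simplified sketch captures only the core structure of the three scores and their combination; the function names and the harmonic-mean combination for H are assumptions based on the description above, not the official formulas.

```python
def expected_gain(matched_importance_per_update):
    """Precision-like score: average nugget importance gained per published
    update. Updates matching no new nugget contribute 0 and thus act as
    the verbosity penalty described above."""
    updates = matched_importance_per_update
    return sum(updates) / len(updates) if updates else 0.0

def comprehensiveness(gained, total_available):
    """Recall-like score: fraction of the total available nugget
    importance covered by the summary."""
    return gained / total_available if total_available else 0.0

def h_measure(precision_like, recall_like):
    """Harmonic combination of the two scores (latency discounting omitted)."""
    s = precision_like + recall_like
    return 2 * precision_like * recall_like / s if s else 0.0
```

For example, a summary of three updates matching nuggets of importance 3, 0, and 1 would have an (unnormalized) expected gain of 4/3.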

5.4 Results

We submitted four runs for evaluation at the TREC-TS shared task. Runs 1 and 2 used the strict contextual strategy cs1; runs 3 and 4 used the more lenient contextual strategy cs2. Runs 1, 2, and 4 had a maximum number of clusters of 1,000, whereas run 3 had a maximum of 100. The minimum cluster score, which a cluster had to reach before being published, was set to 1 for runs 1, 3, and 4; for run 2, this value was set to 3. Furthermore, we slightly modified the boilerplate removal for runs 3 and 4 to prune fewer sentences during preprocessing. The similarity function was the same for all four runs. These changes led to differently verbose runs: run 1 was the least verbose, followed by run 2, while runs 3 and 4 published the most updates due to the more lenient contextual strategy.

Unfortunately, we introduced errors in the result files which resulted in invalid sentence identifiers. Therefore, our sentences could not be included in the pool of sentences which were matched against the information nuggets extracted from Wikipedia

as described in Section 5.3. Since other systems selected at least some of the sentences which we had selected, we were at least able to get some points for the overlapping sentences. Since we did not get points for sentences which were only selected by our system, this score represents only a lower bound for the true performance of the system. We report the percentage of unevaluated sentences in the column unevaluated. The sentences which were not evaluated were considered irrelevant or redundant without being inspected. As an example, for run 1 only 41.59% of the top 60 updates were also published by other systems, and therefore only this share was evaluated. The remaining 58.41% were considered to be irrelevant or redundant, and we were penalized for them.

Nevertheless, run 1 and run 2 achieved considerably better results than the best other participating system. Run 3 and run 4 performed better than the second best system.

Table 2 shows the described lower bounds for our runs, a comparison with the three best other systems, and the average grading over all runs, sorted by the main target metric H. Additionally, we provide the percentage of unevaluated sentences to make the results easier to assess. The evaluation results of the other systems are taken from the overview paper of the TREC-TS challenge.

6 Conclusions

We presented SeqCluSum, a system for temporal summarization based on sequential clustering and contextual importance measurement. The sequential clustering groups similar sentences together to detect redundancy. The contextual importance measure weights the clusters according to their importance. During this step, we incorporate the knowledge about previously published sentences to avoid redundancy.

With this system, we participated in the Summarization Only subtask of the Temporal Summarization track at the Twenty-Fourth Text REtrieval Conference. Unfortunately, we introduced errors in our submission and can therefore only report lower bounds on the performance of our runs. Nevertheless, these lower bounds already show a better performance than the best run of any fellow competitor. Therefore, we conclude that our approach is well suited to handle the special difficulties of this challenge.


TeamID         RunID       H       unevaluated  nEG(S)  C(S)    E[Latency]
AIPHES         Run1        0.1083  58.41%       0.1576  0.2854  0.4776
AIPHES         Run2        0.1016  59.97%       0.1260  0.2982  0.4607
WaterlooClark  UWCTSRun4   0.0853  -            0.1840  0.1710  0.3983
AIPHES         Run4        0.0781  63.34%       0.0897  0.4005  0.5679
AIPHES         Run3        0.0722  62.64%       0.0905  0.4164  0.5061
BJUT           DMSL2N2     0.0649  -            0.0645  0.6557  0.5606
uogTr          uogTrhEQR2  0.0639  -            0.0667  0.5459  0.5335
Mean           -           0.0472  -            0.0595  0.5627  0.5524

Table 2: Lower bound evaluation values for AIPHES runs 1-4, the 3 best non-AIPHES runs, and average scores of all participating runs, sorted in descending order by H, the main metric of the challenge. The columns nEG(S), C(S), and E[Latency] show the results according to the official evaluation metrics described in Section 5.3. The column unevaluated shows the percentage of sentences that were not evaluated.

7 Future Work

In the following, we propose several possibilities to further enhance the system in order to improve its performance, including subsystems which were not considered due to time limitations.

The sequential clustering depends strongly on the accuracy of the used similarity measure. Since we used predefined standard similarity measures without any fine-tuning or adaptation to the task, we assume that a more sophisticated selection of similarity measures could improve the performance.

As described in Section 3, a semantic representation of the text could also help to detect similarities between sentences more accurately.

Another improvement of the similarity measure would be to also utilize the token weights, which are used during the contextual measuring of importance, in the clustering. If we weight important tokens higher in the similarity metrics, the clusters would be more focused on a particular information nugget. For example, the token 15 should be more influential to the clustering when 15 people were injured in an event, whereas the token the should not have a high influence on the similarity of sentences. Generally speaking, a sentence A = (a, x, a, a, a) should be more similar to the sentence B = (b, b, b, x, b) than to sentence C = (a, a, a, a, a) when x is an important token in the current topic and a and b are two unimportant tokens, although the tokens a and b are more frequent in the sentences than the token x.
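The A/B/C example can be reproduced with an importance-weighted cosine similarity over token counts. The weight value for x below is a hypothetical choice for this illustration, not a weight produced by the system.

```python
import math

def weighted_cosine(s1, s2, weight):
    """Cosine similarity over token counts, where each occurrence of a
    token contributes its importance weight (default 1.0)."""
    def vec(tokens):
        v = {}
        for t in tokens:
            v[t] = v.get(t, 0.0) + weight.get(t, 1.0)
        return v
    v1, v2 = vec(s1), vec(s2)
    dot = sum(v1[t] * v2.get(t, 0.0) for t in v1)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(v1) * norm(v2))

A = ["a", "x", "a", "a", "a"]
B = ["b", "b", "b", "x", "b"]
C = ["a", "a", "a", "a", "a"]
w = {"x": 10.0}  # hypothetical: x is important in the current topic
```

With the weight on x, A is closer to B (shared important token) than to C (shared unimportant token); with uniform weights, the frequent token a dominates and A is closer to C, which is exactly the undesired behavior described above.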

The system described in this paper relies on the assumption that important information is mentioned frequently. This is particularly problematic for temporal summarization, because it takes some time (or documents) to detect that a word is unusually frequent in the corpus. For example, at the first mention of the word bomb, the word is not considered to be important for the corpus because it has occurred only once. Only when more documents containing the token have been processed is the system able to detect that this token is important. If we leveraged the fact that a mention of bomb is almost always important when summarizing in the domain of terrorist attacks, we would be able to detect the importance faster. For bomb this improvement can be achieved rather easily, because when a bomb explodes, this is nearly always important. But the target of the bomb will also be very important, and it is nearly impossible to know beforehand whether the word cab, Quetta, or main station will be important.
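One simple way to realize this idea is to combine the weight learned from the stream so far with a domain-specific prior lexicon. Both dictionaries, the max-combination, and all names below are assumptions made for this illustration.

```python
def token_importance(token, stream_weight, domain_prior):
    """Combine the importance learned from the stream so far with a
    domain-specific prior, so that terms like 'bomb' are recognized as
    important at their very first mention."""
    return max(stream_weight.get(token, 0.0), domain_prior.get(token, 0.0))

prior = {"bomb": 3.0, "explosion": 3.0}  # hypothetical terrorist-attack lexicon
stream = {}                              # nothing learned yet at the first document
```

At the first document, the prior already assigns bomb a high importance, while event-specific terms such as cab or Quetta still have to be learned from the stream.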

Furthermore, it should be possible to add a sentence to multiple clusters, because a sentence can contain multiple information nuggets. The general idea of the clustering is that each particular information nugget is represented by exactly one cluster. In the current system, however, a sentence can only be part of one cluster.
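The proposed multi-cluster assignment can be sketched as follows. This is not the implemented single-cluster behavior; the threshold, the toy Jaccard similarity, and all names are assumptions for this example.

```python
def assign_to_clusters(sentence, clusters, sim, threshold):
    """Add a sentence to every cluster that is similar enough, instead of
    only the single closest one, so that a sentence carrying several
    information nuggets can contribute to several clusters."""
    matched = [c for c in clusters if sim(sentence, c) >= threshold]
    if matched:
        for cluster in matched:
            cluster.append(sentence)
    else:
        clusters.append([sentence])  # no fit: seed a new cluster
    return len(matched)

# Toy similarity: best Jaccard overlap with any sentence in the cluster.
def jaccard_sim(sentence, cluster):
    return max(len(sentence & m) / len(sentence | m) for m in cluster)

clusters = [[{"bomb", "exploded"}], [{"15", "people", "injured"}]]
n = assign_to_clusters({"bomb", "injured", "15", "people"}, clusters, jaccard_sim, 0.2)
```

Here the new sentence touches two nuggets (the explosion and the casualties) and is therefore added to both clusters.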

Another improvement concerning the clustering could be a re-clustering step, which could split a cluster into multiple clusters when the system detects that one cluster contains too diverse sentences. This is currently prevented by the parameter Θ. By enabling the system to execute a re-clustering step, this parameter could become unnecessary. The same holds for the upper bound on the number of clusters. Furthermore, the system could merge multiple clusters into one cluster if it detects that the contents of the clusters are more similar than the seed sentences suggested. This situation can easily occur when synonyms are used.
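The proposed merge step could be realized greedily over cluster centroids. In this sketch, sentences are modeled as token sets and a cluster's centroid as the union of its sentences; both choices, the threshold, and all names are simplifying assumptions for illustration.

```python
def merge_similar_clusters(clusters, threshold):
    """Greedily merge clusters whose contents have become more similar
    than the threshold, e.g. because synonyms pulled two clusters onto
    the same information nugget."""
    def centroid(cluster):
        return set().union(*cluster)

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    merged = []
    for cluster in clusters:
        for target in merged:
            if jaccard(centroid(target), centroid(cluster)) >= threshold:
                target.extend(cluster)  # fold the cluster into the match
                break
        else:
            merged.append(list(cluster))
    return merged

# Two clusters seeded by the synonyms "car" and "vehicle" end up merged.
clusters = [[{"car", "bomb"}], [{"vehicle", "bomb"}], [{"flood"}]]
result = merge_similar_clusters(clusters, 0.3)
```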

Acknowledgments

This work has been supported by the German Research Foundation (DFG) as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) under grant No. GRK 1994/1.
