
STAPLE: Spatio-Temporal Precursor Learning for Event Forecasting

Yue Ning∗ Rongrong Tao∗ Chandan K. Reddy∗

Huzefa Rangwala† James C. Starz‡ Naren Ramakrishnan∗

Abstract

Large-scale societal events such as civil unrest movements occur due to a variety of factors including economics, politics, and security. Societal event detection can be modeled as a system of inter-connected locations, where each location is recording a set of time-dependent observations. In order to detect event occurrence and automatically reconstruct the precursors and signals, it is essential to model relationships between the different locations w.r.t. how events evolve over time. However, existing methods for precursor discovery do not capture or exploit spatial and temporal correlations inherent in event occurrences. The absence of such modeling not only creates shortcomings in the quality of inference but also curtails interpretation by human analysts. Furthermore, forecasting is inhibited when training data is sparse. In this paper, we develop a novel multi-task model with dynamic graph constraints within a multi-instance learning framework. Our model tackles the problem of scarce data distribution and reinforces co-occurring location-specific precursors with augmented representations. Through studies on civil unrest movements in numerous countries, we demonstrate the effectiveness of the proposed method for precursor discovery and event forecasting.

Keywords: Multi-task learning; spatio-temporal precursor learning; event correlation.

1 Introduction

While studying large-scale societal events, policy makers and professionals aim to reconstruct precursors to such events to help understand causative attributes. Given a document reporting an event of interest (e.g., a civil protest), precursors are other documents published earlier than reported incidents or happenings and can be viewed as influential in the lead up to the protest. Such analysis is typically done painstakingly with the aid of subject matter experts, but new algorithmic

∗Discovery Analytics Center, Computer Science department, Virginia Tech. {yning, rrtao, ckreddy, naren}@vt.edu

†Computer Science department, George Mason University. [email protected]

‡Lockheed Martin ATL. [email protected]

Jan. 20, 2015: The National Assembly and the Senate Foreign Relations Committee had passed resolutions condemning the publication of the French magazine.

Jan. 23, 2015: Jamaat-e-Islami (JI) Ameer Sirajul Haq on Friday asked the French government and the magazine to apologize.

Jan. 24, 2015: Jamaatud Dawa (JuD) chief Hafiz Muhammad Saeed said the Muslims will launch a global movement against it.

Jan. 27, 2015 (Bannu): Hundreds of students protested against a French magazine and stormed a school demanding it close.

Figure 1: Precursor event sequence discovered by our method for a protest event.

tools [13] have recently emerged that support such precursor discovery. A key challenge that persists is the incorporation of spatial and temporal correlations inherent in large-scale societal event occurrences. For example, civil unrest (protests or strikes) in a city is often influenced by happenings at nearby locations, and while event counts might not be comparable, there is significant temporal and spatial correlation across event occurrences.

Figure 1 shows a precursor event sequence discovered by our proposed model. The target event is a student protest against a French satirical magazine in the city of Bannu in Pakistan. One week before this event, there were several highly related events that occurred in other cities and may have influenced this target event of interest. For instance, the government of Pakistan passed resolutions condemning the publication in this French magazine. Later, different groups expressed concern in small gatherings in Islamabad and Lahore, followed by protests in multiple cities including Bannu.

In this paper, we propose STAPLE, a multi-task Spatio-TemporAl Precursor Learning and Event forecasting framework for multiple cities, specifically designed to discover precursors across geolocations with imbalanced class distributions and partial labels. The primary datasets of interest are open source news articles across the world encoded into events, where each event has vital information including a timestamp (in granularity of days), a geolocation (at the city level),

Copyright © 2018 by SIAM

Unauthorized reproduction of this article is prohibited

a description (plain text), and an event type (e.g., protests). Our objective is to build forecasting models for specific cities and to identify evidential precursors from multiple cities in the past news articles.

This problem is non-trivial and poses several unique challenges: (i) Temporal ordering constraints on events. Events are often carefully sequenced in terms of their precursors, and ignoring the temporal information that is inherent in event evolution leads to unsatisfactory results. (ii) Lack of class labels for precursor documents. While events of interest can be manually (or automatically) detected and classified, labels for associated precursor documents (which are larger in number) are not available and are expensive to obtain. (iii) Data scarcity and imbalanced distribution in certain geolocations. Although a few transfer learning algorithms [19, 16, 20] support inference of the type considered here, none of them can tackle the data insufficiency problem in the presence of spatio-temporal correlations. (iv) Inadequacy of static features. It has been demonstrated [22] that successful event forecasting requires moving beyond static features, e.g., combining keyword frequencies with dynamic graph features. Thus the proposed forecasting models must support learning of appropriate representations inherently within their model-building.

We observe that by taking advantage of spatio-temporal event correlations within a multi-task learning framework, about 86% of the cities in our dataset have improved F1 scores compared to the best state-of-the-art algorithm. 60% of cities improve their F1 score by more than 20%, especially cities with less training data. We summarize the key contributions of this paper as follows:

• Dynamic graph constraints for precursor learning and event forecasting: Our model exploits event count correlations across multiple locations under dynamic temporal constraints for jointly forecasting events and identifying precursors. A fusion penalty is proposed to coordinate the forecasting tasks in related cities.

• Augmented representation learning for precursors: By integrating document and entity embeddings within a multiple instance learning framework, it assists the prediction model to track significant entities that are evident based on their estimated probabilities.

• Multi-task learning for precursor mining: It alleviates the data insufficiency problem by simultaneously learning multiple related tasks and restricting all cities to share a common set of features with a consensus model.

[Figure 2 graphic: news-volume trend charts (# News per day, Aug 03 to Sep 07) annotated with protest dates, shown together with the components Multi-Instance Learning, Spatio-Temporal Correlation Graph, and Augmented Entity Embeddings (Org, Person, Location) and the target Event.]

Figure 2: Overall System Framework.

• Comprehensive set of experiments on real-world data: We evaluate the proposed model on real-world datasets collected from six countries and more than one hundred cities. We conduct quantitative and qualitative analyses on the precursors inferred by the proposed model.

2 Problem Statement

Given multiple cities (or geolocations) within a country, we focus on the problem of predicting the occurrence of a future protest within a target city using captured open source news feeds. Figure 2 provides an overview of our proposed approach. Specifically, we hypothesize that the correlations between events occurring across “space” and “time” lead us to effectively forecast future events of interest. Besides forecasting the protest, we also aim to identify specific news articles as precursors across different locations for manual inspection. Our proposed STAPLE model seeks to capture these correlations by jointly training the models across different cities within a nested multiple instance learning framework [13].

Formally, given a set of cities $K$, each city $k$ has a set of associated news articles and event indicators ($i = 1, \ldots, N_k$), denoted by $(X^k, Y^k)$:

$$Y_i^k = \begin{cases} 1 & \text{if an event occurs after } X_i^k \\ 0 & \text{otherwise} \end{cases} \tag{2.1}$$

The news articles published $H$ days before an indicator are given by $X_i^k$ (a bag). The proposed prediction model seeks to estimate the probability of occurrence of a future event in city $k$, $P_i^k$, given $X_i^k$. Since instance-level labels are not provided, simple multi-instance learning (MIL) structures or complicated layers of MIL can both be applied in this problem. Given the reported performance [13], we use the nested multiple instance learning approach that allows for the transfer of class labels (target events) to individual news articles. It provides probabilistic estimates per document, per day, and per event for each city. The key contribution of this work is to formalize a multi-task precursor learning model to study dynamic temporal relationships of multiple geolocations. Also, we derive an optimization method to solve this problem.


3 Methodology Design

3.1 The Proposed Model Given $K$ cities, let $\Theta = (\theta^1, \ldots, \theta^k, \ldots, \theta^K)$ be the model parameters to be learned for each of the cities, where $\theta^k \in \mathbb{R}^m$ and $m$ is the dimension of the feature space representing each document. We model the document-level probability estimate $p_{hj}$ that a news article $j$ published on day $h$ in city $k$ is associated with the event of interest using a logistic function, $p_{hj} = \sigma(x_{hj}^T \theta^k)$, where $\sigma(a) = 1/(1 + e^{-a})$. Document embeddings ($x_{hj}$) are described in detail in Section 3.3. A specific document is considered to be more related to the event of interest when the probability estimate is high. Using the learned $\theta^k$ for a city, the nested multiple instance learning framework provides probabilistic estimates by averaging at the day level (intermediate level) and at the event level, which considers a set of consecutive days before the target event.
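As a minimal sketch of the two-level averaging described above, the following Python snippet scores each document with the logistic function, averages the scores within each day, and then averages the day-level estimates into one event-level estimate. The function names, the toy vectors, and the use of plain (unweighted) means at both levels are assumptions made for illustration, not the authors' implementation.

```python
import math

def sigmoid(a: float) -> float:
    """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def doc_probability(x, theta):
    """p_hj = sigma(x_hj^T theta^k) for one document embedding x."""
    return sigmoid(sum(xi * ti for xi, ti in zip(x, theta)))

def nested_mil_estimate(super_bag, theta):
    """super_bag: list of days, each day a list of document vectors.

    Returns (event_prob, day_probs, doc_probs). Assumes at least one
    day contains at least one document; empty days are skipped.
    """
    doc_probs = [[doc_probability(x, theta) for x in day] for day in super_bag]
    day_probs = [sum(p) / len(p) for p in doc_probs if p]
    event_prob = sum(day_probs) / len(day_probs)
    return event_prob, day_probs, doc_probs

# Hypothetical 2-dimensional example: two days of news for one city.
theta_k = [0.5, -0.25]
bag = [
    [[1.0, 0.0], [0.0, 1.0]],  # day 1: two documents
    [[2.0, 2.0]],              # day 2: one document
]
event_prob, day_probs, doc_probs = nested_mil_estimate(bag, theta_k)
```

The per-document values in `doc_probs` are the quantities that are later compared against a threshold to select precursor candidates.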

The STAPLE model seeks to jointly learn the different model vectors, $\Theta$, across the $K$ cities using the multi-task learning paradigm, given by:

$$\min_{\Theta} \sum_{k \in K} \frac{N_k}{N} L(\theta^k) + \lambda R(\Theta) \tag{3.2}$$

where $N_k$ is the number of training examples available for city $k$ and $N$ is the total number of training examples across all the cities. $L$ is the loss function and $R$ is the regularization function. The two-level multiple instance loss function is designed to minimize the prediction error associated with forecasting events, to enforce consistency in probabilistic estimates obtained for consecutive days, and also to maximize the difference between positive and negative instances using an unsupervised hinge loss function. The regularization term explicitly captures the spatio-temporal correlations between the occurrence of events across different cities. A more specific form of the objective function corresponding to Eq. (3.2) is given below:

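$$\min_{\Theta} \sum_{k \in K} \Big( \frac{N_k}{N} L(\theta^k) + \frac{\lambda_1}{2} \sum_{i}^{N_k} \sum_{l \in G_{t_i}} \alpha_{k,l}^{t_i} (\theta^k - \theta^l)^2 + \frac{\lambda_2}{2} \|\theta - \theta^k\|_2^2 + \frac{\lambda_3}{2} \|\theta^k\|_2^2 \Big) \tag{3.3}$$

where $k, l$ are the indices for cities, $\theta^k$ is the model parameter for city $k$, $\theta$ represents the global model, $t_i$ is the time index for the current event indicator, $\alpha_{k,l}^{t_i}$ is the weight between city $k$ and city $l$ at time $t_i$, and $\lambda_1, \lambda_2, \lambda_3$ are hyperparameters. Different types of penalty functions allow us to enforce different behaviors in the evolution of the event across multiple geolocations.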
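To make the three penalty terms of Eq. (3.3) concrete, here is a small Python sketch that evaluates them for a single city and a single event indicator. It reads $(\theta^k - \theta^l)^2$ as the squared Euclidean distance between the two parameter vectors, which is an interpretation; the function and variable names are hypothetical, and the loss term $L(\theta^k)$ is deliberately left out.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def city_penalty(k, thetas, theta_global, alpha_k, lam1, lam2, lam3):
    """Regularization terms of Eq. (3.3) for city k and one indicator i.

    thetas:       dict mapping city -> parameter vector theta^k
    theta_global: consensus (global) parameter vector theta
    alpha_k:      dict mapping neighbor city l -> edge weight alpha_{k,l}
    """
    theta_k = thetas[k]
    # Fusion penalty: pulls theta^k toward its graph neighbors.
    fusion = sum(w * sq_dist(theta_k, thetas[l]) for l, w in alpha_k.items())
    # Consensus penalty: keeps theta^k close to the global model.
    consensus = sq_dist(theta_global, theta_k)
    # l2-norm penalty on the city's own weights.
    ridge = sum(t * t for t in theta_k)
    return 0.5 * lam1 * fusion + 0.5 * lam2 * consensus + 0.5 * lam3 * ridge

# Hypothetical two-city example.
thetas = {"A": [1.0, 0.0], "B": [0.0, 0.0]}
penalty = city_penalty("A", thetas, [0.5, 0.0], {"B": 2.0}, 1.0, 1.0, 1.0)
```

A larger edge weight makes disagreement between two cities' model vectors more expensive, which is how correlated cities end up with similar models.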

Within the MTL framework, the regularization term is designed to enforce different penalties such that related tasks share similar features or model parameters. The STAPLE model explicitly enforces that pairs of cities

[Figure 3 graphic: correlation graphs over cities A, B, C, and D for the windows Mar. 20-23, Mar. 24-27, and Mar. 28-31, 2015, with edges labeled by event type (Type 1, Type 2, Type 3) and weight, preceding a protest on April 1, 2015.]

Figure 3: An example of the spatio-temporal correlation graph. Each node (A, B, C, D) represents a city. An edge between two nodes indicates that the same type of events occurred in the same time window for these two cities.

within a country that have seen similar events occur in the past learn similar model vectors. $G_{t_i}$ is the correlation graph for time $t_i$ that determines the tasks (cities) with similar event profiles to each other.

3.2 Event Correlation Graph We demonstrate the concept of the spatio-temporal correlation graph in Figure 3. Each node represents a city in a country. To predict the occurrence of a protest in city A on April 1, 2015, the model analyzes the past few days of data to discover if there is an event in city A and also in other cities (i.e., cities B, C, and D) from Mar. 20 to Mar. 31, 2015. The weight on the edge denotes the minimum number of common events between the two cities. From Mar. 20-23, the neighbor network for city A consists of three cities B, C, D with two types of events. $\alpha_{k,l}^{t_i}$ is the normalized weight on the edge between city $k$ and city $l$, given by:

$$\alpha_{k,l}^{t_i} = \Big( \sum_{c} \sum_{t = t_i - H}^{t_i} \min\big(E_t^k(c), E_t^l(c)\big) \Big)' + \Big( \frac{1}{\mathrm{dist}(k, l)} \Big)' \tag{3.4}$$

Here $c$ is the event type (such as a protest), and $E_t^k(c)$ is the number of events of type $c$ in city $k$ that occur within time window $t$. We scale the event count and $1/\mathrm{dist}()$ with the feature scaling function $(x)' = (x - x_{\min})/(x_{\max} - x_{\min})$ into a range of (0.0, 1.0). Given the spatio-temporal correlation graph $G$ and the edge weights, it is reasonable to assume that two cities will share several common edges, as they tend to be influenced by the same set of covariates. When a city has a large number of training examples, the empirical loss helps to reduce the prediction error. The spatio-temporal correlation constraint captures the task relatedness between multiple cities within a given time period. The dist function returns the distance



[Figure 4 graphic: paragraph-vector output for the document text, and Org, Person, and Loc entities extracted by NER.]

Figure 4: Learning representations. The top portion corresponds to the document embeddings and the bottom portion corresponds to the entity embeddings.

between city $k$ and city $l$. We assume that two cities that are far away from each other have fewer correlations/similarities in their models.
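The edge weight of Eq. (3.4) can be sketched in a few lines of Python. The version below computes, for one target city, the common-event count and the inverse distance to every other city, min-max scales each quantity, and adds the two. Scaling over the set of candidate neighbors of the target city is an assumption (the text does not say over which population the minimum and maximum are taken), and the data layout and names are hypothetical.

```python
def min_max_scale(values):
    """(x)' = (x - x_min) / (x_max - x_min); all zeros if values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def common_event_count(counts_k, counts_l):
    """Sum over (window t, type c) of min(E_t^k(c), E_t^l(c)).

    counts_*: dict mapping (t, c) -> number of events of type c in window t.
    Keys missing from either city contribute min(x, 0) = 0.
    """
    shared = set(counts_k) & set(counts_l)
    return sum(min(counts_k[key], counts_l[key]) for key in shared)

def edge_weights(k, counts, dist):
    """Return {l: alpha_{k,l}} for every other city l.

    counts: dict city -> {(t, c): count} restricted to the past H days
    dist:   dict (k, l) -> positive distance between the two cities
    """
    neighbors = [l for l in counts if l != k]
    common = [common_event_count(counts[k], counts[l]) for l in neighbors]
    inv_dist = [1.0 / dist[(k, l)] for l in neighbors]
    scaled = zip(neighbors, min_max_scale(common), min_max_scale(inv_dist))
    return {l: c + d for l, c, d in scaled}

# Hypothetical event counts for four cities over two windows.
counts = {
    "A": {(0, "protest"): 2, (1, "strike"): 1},
    "B": {(0, "protest"): 1, (1, "strike"): 3},
    "C": {(0, "protest"): 3},
    "D": {},
}
dist = {("A", "B"): 10.0, ("A", "C"): 20.0, ("A", "D"): 40.0}
alpha = edge_weights("A", counts, dist)
```

In this toy setting city B is both the closest to A and shares the most events with it, so it receives the largest weight, while D shares no events and is farthest away.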

Besides the spatio-temporal constraints, we also enforce the learned individual model vectors to not deviate from the global average, along with an l2-norm constraint on the weight vectors. The regularization parameters control the model complexity by enhancing robustness, and the MTL constraints alleviate the data insufficiency problem for each individual task (if learned separately).

3.3 Learning Representations One of the challenges in precursor mining for event forecasting is to select informative and related documents. A precursor event is not necessarily similar in semantics to the event of interest. In this work, we make use of augmented distributed representations of the documents to discover progression of precursors towards the target events. More specifically, we study the following aspects of representations for each historical news document $x$:

Document Embeddings: Recent work has shown that the semantic relationships of words can be effectively captured using the geometry of a continuous embedding space [8, 12]. For the articles in each country, we learn distributed representations (vectors) [8] for text documents and utilize these embeddings in our experiments.

Entity Embeddings: In many societal events, the roles of significant entities such as government officers or large organizations are substantial and sometimes even influence the progression of events. We focus on location names, person names, and organization names as the primary entities of interest. Entity and relation embeddings have been studied in structured learning and knowledge graph modeling [10]. More specifically, we use the Stanford Named Entity Recognizer (NER) [5] to extract a set of entities for English and use a series of language enrichment steps (see [14] for details) for processing. We apply the Continuous Bag-of-Words model (CBOW) [11] to each dataset to obtain word level

Algorithm 3.1 STAPLE Model Learning

 1: Input: Dataset (X^k, Y^k), k ∈ [1, ..., K], τ
 2: Output: Θ = [θ, θ^k]
 3: for τ iterations do
 4:     randomize(X^k, Y^k)
 5:     for city k in K do
 6:         for i ∈ [1, 2, ..., N_k] do
 7:             ▷ training examples for city k
 8:             construct graph G_{t_i}
 9:             calculate α^{t_i}_{k,l}    ▷ Eq. (3.4)
10:             calculate gradient ∇(Θ)    ▷ Eq. (3.5, 3.6)
11:             update Θ using ∇(Θ) based on SGD
    return Θ

embeddings. Finally, we aggregate the embeddings of entities into the representation for text documents as shown in Figure 4.

3.4 Optimization The optimization problem is solved using the mini-batch gradient descent algorithm for each of the model parameters, $\theta^k$, as described in Algorithm 3.1. We update the weight vector using an adaptive learning rate, $\theta^k_{\tau+1} = \theta^k_{\tau} - \eta \nabla(\theta^k)$, where $\eta$ is the learning rate at the current iteration. More specifically,

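$$\theta^k \leftarrow \theta^k - \eta \Big[ \frac{\partial L(\theta^k)}{\partial \theta^k} + \lambda_1 \sum_{l \in G_{t_i}} \alpha_{k,l}^{t_i} (\theta^k - \theta^l) - \lambda_2 (\theta - \theta^k) + \lambda_3 \theta^k \Big] \tag{3.5}$$

$$\theta \leftarrow \theta - \eta \lambda_2 (\theta - \theta^k) \tag{3.6}$$

For each city, we update the global model parameters $\theta$ and $\theta^k$ in an alternating manner. The spatio-temporal correlation graph and the weights on its edges are precomputed using only the training set.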
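The two update rules can be written out as a single step in Python, shown below as a rough sketch. The gradient of the multi-instance loss is passed in as a precomputed vector (`grad_loss`), since its exact form is not reproduced here. Using the pre-update value of $\theta^k$ in the global update is an assumption about what "alternating" means, and all names are hypothetical.

```python
def sgd_step(k, thetas, theta_global, alpha_k, grad_loss, eta, lam1, lam2, lam3):
    """One update of theta^k (Eq. 3.5) and of the global theta (Eq. 3.6).

    thetas:       dict city -> parameter vector
    theta_global: consensus parameter vector theta
    alpha_k:      dict neighbor city l -> alpha_{k,l} for the current indicator
    grad_loss:    dL(theta^k)/d theta^k, supplied by the caller
    Returns (new_theta_k, new_theta_global) without mutating the inputs.
    """
    theta_k = thetas[k]
    new_theta_k = []
    for d in range(len(theta_k)):
        fusion = sum(w * (theta_k[d] - thetas[l][d]) for l, w in alpha_k.items())
        grad = (grad_loss[d]
                + lam1 * fusion
                - lam2 * (theta_global[d] - theta_k[d])
                + lam3 * theta_k[d])
        new_theta_k.append(theta_k[d] - eta * grad)
    # Assumption: the global update uses theta^k from before this step.
    new_theta_global = [
        g - eta * lam2 * (g - t) for g, t in zip(theta_global, theta_k)
    ]
    return new_theta_k, new_theta_global

# Hypothetical one-dimensional example with a zero loss gradient.
thetas = {"A": [1.0], "B": [0.0]}
new_a, new_global = sgd_step(
    "A", thetas, [0.5], {"B": 1.0}, [0.0], eta=0.1, lam1=1.0, lam2=1.0, lam3=1.0
)
```

With a zero loss gradient, the step moves city A's weight toward its neighbor, toward the global model, and toward zero, while the global model moves toward city A.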

3.5 STAPLE Model for Precursor Mining The precursor documents are identified based on their estimated probabilities from the learned model of each city, as described in Algorithm 3.2 from lines 3 to 11. We also discover precursors for each city in the past time window from the neighboring cities (lines 12 to 17). For instance, in Figure 3, the precursors for the protest event in city A are also explored in the news articles geolocated at its neighboring cities, B to D, using the learned model vectors for each of the cities. If the estimated probability is above a certain threshold, we select the article into the precursor candidate pool for the target event in city A. The time complexity is dependent on the number of examples in the training dataset and the number of historical documents for each event. Based on the estimated probability of instances for events, the precursor documents and entities can be selected in linear time


Algorithm 3.2 STAPLE for Precursor Discovery

 1: Input: Dataset (X^k, Y^k), k ∈ [1, ..., K]
 2: Output: the estimated probabilities set Q, the discovered precursor set O.
 3: Θ ← call Algorithm 3.1 for training
 4: for city k in K do
 5:     for i ∈ [1, 2, ..., N_k] do
 6:         for X_h ∈ X^k_i do
 7:             for x_hj ∈ X^k_h do
 8:                 estimate p_hj using θ^k
 9:                 if p_hj ≥ 0.5 then
10:                     O^k_i ← x_hj, Q^k_i ← p_hj
11:         estimate P^k_i by Avg(p_hj)
12:         get “neighbor” cities for k from G_{t_i}
13:         for city l in the “neighbor” cities do
14:             for h = [1, ..., H] do
15:                 for x_hj ∈ X^l_h do
16:                     p_hj = σ(θ^l, x_hj)
17:                     O^k_i ← x_hj if p_hj ≥ ξ
    return O, Q

with respect to the number of historical documents for each event (lines 7 to 10).
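The selection logic of Algorithm 3.2 for a single target event can be sketched as follows. For brevity the sketch flattens the day structure, so the event probability is a plain average over the city's own documents rather than the nested day-level and event-level average; that simplification, the data layout, and the names are assumptions made for illustration.

```python
import math

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

def score(x, theta):
    """Document probability sigma(x^T theta)."""
    return sigmoid(sum(xi * ti for xi, ti in zip(x, theta)))

def discover_precursors(k, docs_by_city, thetas, neighbors, xi=0.5):
    """Select precursor candidates for one event in city k.

    docs_by_city: dict city -> list of (doc_id, vector) from the past H days
    thetas:       dict city -> learned parameter vector
    neighbors:    cities adjacent to k in the correlation graph
    Returns (O, Q, P): precursor ids, their probabilities, event probability.
    """
    O, Q = [], {}
    own = [(doc_id, score(x, thetas[k])) for doc_id, x in docs_by_city[k]]
    for doc_id, p in own:              # lines 7 to 10: the city's own news
        if p >= 0.5:
            O.append(doc_id)
            Q[doc_id] = p
    P = sum(p for _, p in own) / len(own)   # line 11 (flattened average)
    for l in neighbors:                # lines 13 to 17: neighboring cities
        for doc_id, x in docs_by_city[l]:
            if score(x, thetas[l]) >= xi:
                O.append(doc_id)
    return O, Q, P

# Hypothetical one-dimensional embeddings for two cities.
thetas = {"A": [1.0], "B": [-1.0]}
docs = {
    "A": [("a1", [2.0]), ("a2", [-2.0])],
    "B": [("b1", [-3.0]), ("b2", [3.0])],
}
O, Q, P = discover_precursors("A", docs, thetas, ["B"], xi=0.9)
```

Each document is scored exactly once, which is consistent with the linear-time selection noted in the text.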

4 Experimental Setup

4.1 Datasets We evaluated our models on event encoded data from six countries. Among them, three countries, Colombia (CO), Paraguay (PY), and Venezuela (VE), are from a labeled set called the Gold Standard Report (GSR) [14], covering January 2014 to April 2015. The GSR is a manually curated dataset that records civil unrest events from the ten most significant news outlets as ranked by the International Media and Newspapers in each country. A recorded event is described by its city, state, country, date, and event type. The other three datasets, Pakistan (PK), Iran (IR), and Afghanistan (AF), are from the ICEWS dataset [4], which stands for Integrated Crisis Early Warning System. It contains news articles published all over the world with the goal of evaluating national and international crisis events. Events are automatically identified and extracted by the BBN ACCENT event encoder. In our experiments, we only use data from 2015 and 2016. Each news article is labeled with one of the 20 categories. We use “protest” events as our positive examples and no-protest days (or other types of events) as negative examples. We observe that the event distribution varies significantly across cities, with large cities having relatively more event occurrences compared to small cities. Statistics about these datasets are shown in Table 1. We use protests in the aforementioned

Table 1: Datasets used in our experiments. CO: Colombia, PY: Paraguay, VE: Venezuela, PK: Pakistan, IR: Iran, AF: Afghanistan

            CO      PY      VE      PK       IR       AF
# news      8,386   7,879   9,390   64,868   38,113   27,786
# events    604     971     1,911   1,291    1,084    713

Latin American and Middle East/Asian countries to rigorously evaluate the performance of our framework and also consider other countries (e.g., France) to provide case studies of how our framework works, because these countries feature more well-known protests (e.g., climate protests in France, see Table 3).

4.2 Experimental Protocol We learn 300-dimensional representations for documents and 100-dimensional embeddings for words. Each document is represented by concatenating the document and entity embeddings. The entity embedding is an aggregation of the embeddings of each entity within the document. The GSR and ICEWS datasets record protest events on a given day at a specific location (city level). To evaluate our proposed model, for each protest event, we extract all the published reports (news articles) for up to two weeks before the occurrence of the specific event of interest. This ordered collection of per-day news documents leading up to the protest day is considered a positive super bag. For negative examples, we identify consecutive sets of days within our studied time periods for each city when no protest event was reported.
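The concatenated representation described above can be illustrated with a short Python sketch. The aggregation operator is not specified in the text, so the element-wise mean used here is an assumption, as are the zero vector for documents without entities and the function names.

```python
def mean_vector(vectors, dim):
    """Element-wise mean of equal-length vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def document_representation(doc_vec, entity_vecs, entity_dim=100):
    """Concatenate a document embedding with its aggregated entity embedding."""
    return list(doc_vec) + mean_vector(entity_vecs, entity_dim)

# Hypothetical inputs: a 300-dim document vector and two 100-dim entity vectors.
doc_vec = [0.1] * 300
entity_vecs = [[1.0] * 100, [3.0] * 100]
rep = document_representation(doc_vec, entity_vecs)
```

Under these assumptions every document is a 400-dimensional vector, so the feature dimension $m$ of each city model would be 400.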

4.3 Comparative Methods (1) Multi-Instance SVM (MI-SVM [2]): The MI-SVM model extends the notion of a margin from individual patterns to bags. The margin is defined between the “most positive” instance of the positive bag and the “least negative” instance of the negative bag. We collapse the news articles from the different historical days into one bag and apply this standard MIL formulation based on the SVM in scikit-learn (see footnote 1).

(2) Relaxed MIL Model (rMILavg [18]): We estimate the probability of each instance in a bag using a logistic function. We use the average of instance-level probabilities to model the probabilities for bags.

(3) Nested MIL (nMIL [13]): This approach applies a nested level of multi-instance learning for the event forecasting problem. It learns a “global” model for all cities in a country.

(4) Transfer model (STAPLE-tx): This method is a simpler variant of the proposed model. Event forecasting for each city is considered as one task. We first apply our objective function to cities with “rich”

1 http://scikit-learn.org/stable/modules/svm.html



[Figure 5 graphic: one bar-chart panel per dataset (VE, CO, PK, IR, AF, PY) showing Recall, Precision, and F1 for MISVM, Relaxed, Nested, STAPLE-tx, and STAPLE.]

Figure 5: Prediction performance (Recall, Precision, and F1 scores) on six datasets for the comparison methods.

datasets (more event examples). After learning models for these source cities, we incorporate these models into an average model with mean and standard deviation $\mu, \sigma$ and transfer them to the cities that have “sparse” datasets [15]. The target cities will then learn their models by assuming that their model parameters are drawn from a Gaussian distribution $N(\mu, \sigma)$, where $S$ denotes the set of source cities:

$$\mu = \frac{1}{K - 1} \sum_{k \in S} \theta^k, \qquad \sigma = \sqrt{\frac{1}{|S| - 1} \sum_{k \in S} (\theta^k - \mu)^2} \tag{4.7}$$

5 Experimental Results

We perform a comprehensive empirical study to evaluate the performance of the proposed models in terms of event recall, event precision, and F1 score. We also analyze the quality of precursors with quantitative metrics and detailed case studies that highlight the strengths of the STAPLE model.

5.1 How well does STAPLE forecast? Figure 5 shows the prediction performance of the MI-SVM, rMILavg, nMIL, and STAPLE methods in terms of recall, precision, and F1 scores on the six datasets considered. The model hyperparameters $\lambda_1$, $\lambda_2$, and $\lambda_3$ are determined using a validation set for each country. $m_0$ and $p_0$ are set to 0.5 following [13]. For the other comparison models, hyperparameters are chosen following the criteria in their papers. The STAPLE method outperforms the best state-of-the-art approach (nMIL) by

Table 2: F1 evaluation for the proposed methods (Text Embeddings vs. Text+Entity Embeddings).

Data   Model       Text F1   Text+Entity F1
AF     STAPLE-tx   0.772     0.783
AF     STAPLE      0.910     0.910
IR     STAPLE-tx   0.880     0.893
IR     STAPLE      0.895     0.904
PK     STAPLE-tx   0.729     0.725
PK     STAPLE      0.730     0.713

[Figure 6 graphic: two scatter panels, (a) ICEWS and (b) GSR; x-axis: #Events, y-axis: F1 Lift.]

Figure 6: F1 lift per city using STAPLE compared to the state-of-the-art model, nMIL. The x-axis denotes the number of events at the city level. The y-axis denotes the F1 lift from the STAPLE model.

16% to 35% for AF, IR, and PK and 5% to 15% for CO and PY, in terms of F1 scores. For CO and VE, the STAPLE-tx model outperforms other methods in terms of F1 score. The results in Table 2 demonstrate that the methods that take into account the entity embeddings (second column) perform as well as or better than those using the document embeddings only (first column) for the AF and IR datasets, but not for PK, in terms of F1 score.

5.2 Does the Spatio-Temporal Event Correlation Graph help? Figure 6 depicts the improvements in F1 scores of the STAPLE model compared to the nMIL model, $\text{F1 Lift} = (\text{STAPLE}_{F1} - \text{nMIL}_{F1}) / \text{nMIL}_{F1}$, versus the number of events per city. We observe that the most significant improvements are for cities such as Qom and Kerman in Iran, which are relatively small cities in their respective countries.
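For reference, the lift metric amounts to a relative improvement over the baseline F1. A minimal Python sketch follows; the helper names and example scores are illustrative only and are not values from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f1_lift(staple_f1: float, nmil_f1: float) -> float:
    """F1 Lift = (STAPLE_F1 - nMIL_F1) / nMIL_F1; requires nmil_f1 > 0."""
    return (staple_f1 - nmil_f1) / nmil_f1

# Illustrative scores for one hypothetical city.
lift = f1_lift(0.9, 0.6)
```

A lift of 0.5 means the STAPLE F1 is 50% higher than the nMIL F1 for that city; a city where the baseline F1 is very low can therefore show a lift well above 1.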

5.3 How good are the precursors? Figure 7 shows the average normalized Jaccard index of precursor documents and non-precursor documents (with respect to the target document) discovered by the STAPLE and nMIL models. The normalized Jaccard index is computed as the pairwise Jaccard index, scaled relative to each event of interest. It is clear that the precursor documents have higher text-based similarity to the target events compared to the non-precursor documents, as in


[Figure 7 graphic: two bar-chart panels over AF, IR, PK, CO, PY, and VE. (a) nMIL vs. STAPLE, y-axis: Avg. Norm. Jaccard Diff; (b) Precursor vs. the rest, y-axis: Avg. Norm. Jaccard.]

Figure 7: Comparison of the average normalized Jaccard index for precursor news and non-precursor news articles. (a) Shows the difference between Jaccard scores for the precursor and non-precursor documents for the STAPLE and nMIL models. (b) Shows the average normalized Jaccard index for precursor and non-precursor documents for STAPLE.

Figure 7b. Figure 7a shows that the STAPLE model discovers more semantically related documents to the target events compared to the nMIL model [13] for most of the countries.
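The pairwise Jaccard index underlying this comparison can be sketched in Python as below. The sketch stops at the raw index averaged over a set of documents; the per-event scaling step is not specified in enough detail to reproduce, and whitespace tokenization and the example sentences are assumptions.

```python
def jaccard(a_tokens, b_tokens):
    """Jaccard index |A ∩ B| / |A ∪ B| over token sets (0.0 if both empty)."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def mean_similarity(target, candidates):
    """Average Jaccard index between a target document and candidate documents."""
    return sum(jaccard(target, c) for c in candidates) / len(candidates)

# Hypothetical documents, tokenized on whitespace.
target = "activists protest in paris over climate".split()
precursors = ["activists plan climate rally in paris".split()]
others = ["football results from the weekend".split()]

precursor_sim = mean_similarity(target, precursors)
other_sim = mean_similarity(target, others)
```

In this toy case the precursor shares four of eight distinct tokens with the target while the unrelated article shares none, mirroring the gap between precursor and non-precursor documents reported in Figure 7b.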

5.4 Case Studies Table 3 demonstrates a case study on precursor story lines identified by the STAPLE approach. The key strength of STAPLE is its ability to leverage correlated events occurring across different cities. For each protest event, $e$, in a city A, we form a neighborhood city set based on the spatio-temporal correlation graph. Then, for each city in the graph, we estimate the probabilities of news in the past week using its model for event $e$. As long as the probability is above a certain threshold, we select the news to be a precursor candidate for this event: $O_e^A \leftarrow x$ if $p(x) \ge \xi$. We studied the ICEWS dataset from France as a case study. In the case of a protest event in Paris, France, thousands of activists started a protest near the site of a former terror attack, pleading for leaders to stop global warming. Our model successfully captured the following related events leading up to this protest: A week earlier, people in Toulouse marched in response to the devastating attacks in Paris. Many countries announced they would be attending the upcoming Climate Change Summit (COP21) in Paris. Two days before, Australia kicked off climate rallies ahead of these global talks. One day before, activists claimed the need to march for climate change amid terror threats.

In all cases, our model captured the key change points in the precursor story lines for protest events and the key entity names (such as government officials) that played a crucial role in the development of these events.

5.5 How early can the STAPLE model forecast? Leadtime indicates the number of days in advance that the model makes predictions, and historical days denotes the number of days over which the news

[Figure 8 graphic: two line-chart panels, (a) PY and (b) IR; x-axis: LeadTime (day), y-axis: Test Accuracy; series: nMIL, STAPLE-tx, STAPLE.]

Figure 8: Test accuracy for two countries over leadtime ranging from 1 to 5 days. The x-axis denotes the leadtime and the y-axis denotes the accuracy score over the test dataset.

[Figure 9 graphic: two line-chart panels, (a) λ1 and (b) λ2; x-axis: hyperparameter value from 0.1 to 1, y-axis: Test Accuracy; series: PK, IR, AF.]

Figure 9: Sensitivity analysis on hyperparameters λ1 and λ2 (= λ3). The x-axis represents the varying values of λ1, λ2 and the y-axis denotes the test accuracy.

articles are extracted as input to the prediction algorithms. In order to study the changes in performance with and without the spatio-temporal event correlation graph constraints, we present the accuracy (ACC) score with varying lead times from 1 to 5 days for the STAPLE and the nMIL models in Figure 8. Due to space constraints, we only depict results for PY and IR. The STAPLE model is stable and consistently performs better compared to others with varying values of leadtime.

5.6 Sensitivity Analysis Figure 9 illustrates the accuracy of the proposed STAPLE model on three test datasets when varying the hyperparameters λ1 and λ2 (= λ3). λ2 and λ3 are set to the same value according to our parameter search results. The performance across different values is relatively stable; due to space constraints, it is shown only for AF, PK, and IR.

6 Related Work

Spatial Event Forecasting Predicting future events of interest (e.g., protests) from large, heterogeneous, open source feeds is an important and active area of research [21]. Over the years, supervised and



Table 3: Automatically discovered precursor news stories and relevant entities for one protest event of interest. (i) Precursors for an event in one city are inferred across other cities as well. (ii) Key entities participating in the events provide explanatory power to the precursors. (iii) Probability of protest events gradually increases with the accumulation of precursor events.

Date | Location | Precursor News Summary | Entity | Prob

2015-11-22 | Toulouse | P1. More than 10,000 people marched Saturday in the French city of Toulouse for peace and against “barbarity” a week after the devastating attacks in the capital. | French, Toulouse | 0.81

2015-11-23 | Paris | P1. Wealthy governments and other donors need to invest more to reduce carbon emissions stemming from agriculture, said a study issued ahead of U.N. climate talks in Paris next week. | UN, Paris | 0.68

2015-11-24 | Paris | P1. China and US have vowed to join hands with France and other parties to work toward success at the UN climate summit in Paris. | China, US, UN, Paris | 0.81

2015-11-25 | Paris | P1. The international community must secure a binding deal against climate change at key UN talks in Paris next week, German Chancellor Angela Merkel said Wednesday. | German, Angela Merkel, UN, French, Aquino, Paris | 0.83

2015-11-26 | Paris | P1. It comes days ahead of a major UN climate summit in Paris which aims to forge an international deal to stop global warming. P2. Pope Francis visited the world’s poorest continent to issue a clarion call for the COP21. | UN, Pope Francis, COP21, Paris | 0.90

2015-11-27 | Paris | P1. Leaders from Russia, Germany, and Europe may meet around the upcoming UN climate change conference in Paris. P2. Australia kicks off climate rallies ahead of global talks. | Russia, Germany, Europe, Australia, Paris | 0.94

2015-11-28 | Paris | P1. Activists plan to join arms and form a “human chain” in Paris on Sunday to urge action on global warming, in a muted rally after attacks on the city. | UN, Paris | 0.92

2015-11-29 | Paris | Protest in Paris, France: Around 4,500 activists had earlier linked hands in a peaceful protest near the site of the deadliest of the attacks, pleading for leaders to curb global warming.

unsupervised learning approaches have been developedto tackle this problem in different domains. Advancedtechniques which use a combination of sophisticated fea-tures, such as topic related keywords, as input to sup-port vector machines, LASSO, and multi-task learningapproaches [17, 22] have also been studied. Ramakr-ishnan et al. [14] designed a comprehensive framework(EMBERS) for predicting civil unrest events in differentlocations, using a combination of machine learning mod-els ingested with heterogeneous input sources rangingfrom social media to satellite images. Laxman et al. [7]designed a generative model for categorical event predic-tion in streaming data by identifying frequent episodes.A recent work [19] has studied a multi-modal transferlearning method (FLORAL) to transfer knowledge froma city where there is sufficient multi-modal data and la-bels, to other cities and locations.

Precursor Discovery In the multiple instance learning (MIL) paradigm [23, 9], labels are associated with sets of instances, commonly referred to as bags or groups, instead of individual instances. Individual instance-level labels are unknown or missing. MIL turns out to be a natural fit for the precursor mining problem because the labels are only associated with the events of interest but not with the precursor documents in history. Traditional MIL formulations make strong assumptions, e.g., that the aggregation function over instance labels is a noisy-OR function; i.e., the positive bags contain at least one positive instance and the negative bags contain only negative instances. A recent work [6] has developed instance-level predictions from group labels (GICF), which allows for general aggregation functions. Other multiple instance learning approaches and applications are surveyed in [1]. A nested multi-instance learning (nMIL) framework [13] has been proposed to forecast civil unrest events and detect precursor news articles for these events. However, this approach does not account for spatial dependencies and does not perform satisfactorily when only a limited amount of data is available.
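The difference between a noisy-OR aggregation and a more general one can be seen in a small sketch. The noisy-OR form below is the standard one, P(bag) = 1 - ∏(1 - p_i). The mean is used only as one simple example of a general aggregation function and is not claimed to be the function used by GICF or nMIL. The function names are illustrative.

```python
import math


def noisy_or(instance_probs):
    """Bag probability under noisy-OR: positive if any instance 'fires'."""
    return 1.0 - math.prod(1.0 - p for p in instance_probs)


def mean_aggregate(instance_probs):
    """Bag probability as the average of instance probabilities."""
    if not instance_probs:
        raise ValueError("a bag must contain at least one instance")
    return sum(instance_probs) / len(instance_probs)


# One confident instance dominates under noisy-OR but is diluted by the mean.
bag = [0.1, 0.9]
p_noisy_or = noisy_or(bag)
p_mean = mean_aggregate(bag)

# A bag with only negative instances stays negative under noisy-OR.
p_negative_bag = noisy_or([0.0, 0.0, 0.0])
```

Under noisy-OR a single strong precursor is enough to make the bag positive, whereas an averaging aggregator requires evidence to accumulate across several instances.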

Representation Learning In practice, finding good feature representations to model news articles is not a trivial problem. Traditionally, the bag-of-words representation allows for easy interpretation but requires preprocessing and feature selection. Several researchers have developed efficient and effective neural network based representations for language models [3, 12]. Entity recognition has also been widely applied in natural language processing tasks [10, 5]. Most societal events are related to or even caused by known entities such as persons, organizations, and geolocations. In this paper, we combine document level embeddings with recognized entity embeddings for better explanation of precursor story lines and event forecasts.
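As a rough illustration of combining document and entity embeddings, the sketch below averages the entity vectors of an article and concatenates the result with the document vector. Both the averaging and the concatenation are assumptions made for this example; the exact augmentation used by the model may differ. All names and the toy vectors are hypothetical.

```python
def mean_vector(vectors, dim):
    """Component-wise mean of equal-length vectors; zeros if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def augment(doc_vec, entity_vecs, entity_dim):
    """Concatenate a document embedding with the mean of its entity embeddings."""
    return list(doc_vec) + mean_vector(entity_vecs, entity_dim)


# Toy 2-dimensional embeddings for illustration only.
doc_vec = [1.0, 2.0]
entity_vecs = [[0.0, 4.0], [2.0, 0.0]]  # e.g. vectors for two recognized entities
augmented = augment(doc_vec, entity_vecs, entity_dim=2)
no_entities = augment(doc_vec, [], entity_dim=2)
```

Keeping the entity part as a separate block of the vector makes it possible to inspect which entities contribute to a precursor, at the cost of a larger input dimension.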

The methods proposed in this paper can be viewed as complementary to the prior work discussed above, casting the spatial event forecasting and precursor discovery problems into a multi-instance learning framework with a fusion penalty based on spatio-temporal correlation graphs and augmented representations.
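For readers unfamiliar with fusion penalties, the sketch below shows a common graph-guided form in which the parameter vectors of two linked cities are pulled together in proportion to the edge weight. It assumes a squared Euclidean difference and a static edge list; the penalty used in the model itself may take a different form, and the city names, weights, and function name are illustrative only.

```python
def fusion_penalty(weights, edges):
    """Sum over edges of strength * squared distance between city parameters.

    weights: dict mapping a city name to its parameter vector (list of floats)
    edges:   list of (city_a, city_b, strength) tuples from a correlation graph
    """
    total = 0.0
    for city_a, city_b, strength in edges:
        diff_sq = sum((a - b) ** 2 for a, b in zip(weights[city_a], weights[city_b]))
        total += strength * diff_sq
    return total


# Toy parameters and one correlation edge, for illustration only.
weights = {"Paris": [1.0, 0.0], "Toulouse": [0.0, 0.0]}
edges = [("Paris", "Toulouse", 0.5)]
penalty = fusion_penalty(weights, edges)
penalty_identical = fusion_penalty({"Paris": [1.0, 0.0], "Toulouse": [1.0, 0.0]}, edges)
```

The penalty is zero when linked cities share identical parameters, so adding it to the training loss lets cities with few examples borrow strength from correlated neighbors.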

7 Conclusion

We presented STAPLE, a multi-task spatio-temporal correlation graph model based on a two-level multi-instance learning (MIL) framework for precursor mining coupled with event forecasting. Models for multiple cities are learned jointly and are shown to be effective at both forecasting events and discovering precursors. The richness of the identified precursor events demonstrates that STAPLE will be a useful tool for a better understanding of how events unfold. For future work, we plan to study narrative generation and entity knowledge graph extraction from multi-source datasets.

Acknowledgements

This work was partially funded by the Office of Naval Research under contract N00014-16-C-1054 and by the U.S. Department of Homeland Security under Grant Award Number 2017-ST-061-CINA01. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Office of Naval Research or the U.S. Department of Homeland Security.

References

[1] Jaume Amores. Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence, 201:81–105, 2013.

[2] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines for multiple-instance learning. In NIPS, pages 561–568, 2002.

[3] Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March 2003.

[4] Elizabeth Boschee, Jennifer Lautenschlager, Sean O’Brien, Steve Shellman, James Starz, and Michael Ward. ICEWS coded event data, 2016.

[5] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL, pages 363–370, 2005.

[6] Dimitrios Kotzias, Misha Denil, Nando de Freitas, and Padhraic Smyth. From group to individual labels using deep features. In KDD, pages 597–606, 2015.

[7] Srivatsan Laxman, Vikram Tankasali, and Ryen W. White. Stream prediction using a generative model based on frequent episodes in event sequences. In KDD, pages 453–461, 2008.

[8] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188–1196, 2014.

[9] Yu-Feng Li, Ju-Hua Hu, Yuan Jiang, and Zhi-Hua Zhou. Towards discovering what patterns trigger what labels. In AAAI, pages 1012–1018, 2012.

[10] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, pages 2181–2187, 2015.

[11] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[12] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR, 2013.

[13] Yue Ning, Sathappan Muthiah, Huzefa Rangwala, and Naren Ramakrishnan. Modeling precursors for event forecasting via nested multi-instance learning. In KDD, pages 1095–1104, 2016.

[14] Naren Ramakrishnan, Patrick Butler, Sathappan Muthiah, et al. “Beating the news” with EMBERS: Forecasting civil unrest using open source indicators. In KDD, pages 1799–1808, 2014.

[15] Michael Rosenstein, Zvika Marx, Tom Dietterich, and Leslie Pack Kaelbling. Transfer learning with an ensemble of background tasks. In NIPS Workshop on Inductive Transfer, 2005.

[16] Ben Tan, Evan Wei Xiang, Qiang Yang, and Erheng Zhong. Multi-transfer: Transfer learning with multiple views and multiple sources. In SDM, pages 243–251, 2013.

[17] Xiaofeng Wang, Matthew S. Gerber, and Donald E. Brown. Automatic crime prediction using events extracted from Twitter posts. In SBP, pages 231–238, 2012.

[18] Xinggang Wang, Zhuotun Zhu, Cong Yao, and Xiang Bai. Relaxed multiple-instance SVM with application to object discovery. CoRR, 2015.

[19] Ying Wei, Yu Zheng, and Qiang Yang. Transfer knowledge between cities. In KDD, pages 1905–1914. ACM, 2016.

[20] Pei Yang and Wei Gao. Multi-view discriminant transfer learning. In IJCAI, pages 1848–1854, 2013.

[21] Chao Zhang, Guangyu Zhou, Quan Yuan, Honglei Zhuang, Yu Zheng, Lance Kaplan, Shaowen Wang, and Jiawei Han. GeoBurst: Real-time local event detection in geo-tagged tweet streams. In SIGIR, pages 513–522, 2016.

[22] Liang Zhao, Qian Sun, Jieping Ye, Feng Chen, Chang-Tien Lu, and Naren Ramakrishnan. Multi-task learning for spatio-temporal event forecasting. In KDD, pages 1503–1512, 2015.

[23] Zhi-Hua Zhou and Jun-Ming Xu. On the relation between multi-instance learning and semi-supervised learning. In Zoubin Ghahramani, editor, ICML, volume 227, pages 1167–1174, 2007.

