[IEEE 2012 IEEE International Conference on Technologies for Homeland Security (HST) - Waltham, MA, USA (2012.11.13-2012.11.15)] 2012 IEEE Conference on Technologies for Homeland Security

A Method for Analyzing Terrorist Attacks

Ibrahim Toure
Department of Information Systems
University of Maryland Baltimore County (UMBC)
Baltimore, Maryland 21250
Email: [email protected]

Aryya Gangopadhyay
Department of Information Systems
University of Maryland Baltimore County (UMBC)
Baltimore, Maryland 21250
Email: [email protected]

Abstract—Analyzing terrorist attacks is important for homeland security. Analyses of past records can provide important information on those attacks and enable appropriate actions to prevent similar attacks in the future. In this research, we present a novel method based on Latent Dirichlet Allocation to analyze data collected by START (Study of Terrorism and Responses to Terrorism) from 1970 to 2010. The first step in our method consists of generating topic models from the data. We then identify the most frequent terms occurring across various topic distributions. Moreover, we study the evolution of different kinds of attacks that occurred over time. The results show that a distinct change in attack patterns emerges over the past four decades.

I. INTRODUCTION

Since 9/11, homeland security has become a top defense priority in the United States. Homeland security agencies must be prepared at all times to counter or prevent attacks such as the one on 9/11, and it is also important to understand the reasons for these attacks. Achieving this requires methods and techniques that identify patterns as well as changes in those patterns over time. Such information can not only help prevent future terrorist attacks, but also identify the reasons behind those attacks so that appropriate countermeasures can be taken. This study focuses on identifying patterns from historical data on terrorist attacks by discovering hidden themes or topics from attack descriptions. We then present our findings in a text-based summary. The data set used for this study was collected by START (Study of Terrorism and Responses to Terrorism) from 1970 to 2010 [1]. We preprocessed the data set, generated the topics, and then analyzed the results. The results demonstrate that our method can be used effectively on data sets of terrorist attacks to discover useful information. Moreover, the information gleaned by our method can be updated with data collected on future terrorist attacks.

II. RELATED WORK

Systematic research on analyzing terrorist attacks has been scarce. In this section we discuss some of the existing work in this area.

A. Crime Prediction Model

Fatih Ozgul et al. [2] developed a novel method called CPM (Crime Prediction Model) to identify the terrorist groups behind unsolved attacks. CPM learns from terrorist attacks, matches them based on the similarities of their properties, and then clusters them into groups. The method was applied to real-life data on terrorist attacks that occurred in Turkey between 1970 and 2005. The predictions of CPM gave good precision for large terrorist groups and adequate recall for small terrorist groups.

B. Investigative Data Mining (IDM)

The purpose of IDM is to apply SNA (Social Network Analysis) and other data mining methods to a terrorist network in order to reveal the links between the actors and the importance of each actor. Muhammad Akram Shaikh and Wang Jiaxin [3] conducted a study on identifying key nodes in terrorist networks. The idea is to mine a network of terrorists and identify both the most influential actors in the network (leaders) and the coordinators of transactions (gatekeepers) for activities such as passing weapons. SNA methods were used on the EB dataset of the U.S. embassy bombing in Tanzania to identify the main actors in the network. First, an adjacency matrix was constructed from the dataset of 16 terrorists, where the value 1 means there is a connection between two actors and 0 means there is none. Second, the key actors were identified based on the computed values of degree centrality, betweenness, closeness, and eigenvector centrality. Finally, the results were displayed as a graph of the connections between actors, in which the main actors had more connections. After the main actors are identified, appropriate actions can be taken to destabilize the network by removing or isolating them, thus potentially slowing down the network's interactions and plans.
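As a rough illustration of the SNA steps summarized above, the following sketch computes degree and closeness centrality from a small adjacency matrix. The five-actor matrix and the function names are invented for illustration; this is not the embassy-bombing dataset, nor the implementation used in [3], and betweenness and eigenvector centrality are omitted for brevity.

```python
from collections import deque

# Hypothetical 5-actor network (NOT the 16-actor dataset from [3]).
# adjacency[i][j] == 1 means actors i and j are connected, 0 otherwise.
adjacency = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]


def degree_centrality(adj):
    """Fraction of the other actors that each actor is directly linked to."""
    n = len(adj)
    return [sum(row) / (n - 1) for row in adj]


def closeness_centrality(adj):
    """(n - 1) divided by the sum of shortest-path distances to all others."""
    n = len(adj)
    scores = []
    for start in range(n):
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if adj[u][v] and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        # Actors that cannot reach everyone get a score of 0 in this sketch.
        scores.append((n - 1) / total if len(dist) == n and total else 0.0)
    return scores


degree = degree_centrality(adjacency)
closeness = closeness_centrality(adjacency)
# The actor with the highest degree is a candidate "leader".
key_actor = max(range(len(adjacency)), key=lambda i: degree[i])
print(key_actor, degree, closeness)
```

In this toy network actor 0 has the most direct connections and is also the closest to all other actors, which mirrors the observation that the main actors had more connections.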

III. METHODOLOGY

A. Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a mixed membership model that has been used to discover hidden themes or “topics” in a document corpus. For more details on topic models and LDA, see [4], [5]. In the context of a corpus of text documents, a topic model captures the underlying themes or topics that exhibit themselves in different proportions in the documents [6], [7]. The topics themselves are distributions over the words or terms that appear in the corpus. LDA generates each document in a corpus as a bag of words given a set of hidden parameters. Because the only observable data are the words that appear in the documents, the challenge is to estimate those hidden parameters: the word distributions in topics, the topic proportions in documents, and the assignments of words to topics. Several methods have been proposed for parameter estimation in LDA; we follow the mean-field variational methods [8] in this work. Model estimation and inference were done using the lda-c implementation [9].
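For reference, the generative process that LDA assumes can be written as below, following the standard formulation in [4]. The symbols $\alpha$, $\beta$, $\theta_d$, $z_{dn}$, and $w_{dn}$ are the usual notation from [4] and are not used elsewhere in this paper; $\theta_d$ and $z_{dn}$ are hidden, and only the words $w_{dn}$ are observed.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic proportions of document } d \\
z_{dn} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d)
  && \text{topic assignment of the $n$-th word in document } d \\
w_{dn} \mid z_{dn}, \beta &\sim \mathrm{Multinomial}(\beta_{z_{dn}})
  && \text{observed word, drawn from the assigned topic}
\end{align*}
```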

B. Method

The input data to our method is a set of texts, one describing each attack. These texts can be obtained from the news media, including the Internet and print media, as well as from other sources such as agencies that track information on terrorist attacks. Our goal in this methodology is to discover the “hidden” themes in these texts. The first step is to generate a term-document matrix (D), whose rows correspond to terms and whose columns correspond to documents. Each document contains the textual information on one attack, and each entry in D is the frequency of a term in the corresponding attack document. Prior to creating the matrix D, we remove common stop words such as articles and prepositions. The matrix D is then used to create a number of topic models with different numbers of topics.
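The construction of D can be sketched as follows. This is a minimal illustration in Python, not the TMG toolbox used in our experiments; the three attack descriptions, the stop word list, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical attack descriptions and a tiny stop word list, for illustration only.
documents = [
    "A bomb exploded in the market",
    "Gunmen attacked the police station",
    "A bomb was found near the station",
]
stop_words = {"a", "an", "the", "in", "of", "was", "near"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop_words]


def term_document_matrix(docs):
    """Return (terms, D), where D[i][j] is the count of terms[i] in docs[j]."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


terms, D = term_document_matrix(documents)
for term, row in zip(terms, D):
    print(term, row)
```

Rows are terms and columns are documents, matching the orientation of D described above. Because the tokenizer keeps alphabetic tokens only, numbers and dates are dropped as well.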

Our proposed method is presented in Algorithm 1. The first input parameter to the algorithm is the term-document matrix described above. The second input parameter, N, is a list of numbers, each giving the number of topics to be created for one topic model. In our case this was {50, 100, 300, 350, 400}. These numbers are somewhat arbitrary, as there is no known method for determining the “ideal” number of topics. However, too few topics will capture only broad-brush patterns, and too many will overfit the model to the data. The third input parameter, k, is the number of top terms retained for each topic. Each topic model Ti consists of ni topics. Each topic is a list of terms, where the jth topic in Ti is denoted by Tij. Each Tij is truncated to only its top-k terms (line 4 in Algorithm 1). Next we take the intersection of the truncated topics to obtain the truncated top-k terms of Ti (line 6 in Algorithm 1). The final ordered list of terms T is created by taking the intersection of all truncated ordered sets (line 8 in Algorithm 1). The summary, shown in lines 9 and 10 of Algorithm 1, can be generated manually or by using natural language generation [10], which is not discussed in this paper.
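The truncation and intersection steps (lines 1 to 8 of Algorithm 1) can be sketched in Python as follows. The sketch assumes that each topic model is already available as a list of topics and that each topic is a mapping from term to weight; this representation, the toy weights, and the function names are assumptions made for illustration and do not reflect the lda-c output format. Summary generation (lines 9 and 10) is left out.

```python
def top_k_terms(topic, k):
    """Return the k highest-weighted terms of one topic, best first."""
    # Ties are broken alphabetically so that the result is reproducible.
    ranked = sorted(topic.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def common_terms(ordered_lists):
    """Intersect ordered term lists, keeping the order of the first list."""
    if not ordered_lists:
        return []
    shared = set(ordered_lists[0]).intersection(*ordered_lists[1:])
    return [term for term in ordered_lists[0] if term in shared]


def analyze(topic_models, k):
    """Lines 1-8 of Algorithm 1: truncate every topic, then intersect."""
    per_model = []
    for model in topic_models:
        truncated = [top_k_terms(topic, k) for topic in model]  # line 4
        per_model.append(common_terms(truncated))               # line 6
    return common_terms(per_model)                              # line 8


# Two hypothetical topic models with 2 and 3 topics.
model_a = [
    {"bomb": 0.5, "explosive": 0.3, "police": 0.2},
    {"explosive": 0.4, "bomb": 0.35, "market": 0.25},
]
model_b = [
    {"bomb": 0.6, "firearm": 0.3, "explosive": 0.1},
    {"bomb": 0.4, "explosive": 0.4, "station": 0.2},
    {"explosive": 0.5, "bomb": 0.3, "hostage": 0.2},
]

print(analyze([model_a, model_b], k=2))
print(analyze([model_a, model_b], k=3))
```

A strict intersection shrinks quickly as the number of topics grows, so in practice k has to be large enough for any terms to survive; the toy example shows the result growing from one term to two when k increases from 2 to 3.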

IV. EXPERIMENTAL RESULTS

A. Platform

We used Text to Matrix Generator (TMG) [11], a MATLAB toolbox, to convert a corpus of text data to a term-document matrix and a dictionary of terms. The topic models were generated using the C implementation of LDA [9]. We used an AMD Opteron machine with 47 processors (12 cores each) and 504 gigabytes of physical memory for our experiments.

In our experiments, we removed all non-textual data such as numbers and dates, as we are only interested in the textual descriptions. We used a stop word list [12] to remove common English words in order to avoid biasing the results with non-content terms.

Algorithm 1 Algorithm for Analyzing Terrorist Attacks
Input: D: an m × n term-document matrix; N = {n1, n2, ..., nl}: set of numbers of topics; k: number of significant terms in a topic
Output: Summary of attack patterns S.
 1: for i = 1 to l do
 2:   Create a topic model Ti corresponding to the number of topics ni ∈ N
 3:   for j = 1 to ni do
 4:     Truncate topic Tij ∈ Ti to the top-k terms
 5:   end for
 6:   Ti = ⋂_{j=1}^{ni} Tij
 7: end for
 8: T = ⋂_{i=1}^{l} Ti
 9: Generate summary S from T
10: Return S

B. Data

We used the data collected by START (Study of Terrorism and Responses to Terrorism) from 1970 to 2010 [1]. START is a national consortium based at the University of Maryland. Its mission is to provide data-driven knowledge to homeland security to help in counterterrorism and in responses to terrorist attacks, and to provide information about the human causes and consequences of those attacks. Knowledge discovered from such data can help policy and decision-makers make appropriate decisions for homeland security.

The collected data has several attributes: the year, month, and day (or approximate date) on which the attack occurred; relevant event dates, such as the release of a kidnapped person; and information on the countries, regions, provinces/states, cities, and specific locations of the attacks. In addition, the data contains detailed information about the attacks, including news reports. The data also contains information about the category of crime committed. The categories include Insurgency/Guerilla Action, Purely Criminal Act, and Mass Murder or Internecine Conflict Action. The data set has a total of nine (9) attack types: Assassination, Armed Assault, Bombing/Explosion, Hijacking, Hostage Taking (Barricade Incident), Hostage Taking (Kidnapping), Facility/Infrastructure Attack, Unarmed Assault, and Unknown, with one attack type provided for each attack. Three supplemental fields are provided for cases in which an attack can be classified under more than one attack type. For example, an attack can be classified as both Assassination and Armed Assault.

A key piece of information concerns the targets. Targets are classified into twenty-two (22) different categories, including Business, Government (General), Police, Military, and Religious Figures/Institutions. In addition to the target categories, the data contains details about subcategories within them. For example, in an “Airline & Airport” corporation, the category “Trans World Airline” may have a target “Flight 802 Boeing 707”. The nationalities of the targets are also available. For each attack, the name of the group responsible, if known, is included in the data. An example of a terrorist group in the data set is “Al-Qaida and Taliban”. Motives and claims of responsibility are also recorded. Some of the claims are made after the attacks (post-incident) by letter, phone call, etc. Weapons are divided into types and sub-types, with a total of 13 weapon types in the data set. An example of a type is Explosives/Bombs/Dynamite, and a corresponding sub-type is pressure trigger or time fuse. Details of the weapons used in the attacks are recorded, for instance Molotov cocktail or firebomb. The data also includes an estimate of the cost of the damage caused by each attack, along with comments. For kidnappings, the outcomes are recorded. The data on these terrorist attacks is compiled from twenty-one (21) different source databases.

Fig. 1. Term Frequencies across Topics for United States

C. Analysis

In our analysis, each attack corresponds to one document and each distinct word to one term. In order to decide on the best number of topics, we generated models with 50, 100, 300, 350 and 400 topics on the same dataset. The 50-topic model produced a greater variety of terms, and 12 of its most frequent words were also found among the 20 most frequent terms of the other models (100, 300, 350 and 400 topics). The most frequent terms and their frequencies are shown in Figure 1. We counted the frequencies of the terms and exported the top 20 most frequent terms from a MySQL database.
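The counting step can be sketched as follows. This is one plausible reading of “term frequencies across topics”, namely the number of topics in which a term appears; it uses Python's collections.Counter in place of the MySQL database used in our experiments, and the topic lists and function name are hypothetical.

```python
from collections import Counter

# Hypothetical top-term lists, one list per topic.
topics = [
    ["bomb", "explosive", "police"],
    ["bomb", "firearm", "market"],
    ["explosive", "bomb", "station"],
]


def most_frequent_terms(topic_term_lists, n):
    """Count in how many topics each term appears and return the n most frequent."""
    counts = Counter(term for terms in topic_term_lists for term in set(terms))
    # Sort by descending count, then alphabetically, so ties are reproducible.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


top_terms = most_frequent_terms(topics, 2)
print(top_terms)
```

With the real data, n would be set to 20 to reproduce the top-20 lists used in the figures.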

After analyzing the top 20 most frequent terms, we extracted information about the problems, reasons, motivations, effects, and weapons used in terrorist attacks. We show the results in Figures 2-11, where the x-axis shows the frequencies of the terms. In the following discussion we ignore terms such as “unknown” and “type” that are not content rich.

Figure 2 shows the term frequencies from data involving attacks on all nations over the four decades. Weapons such as “explosive”, “bombs”, “fireball”, and “dynamite” occur most frequently. Locations such as “Iraq” and “Southampton” feature prominently. The cell phone company “Asiacell” occurs frequently.

Figure 3 shows the term frequencies in the descriptions of attacks on the United States over all four decades. As can be seen from Figure 3, “bombings” is the most frequently occurring term in the attacks on the US over this period. The term “abortionist” also has a significant frequency.

Fig. 2. Attacks on all nations over Four Decades

Fig. 3. Attacks on the United States over Four Decades

In the attack data from the 1970s, both “bombs” and “explosives” occur prominently. However, “abortionist” does not appear as a frequently occurring term, as shown in Figure 5. Comparing the attack data from the 1970s for the US with those for the entire world (Figure 4), we find that the terms “black” and “white” occur only in the US data and not in the rest of the world, which might indicate race issues in the US. On the other hand, “firearms” occurs frequently in the attack data for the world, but not in the US data for the 1970s.

Fig. 4. Attacks on all nations in the 1970’s

Fig. 5. Attacks on the United States in the 1970’s

In the 1980s, the terms “revolutionary”, “assassination”, and “dynamite” occur frequently in the world data (Figure 6), but these are absent in the data for the US. In contrast, for the US in the 1980s, the terms “firearms” and “sabotage” feature more prominently than “bombing”; the term “jewish” also appears, as can be seen in Figure 7.

In the 1990s data (Figure 8), we find “explosives” featuring most prominently. In addition, terms such as “dynamite”, “bombs”, and “firearms” also occur frequently. The term “Africaine” (African in French) occurs prominently for the first time. Locations such as “Southampton” and “Northboro” indicate incidents in England. Other countries such as “Armenia” and “Colombia” start appearing prominently for the first time.

Fig. 6. Attacks on all nations in the 1980’s

Fig. 7. Attacks on the United States in the 1980’s

In the 1990s, the Animal Liberation Front (“ALF”) occurs frequently in the data for the US, along with “animal”, “abortion”, and “liberation”, which indicates issues in various political movements at that time, as shown in Figure 9.

During the 2000s, “explosions” and “bombings” occur much more frequently in the data on the whole world. New terms such as “Asiacell” (a cell phone company in Iraq) occur for the first time. “Armenia” continues to occur. Other terms indicating locations, such as “South Asia”, “India”, “Iraq”, and “Baghdad”, start to appear. The Islamist extremist group “Grouphizb” occurs during this time, as shown in Figure 10. Terms that occur most prominently in the data from the 2000s in the US include “incendiary” and “fire” (see Figure 11). Other terms such as “liberation” persist.

Fig. 8. Attacks on all nations in the 1990’s

Fig. 9. Attacks on the United States in the 1990’s

The following text summarizes the information that can be gleaned from the dataset. Although we have created this text manually, it can be generated using natural language generation techniques. The terms in boldface are extracted from the data shown in Figures 2-11.

“Terrorist attacks occurred in all the five continents, but more frequently in the following countries: United States, Iraq, Armenia, Colombia, India, and locations such as Southampton and Baghdad. Moreover, many attacks took place in Europe and in the Caribbean. Bombs, dynamite, and firearms are most commonly used in terrorist attacks. Asiacell was a victim of many terrorist attacks. In addition to damages caused to innocent civilians, businesses suffered the most in terrorist attacks. Some perpetrators remained unknown. In the 1980s, many assassinations occurred. In the 2000s, the group hizb featured in many attacks. In the United States, abortion was one of the major causes of terrorist attacks, especially in the 1980s and 1990s. Most of these attacks occurred in New York and California. Perpetrators put arson and sabot equipment in clinics. In the 1970s, in Washington, San Francisco, Illinois and California, ethnicity issues based on black and white were a major problem and caused many attacks. In the 1990s and 2000s, terrorists attacked mostly for respect of their rights and for liberation. ALF (Animal Liberation Front) was responsible for many attacks in the 1990s; on the other hand, many attacks involved the Jewish people. Moreover, in general in the United States, minorities struggled for their rights. Finally, bombs, dynamite, firearms, and incendiary were the most frequent weapons used in terrorist attacks.”

Fig. 10. Attacks on all nations in the 2000’s

Fig. 11. Attacks on the United States in the 2000’s

V. CONCLUSION

In this paper we have described a novel method based on Latent Dirichlet Allocation (LDA) to analyze text data on terrorist attacks that occurred between 1970 and 2010. We segmented the data set into decadal subsets, and then performed LDA analysis on each decade as well as on the entire data set. The results of this study are important for homeland security because they can inform decisions on dealing with future terrorist attacks. Moreover, they provide information on the reasons behind these attacks.

Our proposed method can be applied to any text data to generate topic models as patterns and, subsequently, the terms that occur frequently in those topic models. A simple term-frequency based method using the raw text will not provide this information. Such information can be useful in other domains, such as patents and product documentation, for agencies such as the Food and Drug Administration (FDA) and patent offices that have to deal with large corpora of text. Methods such as these are also useful in domains such as healthcare, where an abundance of text data exists.

One limitation of our proposed method is that in the last step we generate the summary text manually. We are currently working on automating the generation of the summary text from the output of our proposed algorithm.

REFERENCES

[1] “Study of terrorism and responses to terrorism, obtained from http://www.start.umd.edu/start/.”

[2] F. Ozgul, Z. Erdem, and C. Bowerman, “Prediction of past unsolved terrorist attacks,” in International Conference on Intelligence and Security Informatics, ISI ’09. IEEE, 2009, pp. 37–42.

[3] M. Shaikh and W. J., “Investigative data mining: Identifying key nodes in terrorist networks,” in Proceedings of the Multitopic Conference, 2006. INMIC ’06. IEEE, 2006, pp. 23–24.

[4] D. M. Blei, A. Ng, and M. Jordan, “Latent dirichlet allocation,” JMLR, vol. 3, pp. 993–1022, 2003.

[5] D. M. Blei, “Introduction to probabilistic topic models,” Communications of the ACM, 2011. [Online]. Available: http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf

[6] M. Steyvers and T. Griffiths, Probabilistic topic models. Springer, 2006, pp. 187–210.

[7] T. Griffiths, M. Steyvers, and J. Tenenbaum, “Topics in semantic representation,” Psychological Review, vol. 114, no. 2, pp. 211–244, 2007.

[8] A. Cherkaev, Variational Methods for Structural Optimization, ser. Applied mathematical sciences, 2000, vol. 140. [Online]. Available: http://www.math.utah.edu/book/vmso

[9] D. Blei, “http://www.cs.princeton.edu/~blei/lda-c/.”

[10] D. Galanis, G. Karakatsiotis, G. Lampouras, and I. Androutsopoulos, “An open-source natural language generator for owl ontologies and its use in protege and second life,” in Proceedings of the EACL 2009 Demonstrations Session, 2009, pp. 17–20.

[11] D. Zeimpekis and E. Gallopoulos, TMG: A MATLAB toolbox for generating term-document matrices from text collections. Springer, 2006, pp. 187–210.

[12] MySQL, “Stopwords. full-text stopwords, mysql documentation.”