
Open Source Social Media as Sensors for Enabling Government Identification, Prediction and Response Applications

Swati Agarwal, Ph.D. Scholar, IIIT-Delhi
Department of Computer Science and Engineering
E-mail: [email protected], Homepage: www.swati-agarwal.in
Advisor: Dr. Ashish Sureka, ABB Corporate Research, India
Co-Advisor: Dr. Vikram Goyal, Associate Professor, IIIT-Delhi, India

1 Introduction

The emergence of the Internet and Web 2.0 platforms has given people many features that allow them to post their opinions about anything. Social media platforms such as Tumblr¹, Twitter² (micro-blogging website) and YouTube³ (video sharing website) contain information that is publicly available, or open source. Open-Source Social Media INTelligence (OSSMINT) is a field comprising techniques and applications that analyze and mine these public posts to extract actionable information and useful insights [1]. Social media users leverage freedom of speech and low publication barriers to post their beliefs and opinions on many topics without revealing their identity [2, 3]. Research shows that several consumer-based applications can be built by mining open-source data available on social media platforms, for example by mining public reviews of products and services. However, the focus of the work presented in my dissertation is on novel applications and techniques of OSSMINT in the government sector. Figure 1 gives a high-level view of several OSSMINT applications that are useful for the government, local civil forces, and law enforcement agencies; these are the use-cases and applications I implemented as part of my Ph.D. dissertation. Because of anonymity, low publication barriers, wide reachability and global connectivity, people create posts and discuss several sensitive topics that are a major concern for the government and security analysts [4]. However, manually analyzing open-source data and gaining insights from it is infeasible due to the high velocity and volume of the tweets posted [1]. In particular, the study presented in my dissertation is motivated by the following facts:

1. Recently, government organizations have increasingly adopted social media not just for disseminating information but also for collecting information such as complaints and grievances from citizens (a phenomenon referred to as citizensourcing) [5][6][7]. However, because the type of each report must be identified manually before it can be resolved, many complaints remain unaddressed. I present solutions, tools, and techniques for analyzing data from a micro-blogging website in order to analyze citizen complaints, grievances and corruption reports in the public sector (the railway ministry of India, the income tax department, traffic police, city police, and the ministry of road, transport and highway).

2. The unexpected emergence of religion and faith among people has also led to discrimination and violence against rival religious groups [8]. People use various platforms (chat groups, forums, blogs, social media) to share their beliefs and opinions about their religion [1] and also to vent extremist and hateful views towards other religions [9][10]. Monitoring such content on social media has become an important problem for the government because it leads to hate promotion, recruitment into online radicalization communities, and unrest [4, 2, 11]. My dissertation presents solutions that enable law enforcement agencies to combat online radicalization and extremism by mining data from the Tumblr, Twitter and YouTube websites. I further present solutions for identifying contrast in the different opinions and emotions (defensive, harsh, outrageous and naive) expressed on religious posts, which lead to religious conflicts within the social media community.

¹ https://www.tumblr.com
² https://twitter.com
³ https://www.youtube.com

PhD Synopsis Submission, Swati Agarwal (2017), IIIT-Delhi, India.

[Figure 1 content: content uploaded to popular open source social media websites (public issues and complaints, discussion about protests, hateful and persuasive comments) feeds three groups of open source social media intelligence based government applications: (a) automatic identification of fraud, bribe and corruption complaints and grievances, and of daily issues and challenges faced by public citizens; (b) early prediction of planning and mobilization of protest events and of the organizing of civil riots and unrest; (c) automatic identification and detection of radicalized content, users and communities, of recruitment and hate promotion on social media, and of religion or race differences and conflicts within society.]

Figure 1: Demonstrating Several Applications of Open Source Social Media Intelligence useful for the Government, Local Civil Force and Security Analysts (Law Enforcement Agencies)

[Figure 2 content: raw data passes through four blocks.
Pre-Processing and Enrichment: joint words expansion, spell error correction, acronym and slang treatment, language identification and translation, sentence segmentation.
Features Extraction and Selection: topic modeling, LIWC, n-grams, sentiment and emotion analysis, named entities, conceptual similarity, social and writing tendencies, spatio-temporal features, dimensionality reduction.
Machine Learning Techniques: Naïve Bayes, SVM, KNN, social network analysis, ensemble learning, probabilistic modeling, semi-supervised learning, clustering, decision trees, graph traversal.
Performance Evaluation: precision, recall, F-score, ROC curve, information visualization, performance boosting, statistics measures, cross validation.]

Figure 2: A General Framework Illustrating Various Data Mining and Information Retrieval Techniques used in my Work, Primarily Divided into 4 Categories: 1) Data Pre-Processing, Enhancement and Enrichment Techniques, 2) Features Extraction and Selection Techniques, 3) Machine Learning Techniques for Classification and Network Analysis and 4) Performance Evaluation Metrics

3. Research shows that various people and organizations use Twitter as a platform to share information on events related to civil unrest [12][13]. In countries such as the USA and Australia, where protests are legal, early detection or forecasting of such events is valuable for government, tourism and law enforcement agencies [1][14]. My dissertation also describes my work on analyzing Twitter data for early prediction of civil unrest and protest.

Analyzing social media data is technically challenging because of issues such as incorrect grammar, spelling mistakes, term obfuscation, and the use of abbreviations and short forms. In my dissertation, I present several techniques for data pre-processing, text classification, term obfuscation detection and information extraction to overcome the noisy data problem. Figure 2 shows a block diagram of the data mining and machine learning techniques I use to enrich the raw experimental datasets, extract features, and perform the classification and prediction tasks needed to build the desired applications. The central component of my proposed solution approach is the application of information retrieval and machine learning based techniques and algorithms. My study experiments with a wide range of machine learning algorithms such as Support Vector Machines, Naive Bayes, Random Forest and Decision Trees. I also employ techniques such as boosting and ensemble classifiers to improve the accuracy of the baseline statistical models. I make the processed datasets used in my experiments publicly available so that other researchers can replicate my experiments and benchmark against my proposed techniques. Data visualization is one of the major components of data analysis and interpretation. The study employs several basic and advanced data visualization techniques to present data in an intuitive manner for the end user.
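As a minimal illustration of the pre-processing and enrichment stage described above, the sketch below expands a CamelCase hashtag and replaces slang tokens. It is not the dissertation's implementation: the `SLANG` lexicon, the function names and the splitting rule are assumptions made only for this example, and spell correction is left out.

```python
import re

# Hypothetical slang/abbreviation lexicon (assumption); a real system needs a far larger one.
SLANG = {"u": "you", "plz": "please", "govt": "government", "ppl": "people"}


def expand_hashtag(tag: str) -> str:
    """Split a CamelCase hashtag such as '#PotHoleAlert' into separate words."""
    body = tag.lstrip("#")
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", body)
    return " ".join(words) if words else body


def enrich(tweet: str) -> str:
    """Lower-case a tweet, expand its hashtags and replace known slang tokens."""
    tokens = []
    for token in tweet.split():
        if token.startswith("#"):
            tokens.extend(expand_hashtag(token).lower().split())
        else:
            word = token.lower().strip(".,!?")
            tokens.append(SLANG.get(word, word))
    return " ".join(t for t in tokens if t)


print(enrich("Plz fix #PotHoleAlert near station, govt!"))
```

The enriched text can then be passed to feature extraction; the order of the steps here (hashtags first, slang second) is a design choice of the sketch, not a claim about the original pipeline.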


2 Related Work and Research Contributions

Figure 1 provides an overview of the government applications and use-cases in the domain of open-source social media intelligence (OSSMINT) implemented in my dissertation. Figure 2 lists the machine learning, information retrieval and data mining based tools and techniques used in my solution approaches. In this section, I present work closely related to my dissertation and list my specific novel research contributions in the context of existing work. My dissertation contributes on both the application side (novel use-cases and applications) and the algorithm side (novel techniques and methods).

1. Mining Twitter to Extract Information on Public Citizens’ Complaints and Grievances: Previous studies [5, 15, 16, 17, 18] conduct empirical analyses of online and offline data and examine the use of Twitter for client feedback and sentiment about the services provided by police departments. Similarly, some studies [19, 20, 21, 22, 23, 24] analyze Twitter streams and traffic news websites to build early-prediction models for road safety hazards and real-time traffic congestion. In contrast, I address the challenge of free-form text in tweets and capture the dependencies between noisy text and semantics [25]. I propose a text-analysis-based ensemble classifier for identifying complaint and grievance reports posted by citizens to the official Twitter handles of public agencies and the Indian government. I propose contextual and linguistic features that indicate the relation between a complaint and the department concerned. I also identify features that are strong indicators for differentiating a complaint report from non-complaint tweets [25]. A similar idea is explored in [26], which investigates the efficacy of spatial (geographical location metadata) and linguistic features for discovering insights from less informative complaints about killer roads.

2. Mining Social Media to Identify Radicalized and Racist Content, Users and Communities: The current state of the art reveals that text classification (automatic and semi-supervised learning), clustering (unsupervised learning), exploratory data analysis and keyword-based flagging are the approaches commonly used to identify extremist content on social media [27, 28]. Link analysis, in turn, is used to crawl through navigation links, identify similar users and locate hidden communities on social media websites [29, 30]. In this work, I investigate a topical-crawler-based approach for locating extremist content and bloggers on various social media platforms (Twitter, Tumblr, and YouTube). The specific contribution of this work is to demonstrate the effectiveness of contextual metadata such as the body (or description), tags and caption (or title) of a post (text or video) [2, 4, 31, 32, 33]. I conduct a series of experiments to train an n-gram model and one-class SVM and KNN classifiers [32, 33] and test their effectiveness on the given classification task. I further demonstrate the effectiveness of various relationships among users (e.g. like, reblog, follower, subscription) and of graph traversal algorithms (e.g. best-first search, shark search and random walk) for identifying hidden communities [2, 4, 31].

3. Intent Based Classification of Racist and Radicalized Posts: I further extend the identification of radicalized and racist posts to include not only the content of the post but also the intention or objective of the author. Prior studies most commonly use quantitative text analysis and keyword-spotting approaches to build identification and predictive models that detect persuasion behavior and differentiate radical groups from non-radical groups [34, 35]. However, the dependency on user behavior and the author’s personality traits has not been considered in previous studies. I contribute to the current state of the art in intent classification of racist posts by 1) identifying the focused and targeted topic of posts; 2) emphasizing negative target-specific emotions (e.g. anger, disgust, sadness, joy, fear), social tendencies (personality traits of the Big Five personality theory⁴, e.g. openness, conscientiousness, extraversion, agreeableness and emotional range) and the author’s language and writing cues (e.g. analytical, confident and tentative styles of writing); and 3) identifying the semantic role of each term present in the content and the hidden phrases that play a major role in the post [10, 36].

4. Leveraging the Power of Social Media for Identifying Religious Conflicts within Society: The current state of the art reveals that social science researchers have relied on offline surveys to identify religious conflicts within society [37, 38]. However, existing work has largely ignored the immense amount of data available on social media in the form of comments, communications and discussions, and surveys have three limitations: 1) the subjectivity of the opinions in the surveys, 2) generalized claims of conflict and 3) the identity of the people participating in the surveys. I extend my previous work [10] by identifying linguistic features of social media posts, such as emotions, social mentions, perceptual process, certainty, time orientation and personal concerns, that are discriminatory for identifying contrast in different opinions on religious posts [9]. I propose a multi-class semi-supervised classifier across various dimensionality reduction techniques (e.g. principal component analysis, attribute selection correlation) for classifying social media posts into various dimensions of conflict such as information sharing, query, off-topic (not a religion-based post), disbelief, defensive, annoyance, insult, disappointment, sarcasm, ashamed and disgust [3].

⁴ http://pages.uoregon.edu/sanjay/bigfive.html

5. Detecting Term Obfuscation in Adversarial Communication: Probabilistic or distributional models, Sentence Oddity Measures (SMO) and Pointwise Mutual Information (PMI) are the approaches commonly used to identify out-of-context terms in a given sentence [39, 40, 41]. However, the major limitation of the existing approaches is that they can flag a suspicious sentence only if the first noun of the sentence is substituted, so the substituted term is already known to the analysts. I instead treat a sentence as a bag of words and investigate the application of ConceptNet, a lexical resource and common-sense knowledge base, for identifying any term that has been obfuscated in the sentence [42].

6. Social Media as Sensors for Predicting Civil Unrest Events: Keyword-based flagging, probabilistic models, clustering, named entity recognizers, logistic regression, and dynamic query expansion are the techniques commonly used to predict upcoming events related to civil unrest or protest [12, 14, 43, 44]. I address the challenge of noisy content in real-time stream data by performing content-based characterization and semantic enrichment of raw tweets in order to classify microposts related to a given protest or act of civil disobedience as crowd-buzz & commentary or mobilization & planning [11]. I investigate the efficiency of spatiotemporal features and named entities for predicting a civil protest event. I further enhance my baseline approach by using trend analysis of the data (captured along a sliding window) for early prediction [11].

7. Publishing Research Datasets: To conduct my experiments, I collect open-source datasets using website APIs and web crawling. I publish several of my datasets for the research community so that my results can be used for benchmarking, comparison and further extension. Agarwal et al. [45] contains the largest dataset of Tumblr posts and bloggers collected for identifying religious conflicts on social media [3, 9]. Agarwal et al. [46] contains the semantically analyzed metadata of Tumblr posts and bloggers collected for identifying intent-based racist and radicalized posts on social media [10]. Agarwal et al. [47] contains an enhanced dataset of tweets posted to Indian public agencies’ accounts on Twitter, collected for extracting information on citizens’ complaints and grievances [25].

3 Applications, Approach and Results

In this section, I present a high-level overview of each paper published during my PhD. The subsections follow the same order as the research contributions and cover the key motivation, application, proposed solution approach and results of each paper.

3.1 Got a Complaint? Keep Calm and Tweet It! [25]

Research shows that many public service agencies use Twitter to share information and reach out to the public. Recently, Twitter has also been used as a platform for collecting complaints from citizens and resolving them in a timely and efficient manner. However, due to the dynamic nature of the website and the presence of free-form text, manual identification of complaint posts is overwhelmingly impractical. We formulate complaint identification as an ensemble classification problem. We apply several text enrichment processes, such as hashtag expansion, spell correction and slang conversion, to raw tweets before identifying linguistic features. We implement a one-class SVM classifier and evaluate the performance of various kernel functions for identifying complaint tweets. To conduct our experiments, we collect data from the official Twitter handles of 4 Indian public agencies. We select @DelhiPolice (Delhi Police), @dtptraffic (Delhi Traffic Police), @RailMinIndia (Railway Ministry of India) and @IncomeTaxIndia (Income Tax Department of India) because each department deals with different types of complaints, which lets us show the generalizability of our approach. Our results show that the linear-kernel SVM outperforms the polynomial and RBF kernel functions, and the proposed approach classifies complaint tweets with an overall precision of 76%. We boost the accuracy of our approach by building an ensemble of all three kernels. The results show that the one-class parallel ensemble SVM classifier outperforms cascaded ensemble learning by a margin of approximately 20%. By comparing the performance of each kernel against the ensemble classifier, we provide an efficient method to classify complaint reports.
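The difference between a parallel and a cascaded ensemble can be sketched as follows. The combination rules are assumptions, since the synopsis does not state them: here the parallel ensemble takes a majority vote over all members, while the cascade rejects a tweet as soon as one stage rejects it. The three rule-based functions are hypothetical stand-ins for the kernel-specific one-class SVMs, used only to keep the example self-contained.

```python
from typing import Callable, Sequence

Classifier = Callable[[str], bool]  # returns True when the tweet looks like a complaint


def parallel_vote(classifiers: Sequence[Classifier], text: str) -> bool:
    """Run every classifier independently and return the majority decision."""
    votes = sum(1 for clf in classifiers if clf(text))
    return votes * 2 > len(classifiers)


def cascade(classifiers: Sequence[Classifier], text: str) -> bool:
    """Pass the text through the classifiers in order; reject at the first 'no'."""
    for clf in classifiers:
        if not clf(text):
            return False
    return True


# Hypothetical stand-ins for the linear, polynomial and RBF kernel models.
def has_complaint_word(text: str) -> bool:
    return any(word in text.lower() for word in ("bribe", "delay", "broken"))


def mentions_agency(text: str) -> bool:
    return "@" in text


def is_long_enough(text: str) -> bool:
    return len(text.split()) >= 5


members = [has_complaint_word, mentions_agency, is_long_enough]
tweet = "@RailMinIndia train delay again"
print(parallel_vote(members, tweet), cascade(members, tweet))
```

Under these assumed rules the cascade is the stricter of the two, because a single dissenting stage is enough to reject a tweet that the majority vote would still accept.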

3.2 Mining Twitter to Extract Information on Killer Roads [26]

In this paper, we extend the idea presented in our previous study, Mittal et al. [25], by extracting useful information and insights from killer road complaints. Complaints about killer roads contain information about road irregularities and other issues that cause high risk and discomfort to citizens. We investigate the efficacy of spatial, contextual and linguistic features for identifying useful information in complaint posts. We build a text-analysis-based model to enrich the spatial features (geographical location metadata) of a tweet, which can be used to discover insights from less informative reports. We propose a content-disambiguation model to enrich our linguistic features for identifying the exact problem reported in the tweets. Our results show that the proposed approach is effective and that syntactic enrichment of tweets is an important phase for boosting the accuracy of feature extraction and classification.

3.3 A Focused Crawler for Mining Hate and Extremism Promoting Users, Content and Communities on Social Media [2, 4, 31, 48]

Online social media platforms such as YouTube (the most popular video sharing website) and Tumblr (the second most popular micro-blogging website) contain many posts and users promoting hate and extremism. Because of the low barrier to publication and anonymity, social media websites are misused by some users and communities to create negative posts that disseminate hatred against a particular religion, country or person. Manual identification of such posts and communities is overwhelmingly impractical because of the large number of posts and blogs published every day. We formulate the identification of such malicious content as a search problem and present a focused-crawler-based approach whose components are: a search strategy or algorithm, a node similarity computation metric, learning from exemplary profiles that serve as training data, a stopping criterion, a node classifier, and a queue manager. We implement three versions of the focused crawler, using the best-first search, shark search, and random walk graph traversal algorithms. We conduct a series of experiments on a large real-world dataset, varying the seed channel, the number of n-grams in the language-model-based comparison and the similarity threshold for the classifier, and present the results using standard Information Retrieval metrics such as precision, recall and F-measure. The accuracy of the proposed solution on the sample YouTube dataset is 69% for best-first search and 74% for shark search, while the random walk algorithm achieves an F-score of 0.80 on Tumblr data. We further present the results of applying Social Network Analysis based measures to extract communities and identify core and influential users.
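The best-first variant of such a focused crawler can be sketched over a toy in-memory graph. This is an illustration under stated assumptions rather than the published system: the similarity measure is a plain Jaccard overlap of word n-grams standing in for the language-model comparison, and the graph, profile texts, threshold and `max_visits` stopping rule are all hypothetical.

```python
import heapq


def ngram_set(text: str, n: int = 1) -> set:
    """Word n-grams of a lower-cased string, as a set of tuples."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def similarity(a: str, b: str, n: int = 1) -> float:
    """Jaccard overlap of word n-grams (assumed stand-in for the node similarity metric)."""
    x, y = ngram_set(a, n), ngram_set(b, n)
    return len(x & y) / len(x | y) if (x | y) else 0.0


def best_first_crawl(graph, profiles, seed, exemplar, threshold=0.3, max_visits=10):
    """Visit the most promising node first and expand only nodes classified as relevant."""
    frontier = [(-1.0, seed)]              # queue manager: max-heap via negated scores
    seen, relevant, visits = {seed}, [], 0
    while frontier and visits < max_visits:    # stopping criterion
        _, node = heapq.heappop(frontier)
        visits += 1
        if similarity(profiles[node], exemplar) < threshold:
            continue                       # node classifier: irrelevant, do not expand
        relevant.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                score = similarity(profiles[neighbor], exemplar)
                heapq.heappush(frontier, (-score, neighbor))
    return relevant


profiles = {
    "a": "hate speech video channel",
    "b": "hate speech propaganda video",
    "c": "cooking recipes and food",
    "d": "hate propaganda channel",
}
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(best_first_crawl(graph, profiles, "a", "hate speech propaganda channel"))
```

Swapping the priority queue for a uniformly random choice among frontier nodes would turn the same skeleton into a random-walk style traversal; shark search additionally propagates a decayed parent score to children, which this sketch omits.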

3.4 Using kNN and SVM based One-Class Classifier for Detecting Online Radicalization on Twitter [32, 33]

Twitter is the largest and most popular micro-blogging website on the Internet. Because of its low publication barrier, anonymity and wide penetration, Twitter has become an easy platform for extremists to disseminate their ideologies and opinions by posting tweets that promote hate and extremism. Millions of tweets are posted on Twitter every day, and it is practically impossible for Twitter moderators or intelligence and security analysts to identify such tweets, users, and communities manually. However, automatic classification of tweets into pre-defined categories is a non-trivial problem because of the short text of a tweet (at most 140 characters) and its noisy content, e.g. incorrect grammar, spelling mistakes, and standard and non-standard abbreviations and slang. We frame the detection of hate and extremism promoting tweets as a one-class or unary-class categorization problem, learning a statistical model from a training set that contains only objects of one class. We propose several linguistic features, such as the presence of terms related to war and religion, negative emotions and offensive terms, to discriminate hate and extremism promoting tweets from other tweets. We employ a single-class SVM and a KNN algorithm for the one-class classification task. We conduct a case study on jihad, perform a characterization study of the tweets and measure the precision and recall of the machine-learning-based classifier. Experimental results on a large real-world dataset demonstrate that the proposed approach is effective, with F-scores of 0.60 and 0.83 for the KNN and SVM classifiers respectively.
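One common way to use nearest neighbours for one-class classification is to accept a sample when it lies close enough to the training examples of the target class. The sketch below uses that rule; it is an assumption and not necessarily the variant used in [32, 33]. The two-dimensional feature vectors (for instance, counts of offensive and war-related terms), the value of `k` and the distance threshold are hypothetical.

```python
import math


def kth_neighbor_distance(point, training, k):
    """Euclidean distance from `point` to its k-th nearest training vector."""
    distances = sorted(math.dist(point, sample) for sample in training)
    return distances[k - 1]


def one_class_knn(point, training, k=2, threshold=1.5):
    """Accept `point` as the target class if its k-th neighbour is within `threshold`."""
    return kth_neighbor_distance(point, training, k) <= threshold


# Training set contains only target-class examples (hypothetical feature counts).
target_examples = [(3, 2), (4, 2), (3, 3), (4, 3)]
print(one_class_knn((4, 4), target_examples))  # close to the training examples
print(one_class_knn((0, 0), target_examples))  # far from the training examples
```

Because only the target class is seen during training, the threshold plays the role that the decision boundary between two classes would play in an ordinary classifier and has to be tuned on held-out data.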

3.5 Intent Classification of Racist Posts on Tumblr [10]

Research shows that many like-minded people use popular microblogging websites to post hateful speech against various religions and races. Automatic identification of racist and hate-promoting posts is required for building social media intelligence and security informatics based solutions. However, keyword-spotting techniques alone cannot accurately identify the intent of a post. In this paper, we address the challenge of ambiguity in such posts by identifying the intent of the author. We conduct our study on the Tumblr microblogging website and develop a cascaded ensemble learning classifier for identifying posts with racist or radicalized intent. We train our model on various semantic, sentiment and linguistic features identified from free-form text. Our experimental results show that the proposed approach is effective and that the emotion tone, social tendencies, language cues and personality traits of a narrative are discriminatory features for identifying the racist intent behind a post.

3.6 Investigating the Dynamics of Religious Conflicts by Mining Public Opinions on Social Media [3]

The powerful emergence of religious faith and beliefs within political and social groups, which now leads to discrimination and violence against other communities, has become an important problem for the government and law enforcement agencies. In this paper, we address the challenges and gaps of offline surveys by mining the public opinions, sentiments and beliefs shared about various religions and communities. Because its posts are descriptive, we conduct our experiments on Tumblr, the second most popular microblogging service. Based on our survey among 3 different groups of 60 people, we define 11 dimensions of public opinion and belief that can identify the contrast of conflict in religious posts. We identify various linguistic features of Tumblr posts using topic modeling and Linguistic Inquiry and Word Count (LIWC). We investigate the efficiency of dimensionality reduction techniques and semi-supervised classification methods for classifying the posts into the various dimensions of conflict. Our results reveal that linguistic features such as emotions, language variables, personality traits, social process and informal language are discriminatory features for identifying the dynamics of conflict in religious posts.
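Word-count features of the kind LIWC produces can be approximated, for illustration, by counting how many tokens of a post fall into each category of a lexicon. The categories and word lists below are hypothetical and tiny; LIWC itself is a separate, much larger dictionary that is not reproduced here.

```python
from collections import Counter

# Hypothetical category lexicon (assumption); not the LIWC dictionary.
LEXICON = {
    "anger": {"hate", "angry", "furious"},
    "certainty": {"always", "never", "definitely"},
    "social": {"we", "they", "friend"},
}


def category_profile(post: str) -> dict:
    """Return, per category, the fraction of tokens in the post that belong to it."""
    tokens = [token.strip(".,!?").lower() for token in post.split()]
    tokens = [token for token in tokens if token]
    counts = Counter()
    for token in tokens:
        for category, words in LEXICON.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # avoid division by zero for empty posts
    return {category: counts[category] / total for category in LEXICON}


print(category_profile("They always hate us!"))
```

Each post then becomes a fixed-length vector of category fractions, which is the form that dimensionality reduction and semi-supervised classifiers expect as input.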

3.7 Investigating the Application of Common-Sense Knowledge-Base for Identifying Term Obfuscation in Adversarial Communication [42]

Word obfuscation or substitution means replacing one word with another in a sentence to conceal the textual content or communication. Word obfuscation is used in adversarial communication by terrorists or criminals to convey their messages without being red-flagged by the security and intelligence agencies that intercept or scan messages (such as emails and telephone conversations). ConceptNet is a freely available semantic network represented as a directed graph whose nodes are concepts and whose edges are common-sense assertions about these concepts. We present a solution approach that exploits the vast amount of semantic knowledge in ConceptNet to address the technically challenging problem of word substitution in adversarial communication. We frame the problem as a textual reasoning and context inference task and use ConceptNet’s natural-language-processing toolkit to determine word substitution. We use ConceptNet to compute the conceptual similarity between any two given terms and define a Mean Average Conceptual Similarity (MACS) metric to identify out-of-context terms. The test-bed for evaluating our proposed approach consists of the Enron email dataset (over 600000 emails generated by 158 employees of Enron Corporation) and the Brown corpus (about a million words drawn from a wide variety of sources). We generate a test dataset by implementing the word substitution techniques used in previous studies and conduct a series of experiments on it to evaluate our approach. Experimental results reveal that the proposed approach is effective.
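The idea behind flagging an out-of-context term with pairwise conceptual similarity can be sketched as below. The exact definition of MACS is not given in this synopsis, so the scoring rule here (the mean similarity of each term to the remaining terms, with the lowest-scoring term flagged) is only one plausible reading. The similarity values are invented for the example; in the paper they come from ConceptNet.

```python
# Hypothetical pairwise similarity scores (assumption); real scores would come from ConceptNet.
SIMILARITY = {
    frozenset(("flower", "explode")): 0.05,
    frozenset(("flower", "airport")): 0.05,
    frozenset(("explode", "airport")): 0.40,
}


def conceptual_similarity(a: str, b: str) -> float:
    """Look up the similarity of two terms; unknown pairs default to 0.0."""
    return SIMILARITY.get(frozenset((a, b)), 0.0)


def mean_similarity(term: str, others: list) -> float:
    """Average similarity of `term` to every other term in the sentence."""
    return sum(conceptual_similarity(term, other) for other in others) / len(others)


def least_related_term(terms: list) -> str:
    """Return the term whose mean similarity to the rest of the bag of words is lowest."""
    scores = {
        term: mean_similarity(term, [other for other in terms if other != term])
        for term in terms
    }
    return min(scores, key=scores.get)


# Content words of "the flower will explode at the airport" (toy substitution example).
print(least_related_term(["flower", "explode", "airport"]))
```

Treating the sentence as a bag of words means every content term is scored, so the check does not depend on the substituted word being in a particular position.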

3.8 Investigating the Potential of Aggregated Tweets as Surrogate Data for Forecasting Civil Protests [11]

Micro-blogging social media websites like Twitter are used as a real-time platform for information sharing and communication during the planning and mobilization of civil unrest events. We conduct a study of more than 1.5 million English tweets spanning 5 months on the topic of immigration and find evidence of Twitter being used as a platform for planning and mobilizing protests and demonstrations. We believe that Twitter data can be used as a surrogate and open-source precursor for forecasting civil unrest, and we investigate machine learning based techniques for building a prediction model. Our solution approach consists of several components: location and temporal expression extractors, named-entity recognizers, planning & mobilization and crowd-buzz & commentary classifiers, and a location-time-topic correlation miner. We conduct a series of experiments on a large real-world dataset and demonstrate the effectiveness of our approach.
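A simple form of sliding-window trend analysis over daily tweet counts can be sketched as follows. The rule (raise an alert when a day's count of planning and mobilization tweets reaches a multiple of the mean over the preceding window) and its parameters are assumptions for illustration; the synopsis does not specify the trend model used in [11], and the counts are made up.

```python
def first_alert_day(counts, window=3, ratio=2.0):
    """Return the index of the first day whose count is at least `ratio` times
    the mean of the preceding `window` days, or None if no day qualifies."""
    for day in range(window, len(counts)):
        baseline = sum(counts[day - window:day]) / window
        if baseline > 0 and counts[day] >= ratio * baseline:
            return day
    return None


# Hypothetical daily counts of tweets classified as planning & mobilization.
daily_counts = [10, 12, 11, 13, 30, 45]
print(first_alert_day(daily_counts))
```

A shorter window reacts faster to a surge but is noisier, so in practice the window length and ratio would be tuned against known protest dates.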

4 Conclusions

I conclude that the open-source and publicly available data on social media platforms like Tumblr, Twitter and YouTube provides immense potential and opportunities for government agencies in terms of mining the large volumes of user-generated unstructured data and extracting actionable information from it. In my dissertation, I identified a few use cases and applications where government agencies are the end-users and developed them. The applications developed by me span several topics such as (1) hate and extremism detection, (2) forecasting civil protest and mobilization, (3) religious conflict detection, and (4) mining citizen complaints and grievances. I draw several conclusions from each of the applications developed by me. Following are some of the main conclusions:

1. Due to the free-form nature of user-generated text, it is important to make the text noise-free before identifying linguistic features. The proposed text enrichment algorithm is a generalized approach and can be applied to any type of textual content [25].

2. Despite the presence of noise and ambiguity in content, linguistic features are discriminatory features for identifying the dynamics of religious conflicts. Furthermore, identifying the topic prior to identifying the linguistic features can be used to disambiguate the sentiments of the author about a specific religion [9, 3].

3. Linguistic and contextual features can be used for identifying complaint report tweets. There are various features that are strong indicators that a tweet is certainly not a complaint report. A one-class ensemble learning technique can be effectively used to separate complaint and grievance tweets from unknown or non-complaint tweets [25].

4. Regardless of the structure of a sentence (formal, informal or free-form text), conceptual and semantic similarity measures can be used to identify out-of-context terms in a sentence. A commonsense knowledge-base and lexical resources can be used to compute the conceptual similarity and identify any word that has been substituted in a sentence.

5. While most existing studies are limited to sentiment analysis emphasizing negative radicalization, racism-specific emotions (e.g., joy, sadness, anger, disgust, fear), authors' Big Five personality traits and language writing cues can be used to identify posts created with foul intent [10].

References

[1] Swati Agarwal, Ashish Sureka, and Vikram Goyal. Open source social media analytics for intelligence and security informatics applications. In Big Data Analytics, pages 21–37. Springer, 2015.

[2] Swati Agarwal and Ashish Sureka. A topical crawler for uncovering hidden communities of extremist micro-bloggers on tumblr. In 5th International Workshop on Making Sense of Microposts, Big things come in small packages (Microposts), Co-located with WWW, Florence, Italy, 2015.

[3] Swati Agarwal and Ashish Sureka. Mining public opinions on micro-blogging websites for identifying religious conflicts within society. In The Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Jeju, South Korea. Springer, 2017.

[4] Swati Agarwal and Ashish Sureka. A focused crawler for mining hate and extremism promoting users, videos and communities on youtube. In Proceedings of 25th ACM Conference on Hypertext and Social Media (HT), Santiago, Chile, 2014.

[5] Vanessa Frias-Martinez, Abson Sae-Tang, and Enrique Frias-Martinez. To call, or to tweet? understanding 3-1-1 citizen complaint behaviors. In ASE BigData/SocialCom/CyberSecurity Conference, 2014.

[6] Gohar Feroz Khan, Bobby Swar, and Sang Kon Lee. Social media risks and benefits: a public sector perspective. Social Science Computer Review, 32(5):606–627, 2014.

[7] Euripidis Loukis, Yannis Charalabidis, and Aggeliki Androutsopoulou. Evaluating a Passive Social Media Citizensourcing Innovation, pages 305–320. Springer International Publishing, 2015.

[8] Greg Acciaioli. Grounds of conflict, idioms of harmony: custom, religion, and nationalism in violence avoidance at the lindu plain, central sulawesi. Indonesia, pages 81–114, 2001.

[9] Swati Agarwal and Ashish Sureka. A collision of beliefs: Investigating linguistic features for religious conflicts identification on tumblr. In 13th International Conference on Distributed Computing and Internet Technology (ICDCIT), India. Springer, 2017.

[10] Swati Agarwal and Ashish Sureka. But i did not mean it!- intent classification of racist posts on tumblr. In 6th IEEE European Intelligence & Security Informatics Conference (EISIC), Uppsala, Sweden. IEEE, 2016.

[11] Swati Agarwal and Ashish Sureka. Investigating the potential of aggregated tweets as surrogate data for forecasting civil protests. In ACM India SIGKDD Conference on Data Sciences (CoDS), Pune, India, 2016.

[12] Naren Ramakrishnan, Patrick Butler, Sathappan Muthiah, et al. ‘beating the news’ with embers: forecasting civil unrest using open source indicators. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.

[13] Liang Zhao, Feng Chen, et al. Unsupervised spatial event detection in targeted domains with applications to civil unrest modeling. PloS one, 2014.

[14] Sathappan Muthiah, Bert Huang, Jaime Arredondo, et al. Planned protest modeling in news and social media. In Proceedings of the 29th Association for the Advancement of Artificial Intelligence, AAAI ’15, Austin, Texas, USA, 2015.

[15] Thomas Heverin and Lisl Zach. Twitter for city police department information sharing. Proceedings of the American Society for Information Science and Technology, 2010.

[16] Megan Anderson, Kieran Lewis, and Ozgur Dedehayir. Diffusion of innovation in the public sector: Twitter adoption by municipal police departments in the us. In Portland International Conference on Management of Engineering and Technology, 2015.

[17] Albert Jacob Meijer and René Torenvlied. Social media and the new organization of government communications: an empirical analysis of twitter usage by the dutch police. The American Review of Public Administration, page 0275074014551381, 2014.

[18] Arthur Edwards and Dennis Kool. Webcare in Public Services: Deliver Better with Less?, pages 151–166. Springer International Publishing, Cham, 2015.

[19] Avinash Kumar, Miao Jiang, and Yi Fang. Where not to go?: detecting road hazards using twitter. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 1223–1226. ACM, 2014.

[20] Yiming Gu, Zhen Sean Qian, and Feng Chen. From twitter to detector: Real-time traffic incident detection using social media data. Transportation Research Part C: Emerging Technologies, 67:321–342, 2016.

[21] Kaiqun Fu, Rakesh Nune, and Jason X Tao. Social media data analysis for traffic incident detection and management. In Transportation Research Board 94th Annual Meeting, 2015.

[22] Eleonora D’Andrea, Pietro Ducange, Beatrice Lazzerini, and Francesco Marcelloni. Real-time detection of traffic from twitter stream analysis. IEEE Transactions on Intelligent Transportation Systems, 16(4):2269–2283, 2015.

[23] Axel Schulz, Petar Ristoski, and Heiko Paulheim. I see a car crash: Real-time detection of small scale incidents in microblogs. In Extended Semantic Web Conference, pages 22–33. Springer, 2013.

[24] Napong Wanichayapong, Wasawat Pruthipunyaskul, Wasan Pattara-Atikom, and Pimwadee Chaovalit. Social-based traffic information extraction and classification. In ITS Telecommunications (ITST), 2011 11th International Conference on, pages 107–112. IEEE, 2011.

[25] Nitish Mittal, Swati Agarwal, and Ashish Sureka. Got a complaint?- keep calm and tweet it! In 12th International Conference on Advanced Data Mining and Applications (ADMA), Gold Coast, Australia. Springer, 2016.

[26] Swati Agarwal, Nitish Mittal, and Ashish Sureka. Bad road conditions- mining twitter to extract information on killer roads. In The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery (ECML-PKDD). Springer, 2017.

[27] Anneli Botha. Assessing the vulnerability of kenyan youths to radicalisation and extremism. Institute for Security Studies Papers, pages 28–p, 2013.

[28] Shah Mahmood. Online social networks: The overt and covert communication channels for terrorists and beyond. In Homeland Security (HST), 2012 IEEE Conference on Technologies for, pages 574–579. IEEE, 2012.

[29] Yulei Zhang, Shuo Zeng, and Fan. Dark web forums portal: Searching and analyzing jihadist forums. In Proceedings of the 2009 IEEE International Conference on Intelligence and Security Informatics, pages 71–76, USA, 2009.

[30] Yilu Zhou, Jialun Qin, Guanpi Lai, Edna Reid, and Hsinchun Chen. Exploring the dark side of the web: Collection and analysis of us extremist online forums. In Intelligence and Security Informatics, pages 621–626. Springer, 2006.

[31] Swati Agarwal and Ashish Sureka. Topic-specific youtube crawling to detect online radicalization. In 10th International workshop on Databases in Networked Information Systems (DNIS), Fukushima, Japan, 2015.

[32] Swati Agarwal and Ashish Sureka. Learning to classify hate and extremism promoting tweets. In Proceedings of IEEE Joint Intelligence and Security Informatics Conference (EISIC+ ISI), the Hague, the Netherlands, 2014.

[33] Swati Agarwal and Ashish Sureka. Using knn and svm based one-class classifier for detecting on-line radicalization on twitter. In Proceedings of 11th International Conference on Distributed Computing and Internet Technology (ICDCIT), Odisha, India, 2015.

[34] Allison G Smith, Peter Suedfeld, Lucian G Conway III, and David G Winter. The language of violence: Distinguishing terrorist from nonterrorist groups by thematic content analysis. Dynamics of Asymmetric Conflict, 1(2):142–163, 2008.

[35] Sheryl Prentice, Paul J Taylor, Paul Rayson, Andrew Hoskins, and Ben O’Loughlin. Analyzing the semantic content and persuasive composition of extremist media: A case study of texts produced during the gaza conflict. Information Systems Frontiers, 13(1):61–73, 2011.

[36] Swati Agarwal and Ashish Sureka. Role of authors personality traits for identifying intent based racist posts. In 6th IEEE European Intelligence & Security Informatics Conference (EISIC), Uppsala, Sweden. IEEE, 2016.

[37] William R Swinyard, Ah-Keng Kau, and Hui-Yin Phua. Happiness, materialism, and religious experience in the us and singapore. Journal of happiness studies, 2(1):13–32, 2001.

[38] Johannes Vüllers and Birte Pfeiffer. Measuring the ambivalence of religion: Introducing the religion and conflict in developing countries (rcdc) dataset. International Interactions, 41(5):857–881, 2015.

[39] SW. Fong, D. Roussinov, and D.B. Skillicorn. Detecting word substitutions in text. IEEE Transactions on Knowledge and Data Engineering, 20(8):1067–1076, 2008.

[40] Sonal N. Deshmukh, Ratnadeep R. Deshmukh, and Sachin N. Deshmukh. Performance analysis of different sentence oddity measures applied on google and google news repository for detection of substitution. International Refereed Journal of Engineering and Science (IRJES), 3(3):20–25, 2014.

[41] Ben Allison, Sanaz Jabbari, and Louise Guthrie. Using a probabilistic model of context to detect word obfuscation. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), 2008.

[42] Swati Agarwal and Ashish Sureka. Using common-sense knowledge-base for detecting word obfuscation in adversarial communication. In Proceedings of Workshop on Future Information Security (FIS), co-located with COMSNETS, Bangalore, India, 2015.

[43] Jiejun Xu, Tsai-Ching Lu, Ryan Compton, and David Allen. Civil unrest prediction: A tumblr-based exploration. In Social Computing, Behavioral-Cultural Modeling and Prediction. Springer, 2014.

[44] Ryan Compton, Craig Lee, Tsai-Ching Lu, et al. Detecting future social unrest in unprocessed twitter data: emerging phenomena and big data. In ISI, 2013 IEEE International Conference On, 2013.

[45] Swati Agarwal and Ashish Sureka. Religious beliefs on social media: Large dataset of tumblr posts and bloggers consisting of religion based tags, mendeley data, v1, http://dx.doi.org/10.17632/8hp39rknns.1, 2016.

[46] Swati Agarwal and Ashish Sureka. Semantically analyzed metadata of tumblr posts and bloggers, mendeley data, v1, http://dx.doi.org/10.17632/hd3b6v659v.1, 2016.

[47] Swati Agarwal, Nitish Mittal, and Ashish Sureka. Enhanced dataset of citizen centric complaints and grievances on twitter, mendeley data, v1, http://dx.doi.org/10.17632/w2cp7h53s5.1, 2016.

[48] Swati Agarwal and Ashish Sureka. Spider and the flies: Focused crawling on tumblr to detect hate promoting communities. arXiv preprint arXiv:1603.09164, 2016.
