Sentiment and Affect analysis of Dark Web Forums: Measuring Radicalization on the Internet Hsinchun Chen, Fellow, IEEE

Introduction

• Web forums offer participants a medium to express their opinions and emotions freely in discussion.

• Extremist and terrorist groups also use web forums for community.

– Expression and dissemination of their ideologies and propaganda

• Such forums are often referred to as being part of the Dark Web.

Introduction

• Information contained within Dark Web forums represents a significant source of knowledge for security and intelligence organizations.

• The opinions and emotions expressed within these forums provide valuable insights:

– The nature and position of the online community

– Characterizing individual participants

• Manual analysis of the vast quantities of messages to measure the opinions and emotions expressed is often infeasible.

Introduction

• This paper presents an automated approach to sentiment and affect analysis of two Dark Web forums related to the Iraqi insurgency and Al-Qaeda.

• The automated approach utilizes a rich set of textual features and machine learning techniques.

Related Work

• Sentiment and affect analysis are related tasks in text mining that focus on directional text, containing opinions, emotions, and biases.

[5] M. A. Hearst, “Direction-based text interpretation as an information access refinement,” In Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval. Lawrence Erlbaum Associates, 1992.

[6] J. Wiebe, “Tracking point of view in narrative,” Computational Linguistics, vol. 20 (2), pg. 233-287, 1994.

Related Work

• Sentiment analysis attempts to identify, analyze, and measure opinions expressed in text.

• Affect analysis focuses on the emotional content of the communication.

R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu, “Mining newsgroups using networks arising from social behavior,” Proc. of the 12th Int’l WWW Conf., 2003.

P. Subasic and A. Huettner, “Affect analysis of text using fuzzy semantic typing,” IEEE Trans. Fuzzy Systems, vol. 9 (4), pg. 483-496.

Related Work

• There are some important distinctions between the two:

– Affect analysis evaluates the intensity of a number of potential emotions, including happiness, sadness, anger, fear, etc.

– Sentiment analysis considers the polarity of opinions along a positive-neutral-negative continuum.

– The words and phrases associated with sentiments are mutually exclusive.

– Segments of text can convey multiple affects

Related Work

• Researchers have utilized various machine learning approaches to perform automated sentiment and affect analysis.

B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? sentiment classification using machine learning techniques,” Proc. Empirical Methods in Natural Language Processing, pg. 79-86, 2002.

R. W. Picard, E. Vyzas, and J. Healey, “Toward machine emotional intelligence: analysis of affective physiological state,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23 (10), pg. 1179-1191, 2001.

Related Work

• The SVM learning approach has been shown to be particularly effective in determining whether a text segment expresses a given affect class.

• However, SVM handles only discrete class labels.

Y. H. Cho and K. J. Lee, “Automatic affect recognition using natural language processing techniques and manually built affect lexicon,” IEICE Trans. Information Systems, vol. E89 (12), pg. 2964-2971, 2006.

Related Work

• Support vector regression (SVR) is an alternative approach that is capable of predicting continuous sentiment and affect intensities while benefiting from the robustness of SVM.

A. Webb, Statistical Pattern Recognition. John Wiley & Sons, 2002.

Research Questions

• In a recent book, Ryan highlights the critical role that Web forums play in militant Islamic radicalization on the Internet.

• Marc Sageman, an internationally renowned terrorism study consultant, also emphasizes the importance of the internet, especially forums.

• This paper presents our web mining research on sentiment and affect analysis of two large-scale, internal Jihadist forums.

Research Questions

• This study seeks to answer the following research questions:

– How effective are automated methods of sentiment and affect analysis in measuring the polarities of opinions and intensities of emotions in Dark Web forums?

– What insights into the Dark Web forums are gained by performing sentiment and affect analysis?

Data

• Two Dark Web forums were selected for sentiment and affect analysis:

– Al-Firdaws (www.alfirdaws.org/vb)

– Montada (www.montada.com)

• Al-Firdaws

– A more radical forum, with considerable content dedicated to support of the Iraqi insurgency and Al-Qaeda.

• Montada

– Montada is a general discussion forum with content pertaining to a variety of social and religious issues.

– Domain experts consider Montada to be more moderate compared to Al-Firdaws, with less radical content.

Data

• Spidering programs were used to collect the content from the two web forums.

• A summary of the collection statistics is presented in Table I.

Data set is larger:

1. An older forum

2. Al-Firdaws is too radical

Data

• Both Al-Firdaws and Montada are major forums for their respective purposes and communities, with relatively high membership levels and numerous authors.

Data

• In both cases postings are more evenly distributed across web forum threads.

• Although the Montada forum has a larger average number of posts per thread compared to Al-Firdaws, the median number of posts per thread is nearly equal.
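
The gap between a larger mean and a nearly equal median usually points to a few very long threads. A minimal sketch of that effect, using made-up posts-per-thread counts (these are not the Table I figures):

```python
from statistics import mean, median

# Hypothetical posts-per-thread counts; illustrative only, not Table I data.
forum_a = [2, 3, 3, 4, 5, 6, 7]
forum_b = [2, 3, 3, 4, 5, 6, 70]  # identical except for one very long thread


def summarize(posts_per_thread):
    """Return the mean and median number of posts per thread."""
    return {"mean": mean(posts_per_thread), "median": median(posts_per_thread)}


summary_a = summarize(forum_a)
summary_b = summarize(forum_b)
print(summary_a, summary_b)
```

A single long thread pulls the mean up while the median stays where it was.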

Data

• 500 sentences were selected from each web forum, and scored for the intensities of sentiments and affects expressed.

• The affects of interest in the study were those of most interest to security and intelligence organizations: violence, anger, hate, and racism.

• These affects were measured on a continuous scale ranging from 0 to 1.

• The sentiment measurement was on a continuous scale from -1 to 1

Methods

• Annotation step

– Character, word, root, collocation n-grams

• Character and word n-grams are commonly used in text mining applications.

• To derive root level n-grams, Arabic words were converted to their roots using a clustering algorithm.

• Collocation n-grams included the hapax legomena and dis legomena collocations.

• Features with fewer than four occurrences in the test bed were excluded.
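
As a rough illustration of the character and word n-gram features and the frequency cutoff, here is a minimal sketch. The function names, the choice of character bigrams and word unigrams, and whitespace tokenization are all assumptions made for the example. Root and collocation n-grams are left out because they depend on the Arabic root clustering step.

```python
from collections import Counter


def char_ngrams(text, n):
    """All overlapping character n-grams in a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """All overlapping word n-grams, using whitespace tokenization (an assumption)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(sentences, min_count=4):
    """Keep only features that occur at least `min_count` times in the test bed."""
    counts = Counter()
    for sentence in sentences:
        counts.update(("char2", g) for g in char_ngrams(sentence, 2))
        counts.update(("word1", g) for g in word_ngrams(sentence, 1))
    return {feature for feature, count in counts.items() if count >= min_count}


# Toy test bed: one sentence repeated four times plus one rare word.
vocabulary = build_vocabulary(["the forum"] * 4 + ["rare"])
```

With `min_count=4`, a feature seen three times or fewer is dropped, which matches the exclusion of features with fewer than four occurrences.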

Methods

• The machine learning approach for identifying the presence and intensities of sentiments and affects in Dark Web forum sentences utilized an SVR ensemble.

• SVR was utilized to leverage the robustness of SVM, while accommodating the continuous intensities of sentiments and affects.

• Ensemble classifiers aggregate multiple independent classifiers built using different techniques or feature subsets, improving performance over a single classifier.
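
The aggregation idea can be sketched without any learning library. In the sketch below each ensemble member is a simple weighted sum over its own feature subset, standing in for a trained SVR model. Averaging the member outputs and clipping to the [0, 1] intensity scale are assumptions, since the slides do not say how the members are combined. The feature names are hypothetical.

```python
from statistics import mean


def make_member(feature_subset, weights, bias=0.0):
    """Build a stand-in regressor that only looks at one feature subset.

    A trained SVR model would take this role in the described approach.
    """
    def predict(features):
        return bias + sum(weights[f] * features.get(f, 0.0) for f in feature_subset)
    return predict


def ensemble_predict(members, features, low=0.0, high=1.0):
    """Average the member predictions (assumed rule) and clip to the scale."""
    raw = mean([member(features) for member in members])
    return min(high, max(low, raw))


# Two members built on different (hypothetical) feature subsets.
members = [
    make_member(["ngram_a"], {"ngram_a": 0.8}),
    make_member(["ngram_a", "ngram_b"], {"ngram_a": 0.4, "ngram_b": 0.5}),
]
intensity = ensemble_predict(members, {"ngram_a": 1.0, "ngram_b": 1.0})
```

For a sentiment model the same function could be called with `low=-1.0`, matching the -1 to 1 scale.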

Methods

• For the analysis of the Al-Firdaws and Montada web forums, a separate classifier was developed for each of the five sentiment and affect classes

Methods

• Feature selection

– Information gain (IG) heuristic

• Intensities had to be discretized before IG could be applied and the relevant features selected.

• To compensate for the discretization, multiple iterations were performed varying the number of class bins for intensity between 2 and 10.

• The IG heuristic was used recursively to select relevant features in these iterations using recursive feature elimination (RFE).
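
A minimal sketch of information gain over discretized intensities, repeated for 2 to 10 class bins. Equal-width bins, binary present/absent features, and the threshold value are assumptions made for the example. The recursive elimination step is not shown here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def discretize(intensity, bins, low=0.0, high=1.0):
    """Map a continuous intensity to one of `bins` equal-width class bins."""
    index = int((intensity - low) / (high - low) * bins)
    return min(index, bins - 1)  # the top of the scale falls in the last bin


def information_gain(present, labels):
    """IG of a binary (present/absent) feature with respect to discrete labels."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [label for p, label in zip(present, labels) if p == value]
        if subset:
            gain -= len(subset) / total * entropy(subset)
    return gain


def select_features(feature_presence, intensities, threshold, bin_counts=range(2, 11)):
    """For each bin count, keep the features whose IG exceeds the threshold."""
    selected = {}
    for bins in bin_counts:
        labels = [discretize(v, bins) for v in intensities]
        selected[bins] = {
            name for name, present in feature_presence.items()
            if information_gain(present, labels) > threshold
        }
    return selected


# Toy data: feature "a" tracks high intensity, feature "b" does not.
toy_features = {"a": [False, False, True, True], "b": [True, False, True, False]}
toy_selected = select_features(toy_features, [0.1, 0.2, 0.8, 0.9], threshold=0.5)
```

Varying the bin count matters because a feature that looks uninformative with two coarse bins can separate the classes once the scale is cut more finely, and the reverse.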

Methods

IG: Selected each feature with a score above a threshold.

RFE: Removed half of the features in each iteration until only the desired number remained.

where:

$F_x$ is the selected subset of features for class discretization $x$, with $x \in \{2, 3, \ldots, 10\}$

$F_{IG_x}$ and $F_{RFE_x}$ are the selected features for class discretization $x$ when using IG and RFE, respectively

Each $F_x$ is formed from $F_{IG_x}$ and $F_{RFE_x}$.
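
The halving step of recursive feature elimination can be sketched as a loop that re-scores the surviving features and keeps the better half. The scoring function here is a hypothetical stand-in (the IG heuristic would take that role), and `target` is an assumed stopping size.

```python
def halve_features(features, score_fn, target):
    """Drop the lower-scoring half of the features until `target` remain.

    `score_fn` receives the surviving features and returns a dict of scores,
    so scores can be recomputed after every elimination round.
    """
    remaining = list(features)
    while len(remaining) > target:
        scores = score_fn(remaining)
        ranked = sorted(remaining, key=lambda f: scores[f], reverse=True)
        keep = max(target, len(ranked) // 2)  # never cut below the target size
        remaining = ranked[:keep]
    return remaining


# Hypothetical fixed scores, used only to exercise the loop.
fixed_scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3, "e": 0.7}
survivors = halve_features(
    fixed_scores, lambda feats: {f: fixed_scores[f] for f in feats}, target=2
)
```

In a real run the scores would change between rounds, which is the reason for re-scoring instead of ranking once.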

Methods

• The feature selection phase selected a subset of the test-bed features for each of the 5 classifiers in the ensemble.

• Originally there were 7556 features; only 22% were selected.

Methods

• Evaluation was performed using 10-fold cross validation
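
For readers unfamiliar with the procedure, 10-fold cross validation splits the data into ten parts and tests on each part once while training on the other nine. A minimal sketch of the index splitting follows. The shuffle seed and the item count of 1000 are illustrative assumptions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(1000))
```

Every item appears in exactly one test fold, so each scored sentence contributes to the error estimate once.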

Results

• A sample of messages and their sentiment and affect intensities determined through automated analysis is presented in Table VII.

Results

• Results confirm the assessment of the forums by domain experts.

• The Al-Firdaws forum contained higher intensities of violence and hate affects with a more negative sentiment polarity

Results

• The percentage of postings containing intense levels of the four affects is greater in the Al-Firdaws forum than in the Montada forum, as shown in Figs. 8 and 9.
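
One way such a percentage could be computed from per-posting affect scores is sketched below. The cutoff of 0.5 for an "intense" level and the sample scores are assumptions; the slides do not state the threshold used.

```python
def percent_intense(scores, threshold=0.5):
    """Percentage of postings whose affect intensity meets the threshold."""
    if not scores:
        return 0.0
    return 100.0 * sum(score >= threshold for score in scores) / len(scores)


# Hypothetical per-posting intensities for one affect in one forum.
example_percent = percent_intense([0.1, 0.6, 0.9, 0.2])
```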

Results

• The violence and hate affects were used by a relatively large percentage of Al-Firdaws authors.

Results

• A time series analysis was performed to understand how forum affect intensities progressed over time
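
A simple form of such a time series is the average affect intensity per calendar month. The sketch below assumes each posting carries a date and a predicted intensity. Monthly bucketing and the sample values are assumptions, not details from the study.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_mean_intensity(posts):
    """Average intensity per (year, month), in chronological order."""
    buckets = defaultdict(list)
    for posted_on, intensity in posts:
        buckets[(posted_on.year, posted_on.month)].append(intensity)
    return {month: mean(values) for month, values in sorted(buckets.items())}


# Hypothetical postings: (posting date, predicted affect intensity).
series = monthly_mean_intensity([
    (date(2007, 2, 1), 0.9),
    (date(2007, 1, 5), 0.2),
    (date(2007, 1, 20), 0.4),
])
```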