
1

Intelligence and Security Informatics for International Security:

Framework and Case Studies

Hsinchun Chen, Ph.D.

McClelland Professor of MIS
Director, Artificial Intelligence Lab
NSF COPLINK Center
Management Information Systems Department
Eller College of Management, University of Arizona

2

Building a New Discipline: ISI

3

Intelligence and Security Informatics (ISI)

• “development of advanced information technologies, systems, algorithms, and databases for national security related applications, through an integrated technological, organizational, and policy-based approach” (Chen et al., 2003a)

Building a New Discipline

4

Conferences and Workshops:
• NSF/DOJ/CIA, ISI 2003, Tucson, AZ
• NSF/CIA/DHS, ISI 2004, Tucson, AZ
• IEEE NSF/CIA/DHS, IEEE ISI 2005, Atlanta, Georgia
• PAKDD ISI Workshop 2006, Singapore
• IEEE NSF/CIA/DHS, IEEE ISI 2006, San Diego, CA
• IEEE ISI 2007 (NJ); IEEE ISI 2008 (Taiwan)

Building a New Discipline

5

Professional Societies:
• IEEE Intelligent Transportation Systems Society (ITSS): hosting IEEE ISI
• President: Wang; VP: Zeng; BOG: Chen
• IEEE ITSS Technical Committee on Homeland Security (TCHS)
• IEEE Systems, Man, and Cybernetics Society (SMCS) Technical Committee on Homeland Security (TCHS)

Building a New Discipline

6

Journal Special Issues (Appeared and In Press):
• “Intelligence and Security Informatics,” Journal of the American Society for Information Science and Technology, special issue on Intelligence and Security Informatics, Volume 56, Number 3, Pages 217-220, 2005.
• “Artificial Intelligence for Homeland Security,” IEEE Intelligent Systems, special issue on AI for Homeland Security, Volume 20, Number 5, Pages 12-16, 2005.
• “Intelligence and Security Informatics for Homeland Security: Information, Communication, and Transportation,” IEEE Transactions on Intelligent Transportation Systems, special section, 2006 (in press).
• “Intelligence and Security Informatics: Information Systems Perspective,” Decision Support Systems, special issue on Intelligence and Security Informatics, 2006 (in press).

Building a New Discipline

7

Books:
• H. Chen, “Intelligence and Security Informatics for International Security: Information Sharing and Data Mining,” Springer, forthcoming, 2005.
• H. Chen, T. S. Raghu, R. Ramesh, A. Vinze, and D. Zeng, “Handbooks in Information Systems -- National Security,” Elsevier Scientific, forthcoming, 2006.
• H. Chen, E. Reid, and J. Sinai, “Terrorism Informatics,” Springer, forthcoming, 2006.

Building a New Discipline

8

Call for Participation:
• PAKDD ISI Workshop 2006, Singapore, April 9-10, 2006 (Springer LNCS)
• IEEE NSF/CIA/DHS, IEEE ISI 2006, San Diego, CA, April 22-24, 2006 (Springer LNCS)
• IEEE Transactions on Knowledge and Data Engineering (TKDE), special issue
• IEEE Transactions on Intelligent Transportation Systems (TITS), special issue
• IEEE Transactions on Systems, Man, and Cybernetics (TSMC), special issue

Building a New Discipline

9

Call for Participation:
• IEEE ITSS and SMC Technical Committee (TC) involvement
• IEEE annual ITSS and SMC conference special sessions (IEEE SMC October 2006, Taipei, Taiwan)
• IEEE ISI 2008 in Taipei, Taiwan (pending approval)

Building a New Discipline

10

Intelligence and Security Informatics for International Security:

Information Sharing and Data Mining

11

12

• Intelligence and Security Informatics (ISI): Challenges and Opportunities
• An Information Sharing and Data Mining Research Framework
• ISI Research: Literature Review
• National Security Critical Mission Areas and Case Studies
– Intelligence and Warning
– Border and Transportation Security
– Domestic Counter-terrorism
– Protecting Critical Infrastructure and Key Assets
– Defending Against Catastrophic Terrorism
– Emergency Preparedness and Responses
• The Partnership and Collaboration Framework

Outline

13

• Federal authorities are actively implementing comprehensive strategies and measures to achieve three objectives:
– Preventing future terrorist attacks
– Reducing the nation’s vulnerability
– Minimizing the damage and recovering from attacks that occur
• Science and technology have been identified in the “National Strategy for Homeland Security” report as the keys to winning the new counter-terrorism war.

Introduction

14

• Six critical mission areas:
– Intelligence and Warning
– Border and Transportation Security
– Domestic Counter-terrorism
– Protecting Critical Infrastructure and Key Assets
– Defending Against Catastrophic Terrorism
– Emergency Preparedness and Response

Information Technology and National Security

15

• Given the critical missions of national security and the various data and technical challenges they raise, we believe there is a pressing need to develop the science of “Intelligence and Security Informatics” (ISI).

Problems and Challenges

16

ISI vs. Biomedical Informatics

17

Federal Initiatives and Funding Opportunities in ISI

• Research and funding opportunities in ISI are abundant:
– National Science Foundation (NSF), Information Technology Research (ITR) Program
– Department of Homeland Security (DHS)
– National Institutes of Health (NIH), National Library of Medicine (NLM), Informatics for Disaster Management Program
– Centers for Disease Control and Prevention (CDC), National Center for Infectious Diseases (NCID), Bioterrorism Extramural Research Grant Program
– Department of Defense (DOD), Advanced Research & Development Activity (ARDA) Program
– Department of Justice (DOJ), National Institute of Justice (NIJ)

18

Crime Types

Crime types and security concerns

19

• KDD techniques can play a central role in improving counter-terrorism and crime-fighting capabilities of intelligence, security, and law enforcement agencies by reducing the cognitive and information overload.

• Many of these KDD technologies could be applied in ISI studies (Chen et al., 2003a; Chen et al., 2004b). Given the special characteristics of crimes, criminals, and crime-related data, we categorize existing ISI technologies into six classes:
– information sharing and collaboration
– crime association mining
– crime classification and clustering
– intelligence text mining
– spatial and temporal crime mining
– criminal network mining

An ISI Research Framework

20

[Figure: A knowledge discovery research framework for ISI]

21

• The potential negative effects of intelligence gathering and analysis on the privacy and civil liberties of the public have been well publicized (Cook & Cook, 2003).

• There exist many laws, regulations, and agreements governing data collection, confidentiality, and reporting, which could directly impact the development and application of ISI technologies.

Caveats for Data Mining

22

• In the context of domestic security, surveillance is considered an important intelligence tool that has the potential to contribute significantly to national security but also to infringe on civil liberties (Strickland, 2005).
• Data mining using public or private sector databases for national security purposes must proceed with caution:
– The search for general information must ensure anonymity.
– The acquisition of specific identity, if required, must be court authorized under appropriate standards or warrants.
– The peril of the “security-industrial complex”: the marriage of private data and technology companies and government anti-terror initiatives (R. O’Harrow, “No Place to Hide”).

Domestic Security, Civil Liberties, and Knowledge Discovery

23

• Information Sharing and Collaboration
• Crime Association Mining
• Crime Classification and Clustering
• Intelligence Text Mining
• Crime Spatial and Temporal Mining
• Criminal Network Analysis

ISI Research: Literature Review

24

• Information sharing across jurisdictional boundaries of intelligence and security agencies has been identified as one of the key foundations for ensuring national security (Office of Homeland Security, 2002).
• Information sharing faces several difficulties:
– Legal and cultural issues regarding information sharing
– The need to integrate and combine data that are organized in different schemas and stored in different database systems running on different hardware platforms and operating systems (Hasselbring, 2000)

Information Sharing and Collaboration

25

• Three approaches to data integration have been proposed (Garcia-Molina et al., 2002):
– Federation: maintains data in their original, independent sources but provides a uniform data access mechanism (Buccella et al., 2003; Haas, 2002).
– Warehousing: an integrated system in which copies of data from different data sources are migrated and stored to provide uniform access.
– Mediation: relies on “wrappers” to translate and pass queries from multiple data sources.
• These techniques are not mutually exclusive. All of them depend, to a great extent, on the matching between different databases.

Approaches to Data Integration

26

• The task of database matching can be broadly divided into schema-level and instance-level matching (Lim et al., 1996; Rahm & Bernstein, 2001).
– Schema-level matching is performed by aligning semantically corresponding columns between two sources.
– Instance-level or entity-level matching connects records describing a particular object in one database to records describing the same object in another database.
– Instance-level matching is frequently performed after schema-level matching is completed.
• Information integration approaches have been used in law enforcement and intelligence agencies for investigation support.
• Information sharing has also been undertaken in intelligence and security agencies through cross-jurisdictional collaborative systems.
– E.g., COPLINK (Chen et al., 2003b)

Database and Application
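Instance-level matching, as described on this slide, can be illustrated with a short sketch. It is not the method used in COPLINK or any cited system: the field names (`id`, `name`, `dob`), the 0.85 threshold, and the use of Python's `difflib` for string similarity are all assumptions made for illustration, and schema-level matching is assumed to be finished already.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two case-folded names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_records(source_a, source_b, threshold=0.85):
    """Pair records from two sources that likely describe the same person.

    Assumes schema-level matching is already done, so both sources expose
    the same hypothetical fields: 'id', 'name', and 'dob'.
    """
    matches = []
    for rec_a in source_a:
        for rec_b in source_b:
            same_dob = rec_a["dob"] == rec_b["dob"]
            score = name_similarity(rec_a["name"], rec_b["name"])
            # Require an exact date-of-birth match plus a close name match.
            if same_dob and score >= threshold:
                matches.append((rec_a["id"], rec_b["id"], round(score, 2)))
    return matches


# Invented example records from two hypothetical agencies.
agency_a = [{"id": "A1", "name": "John Smith", "dob": "1970-01-02"}]
agency_b = [
    {"id": "B7", "name": "Jon Smith", "dob": "1970-01-02"},
    {"id": "B8", "name": "Jane Smythe", "dob": "1970-01-02"},
]
print(match_records(agency_a, agency_b))
```

The pairwise loop is quadratic in the number of records; a practical system would likely block on a cheap key (for example, birth year) before comparing names.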

27

• One of the most widely studied approaches is association rule mining, a process of discovering frequently occurring item sets in a database.
• An association is expressed as a rule X → Y, indicating that item set X and item set Y occur together in the same transaction (Agrawal et al., 1993).
• Each rule is evaluated using two probability measures, support and confidence, where support is defined as prob(X ∪ Y), the probability that a transaction contains all items in both X and Y, and confidence as prob(X ∪ Y) / prob(X).
– E.g., “diaper → milk with 60% support and 90% confidence” means that 60% of customers buy both diapers and milk in the same transaction and that 90% of the customers who buy diapers also buy milk.

Crime Association Mining
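The two measures can be computed directly from a list of transactions. The sketch below is illustrative only: the basket data are invented, and the function names `support` and `confidence` are not taken from any cited system.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Estimate prob(X and Y together) / prob(X) for the rule X -> Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Invented transactions for illustration.
baskets = [
    {"diaper", "milk"},
    {"diaper", "milk", "bread"},
    {"diaper"},
    {"milk"},
    {"diaper", "milk"},
]

print(support(baskets, {"diaper", "milk"}))
print(confidence(baskets, {"diaper"}, {"milk"}))
```

In this toy data, three of five baskets contain both items and four contain diapers, so the rule diaper → milk has support 3/5 and confidence 3/4.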

28

• Crime association mining techniques can include incident association mining and entity association mining (Lin & Brown, 2003).
• Two approaches, similarity-based and outlier-based, have been developed for incident association mining:
– The similarity-based method detects associations between crime incidents by comparing crimes’ features (O'Hara & O'Hara, 1980).
– The outlier-based method focuses only on the distinctive features of a crime (Lin & Brown, 2003).
• The task of finding and charting associations between crime entities such as persons, weapons, and organizations is often referred to as entity association mining (Lin & Brown, 2003) or link analysis.

Crime Association Mining Techniques

29

• Three types of link analysis approaches have been suggested: heuristic-based, statistical-based, and template-based.
– Heuristic-based approaches rely on decision rules used by domain experts to determine whether two entities in question are related.
– Statistical-based approaches, e.g., Concept Space (Chen & Lynch, 1992), measure the weighted co-occurrence associations between records of entities (persons, organizations, vehicles, and locations) stored in crime databases.
– Template-based approaches have been used primarily to identify associations between entities extracted from textual documents such as police report narratives.

Link Analysis Approaches
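A statistical-based link analysis can be sketched as a weighted co-occurrence count. The example below is not the Concept Space weighting of Chen & Lynch (1992). It substitutes a plain Jaccard overlap as a stand-in, and the record contents and entity labels are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(records):
    """Weight each entity pair by the Jaccard overlap of the records they appear in."""
    appears_in = Counter()  # number of records mentioning each entity
    together = Counter()    # number of records mentioning both entities of a pair
    for entities in records:
        unique = sorted(set(entities))
        appears_in.update(unique)
        together.update(combinations(unique, 2))
    return {
        pair: n / (appears_in[pair[0]] + appears_in[pair[1]] - n)
        for pair, n in together.items()
    }


# Each inner list holds the entities named in one hypothetical incident record.
incident_reports = [
    ["person:A", "person:B", "vehicle:V1"],
    ["person:A", "person:B"],
    ["person:A", "location:L1"],
]

weights = cooccurrence_weights(incident_reports)
for pair, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(pair, round(weight, 2))
```

Pairs that share most of their records receive weights near 1, while pairs that co-occur only occasionally receive weights near 0.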

30

• Classification is the process of mapping data items into one of several predefined categories based on attribute values of the items (Hand, 1981; Weiss & Kulikowski, 1991).

• Widely used classification techniques:
– Discriminant analysis (Eisenbeis & Avery, 1972)
– Bayesian models (Duda & Hart, 1973; Heckerman, 1995)
– Decision trees (Quinlan, 1986, 1993)
– Artificial neural networks (Rumelhart et al., 1986)
– Support vector machines (SVM) (Vapnik, 1995)
• Several of these techniques have been applied in the intelligence and security domain to detect financial fraud and computer network intrusion.

Crime Classification and Clustering

31

• Clustering groups similar data items into clusters without knowing their class membership. The basic principle is to maximize intra-cluster similarity while minimizing inter-cluster similarity (Jain et al., 1999)

• Various clustering methods have been developed, including hierarchical approaches such as complete-link algorithms (Defays, 1977), partitional approaches such as k-means (Anderberg, 1973; Kohonen, 1995), and Self-Organizing Maps (SOM) (Kohonen, 1995).

• The use of clustering methods in the law enforcement and security domains can be categorized into two types: crime incident clustering and criminal clustering.

Crime Classification and Clustering
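The partitional approach mentioned on this slide can be illustrated with a bare-bones k-means loop. This is a teaching sketch, not a production implementation: incidents are assumed to be 2-D feature vectors, the first k points seed the centroids (a simplification), and all data are invented.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


# Hypothetical incident feature vectors forming two loose groups.
incidents = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.9, 8.3), (0.9, 1.1), (8.2, 7.7)]
centroids, labels = kmeans(incidents, k=2)
print(labels)
print(centroids)
```

As with any k-means variant, the number of clusters must be chosen in advance, and the result depends on how the centroids are seeded.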

32

• Text mining has attracted increasing attention in recent years as the natural language processing capabilities advance (Chen, 2001). An important task of text mining is information extraction, a process of identifying and extracting from free text select types of information such as entities, relationships, and events (Grishman, 2003). The most widely studied information extraction subfield is named entity extraction.

• Four major named-entity extraction approaches have been proposed:
– Lexical-lookup
– Rule-based
– Statistical model
– Machine learning
• Intelligence text mining aims to identify people, organizations, locations, properties, and relationships of interest.

Intelligence Text Mining
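Of the four approaches, lexical lookup is the simplest to show in code. The sketch below assumes a tiny hand-built lexicon (the `GAZETTEER` dictionary and every name in it are hypothetical) and matches its entries against free text with word-boundary regular expressions.

```python
import re

# Hypothetical gazetteer; a real lexicon would be far larger.
GAZETTEER = {
    "person": ["John Doe", "Jane Roe"],
    "organization": ["Acme Shipping"],
    "location": ["Tucson", "San Diego"],
}


def lexical_lookup(text):
    """Return (entity_type, surface_string, start_offset) for every lexicon hit."""
    hits = []
    for entity_type, names in GAZETTEER.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((entity_type, m.group(0), m.start()))
    # Report hits in the order they appear in the text.
    return sorted(hits, key=lambda h: h[2])


narrative = "John Doe met a driver from Acme Shipping near Tucson."
for hit in lexical_lookup(narrative):
    print(hit)
```

Lexical lookup finds only names already in the lexicon, which is one reason rule-based, statistical, and machine learning extractors are also used.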

33

• Most crimes, including terrorism, have significant spatial and temporal characteristics (Brantingham & Brantingham, 1981).

• Crime spatial and temporal mining aims to gather intelligence about environmental factors that prevent or encourage crimes (Brantingham & Brantingham, 1981), identify geographic areas of high crime concentration (Levine, 2000), and detect crime trends (Schumacher & Leitner, 1999).
• Two major approaches for crime temporal pattern mining:
– Visualization: presents individual or aggregated temporal features of crimes using a periodic view or a timeline view.
– Statistical approach: builds statistical models from observations to capture the temporal patterns of events.

Crime Spatial and Temporal Mining

34

• Three approaches for crime spatial pattern mining (Murray et al., 2001):
– Visual approach (crime mapping): presents a city or region map annotated with various crime-related information.
– Clustering approaches: used in hot spot analysis, a process of automatically identifying areas with high crime concentration. Partitional clustering algorithms such as k-means are often used for finding crime hot spots. They usually require the user to predefine the number of clusters to be found.
– Statistical approaches: used to conduct hot spot analysis or to test the significance of hot spots (Craglia et al., 2000), and to predict crime.

Crime Spatial and Temporal Mining

35

• Criminals seldom operate alone but instead interact with one another to carry out various illegal activities. Relationships between individual offenders form the basis for organized crime and are essential for the effective operation of a criminal enterprise.

• Criminal enterprises can be viewed as a network consisting of nodes (individual offenders) and links (relationships).

• Structural network patterns in terms of subgroups, between-group interactions, and individual roles thus are important to understanding the organization, structure, and operation of criminal enterprises.

Criminal Network Analysis

36

• Social Network Analysis (SNA) provides a set of measures and approaches for structural network analysis (Wasserman & Faust, 1994).

• SNA is capable of:
– Subgroup detection
– Central member identification
– Discovery of patterns of interaction
• SNA also includes visualization methods that present networks graphically.
– The Smallest Space Analysis (SSA) approach (Wasserman & Faust, 1994) is used extensively in SNA to produce two-dimensional representations of social networks.
• Network Topological Analysis aims to identify topological characteristics of complex networks (e.g., random, small world, and scale-free networks) and their dynamics and guiding properties.

Criminal Network Analysis
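Central member identification can be illustrated with degree centrality, one of the standard SNA measures. The sketch below is a minimal example on an invented five-member network. The slides do not say which centrality measure the case studies used, so degree centrality is an assumption.

```python
from collections import defaultdict


def degree_centrality(links):
    """Normalized degree: share of the other members each node is directly tied to."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical relationships (undirected links) between offenders A to E.
links = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
scores = degree_centrality(links)
central = max(scores, key=scores.get)
print(scores)
print("Most central member:", central)
```

Degree counts only direct ties. Measures such as betweenness or closeness would be needed to find members who broker between subgroups.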

37

• The above-reviewed six classes of KDD techniques constitute the key components of our proposed ISI research framework. Our focus on the KDD methodology, however, does NOT exclude other approaches.

• Researchers from different disciplines can contribute to ISI.
– DB, AI, data mining, algorithms, networking, and grid computing researchers can contribute to core information infrastructure, integration, and analysis research of relevance to ISI.
– IS and management science researchers could help develop the quantitative, system, and information theory based methodologies needed for the systematic study of national security.
– Cognitive science, behavioral research, and management and policy are critical to the understanding of the individual, group, organizational, and societal impacts and effective national security policies.

Conclusion and Future Direction

38

• Intelligence and Warning
• Border and Transportation Security
• Domestic Counter-terrorism
• Protecting Critical Infrastructure and Key Assets
• Defending Against Catastrophic Terrorism
• Emergency Preparedness and Responses

National Security Critical Mission Areas and Case Studies

39

• By analyzing the communication and activity patterns among terrorists and their contacts, detecting deceptive identities, or employing other surveillance and monitoring techniques, intelligence and warning systems may issue timely, critical alerts to prevent attacks or crimes from occurring.

Intelligence and Warning

| Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed |
|---|---|---|---|---|
| 1 | Detecting deceptive identities | Authoritative source; structured criminal identity records | Association mining | Intelligence and warning |
| 2 | Dark Web Portal | Open source; Web hyperlink data | Web spidering and archiving; portal access | Intelligence and warning |
| 3 | Jihad on the Web | Open source; multilingual, Web data | Web spidering; multilingual indexing; link and content analysis | Intelligence and warning |
| 4 | Analyzing Al Qaeda network | Open source; news articles | Statistics-based; network topological analysis | Intelligence and warning |

Four case studies of relevance to intelligence and warning

40

• The capabilities of counter-terrorism and crime-fighting can be greatly improved by creating a “smart border,” where information from multiple sources is integrated and analyzed to help locate wanted terrorists or criminals. Technologies such as information sharing and integration, collaboration and communication, and biometrics and speech recognition will be greatly needed in such smart borders.

Border and Transportation Security

| Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed |
|---|---|---|---|---|
| 5 | BorderSafe information sharing | Authoritative source; structured criminal identity records | Information sharing and integration; database federation | Border and transportation security |
| 6 | Cross-border network analysis | Authoritative source; structured criminal identity records | Network topological analysis | Border and transportation security |

Two case studies of relevance to Border and Transportation Security

41

• Because terrorists, both international and domestic, may be involved in local crimes, information technologies that help find cooperative relationships between criminals and their interactive patterns would also be helpful for analyzing domestic terrorism.

Domestic Counter-terrorism

| Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed |
|---|---|---|---|---|
| 7 | COPLINK Detect | Authoritative source; structured data | Association mining | Domestic counter-terrorism |
| 8 | Criminal network analysis | Authoritative source; structured data | Social network analysis; cluster analysis; visualization | Domestic counter-terrorism |
| 9 | Domestic extremists on the Web | Open source; Web-based text data | Web spidering; link and content analysis | Domestic counter-terrorism |
| 10 | Dark networks analysis | Authoritative and open sources | Network topological analysis | Domestic counter-terrorism |

Four case studies of relevance to Domestic Counter-terrorism (Chapter 7)

42

• Criminals and terrorists are increasingly using cyberspace to conduct illegal activities, share ideology, solicit funding, and recruit. One aspect of protecting cyber infrastructure is to determine the source and identity of unwanted threats or intrusions.

Protecting Critical Infrastructure and Key Assets

| Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed |
|---|---|---|---|---|
| 11 | Identity tracing in cyberspace | Open source; multilingual, text, Web data | Feature extraction; classification | Protecting critical infrastructure |
| 12 | Writeprint feature selection | Open source; multilingual, text, Web data | Feature extraction; feature selection | Protecting critical infrastructure |
| 13 | Arabic authorship analysis | Open source; multilingual, text, Web data | Feature extraction; classification | Protecting critical infrastructure |

Three case studies of relevance to Protecting Critical Infrastructure and Key Assets

43

• Biological attacks may cause contamination, infectious disease outbreaks, and significant loss of life. Information systems that can efficiently and effectively collect, access, analyze, and report data about catastrophe-leading events can help prevent, detect, respond to, and manage these attacks.

Defending Against Catastrophic Terrorism

| Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed |
|---|---|---|---|---|
| 14 | BioPortal for information sharing | Authoritative source; structured data | Information integration and messaging; GIS analysis and visualization | Defending against catastrophic terrorism |
| 15 | Hotspot analysis | Authoritative source; structured data | Statistics-based SaTScan; clustering; SVM | Defending against catastrophic terrorism |

Two case studies of relevance to Defending Against Catastrophic Terrorism

44

• Information technologies that help optimize response plans, identify experts, train response professionals, and manage consequences are beneficial to defend against catastrophes in the long run. Moreover, information systems that provide social and psychological support to the victims of terrorist attacks can also help the society recover from disasters.

Emergency Preparedness and Responses

| Case Study | Project | Data Characteristics | Technologies Used | Critical Mission Area Addressed |
|---|---|---|---|---|
| 16 | Terrorism expert finder | Open source; structured, citation data | Bibliometric analysis | Emergency preparedness and responses |
| 17 | Chatterbot for terrorism information | Open source; structured data | Dialog system | Emergency preparedness and responses |

Two case studies of relevance to Emergency Preparedness and Responses

45

• Dark Web Collection Building

• Dark Web Content Analysis

• Dark Web Forum Authorship Analysis and Visualization

ISI Dark Web Case Studies

46

Exploring the Dark Side of the Web: Collection and Analysis of Extremist Online Forums

47

Terrorists’ Communication on the Internet

• The Internet enables diverse forms of communication.
• The complexity of communication can range from text-only messages to the use of multimedia.
• Below is a comparison of communication media on the Internet.

| Medium | Access | Temporal flow | Directionality | Focus | Feasibility of automatic surveillance |
|---|---|---|---|---|---|
| Emails | Private | Asynchronous | One to one | Not focused | Not feasible (only email service providers have access to the data) |
| Instant messengers | Private | Synchronous | One to one | Not focused | Not feasible (only network servers have access to the data) |
| Forums, newsgroups, discussion boards | Public | Asynchronous | Many to many | Usually focused | Feasible (all registered group members have access to the data from all over the world) |
| Chat rooms | Public/Private | Synchronous | Many to many | Usually not focused | Not feasible (content is not retained; only chat room servers have access to the data) |

48

Forum as Communication Tool for Terrorists and Their Supporters

[Figure: A Typical Forum (board view). Callouts: the title of the board; multiple pages of the board; the title of the thread; multiple pages of one thread; # of replies of the thread; # of views of the thread]

49

Forum as Communication Tool for Terrorists and Their Supporters

[Figure: A Typical Forum (thread view). Callouts: post time; the title of the thread; the body of the message; user ID; the virtual rank of the author in the forum; other information about the author; reply of the main thread]

50

Proposed Approach

A Web Mining Approach

51

3. Handle Different Forum Software

• Identify forum software packages that were used
• Identify the URL patterns of the forum software
• Customize spiders based on the forum package

List of Forum Software

| Software Package | Language |
|---|---|
| Crosstar | PHP |
| DCForum | PHP |
| ezboard | CGI |
| IM | PHP |
| Invision Power Board | PHP |
| newbb | PHP |
| phpBB | PHP |
| rafia | PHP |
| vBulletin | PHP |
| WebRing | CGI |
| WebWiz | ASP |
| YaBB | PHP |

52

4. Identify Threads, Posts, Authors, etc.

• Identify and record URL patterns for threads, posts, authors, time posted, # of views, etc. of the forum software.

• Metadata from both board files and thread files need to be extracted.
– vBulletin example (# represents numbers):
• URL pattern of boards: forumdisplay.php?forumid=#&daysprune=#&sortorder=&sortfield=lastpost&perpage=#&pagenumber=#
• URL pattern of topics: showthread.php?threadid=###&perpage=##&pagenumber=##
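The two vBulletin URL patterns above can be recognized with the standard library alone. The sketch below is illustrative and is not the project's actual spider code: the function name, the returned dictionary layout, and the `forum.example.org` host are all hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def classify_vbulletin_url(url):
    """Label a URL as a board page or thread page and pull out its numeric IDs."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    page = parsed.path.rsplit("/", 1)[-1]
    if page == "forumdisplay.php" and "forumid" in params:
        return {"type": "board", "id": int(params["forumid"][0]),
                "page": int(params.get("pagenumber", ["1"])[0])}
    if page == "showthread.php" and "threadid" in params:
        return {"type": "thread", "id": int(params["threadid"][0]),
                "page": int(params.get("pagenumber", ["1"])[0])}
    # Anything else (member profiles, search pages, etc.) is ignored here.
    return None


print(classify_vbulletin_url(
    "http://forum.example.org/vb/showthread.php?threadid=123&perpage=15&pagenumber=2"))
print(classify_vbulletin_url(
    "http://forum.example.org/vb/forumdisplay.php?forumid=7&daysprune=30"
    "&sortorder=&sortfield=lastpost&perpage=25&pagenumber=1"))
```

A spider customized per forum package would hold one such classifier for each package listed in the software table, since each package uses its own URL scheme.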

53

Forum Board File Example: Google Groups

[Figure: Google Groups board page. Callouts: the title of the group; description of the group; the title of the thread; author; post time]

54

Dark Web Forums Identification

Forum Identification: Overall Distribution by ISP Providers (# of forums)

| Host | Middle-Eastern | Latin-American | US Domestic |
|---|---|---|---|
| Websites | 48 | 4 | 18 |
| Yahoo! Groups | 20 | 11 | 31 |
| Google Groups | 0 | 32 | 47 |
| MSN | 0 | 5 | 9 |
| AOL | 0 | 0 | 5 |
| Local ISP | 0 | 8 | 0 |

55

Forum Identification -- Distribution Analysis

# of Forums by Category: US Domestic

| Category | # of forums |
|---|---|
| White Supremacy | 31 |
| Militia | 30 |
| Neo Nazis | 21 |
| Black Separatist | 14 |
| Others | 10 |
| Christian Identity | 4 |
| Neo Confederate | 0 |
| Racist Skinhead | 0 |

56

Forum Identification -- Distribution Analysis

# of Forums by Category: Middle-Eastern

| Category | # of forums |
|---|---|
| Sunni Muslim | 48 |
| Others | 17 |
| Secular | 1 |
| Jewish | 0 |
| Shi'a Muslim | 0 |
| Communist/Socialist | 0 |

57

Forum Collection (Yahoo Groups)

• US Domestic

| Group Name | Forum Name | Messages | Members |
|---|---|---|---|
| Animal Liberation Front | Animal Liberation Front | 890 | 31 |
| National Alliance | American National Socialist Group | 258 | 69 |
| Neo Nazi | Angelic_Adolf | 464 | 77 |
| World Knights of the Ku Klux Klan | aryannationsknights | 32 | 14 |
| Westboro Baptist Church | Peace Love And Unity Topeka | 48 | 19 |
| Council of Conservative Citizens | Citizens Councils News Update | 434 | 79 |
| Neo Nazi | Neo-Nazi | 1614 | 154 |
| New Black Panther Party | New Black Panther Party | 5051 | 1102 |
| National Socialist Movement | NSM World | 5660 | 789 |
| United Nuwaubian Nation of Moors | NUWAUBU RIGHT KNOWLEDGE | 5269 | 258 |
| Neo Nazi | smashnazism | 103 | 13 |
| Sons of Liberty | Southern Sons Of Liberty | 1614 | 154 |
| Neo Nazi | thejapanesenazis | 218 | 41 |
| World Knights of the Ku Klux Klan | World_Knights | 248 | 7 |

58

Forum Collection (Yahoo Groups)

• Middle East

| Group Name | Forum Name | Messages | Members |
|---|---|---|---|
| AlQaeda | azzamy | 560 | 1052 |
| Al-Dawa | dawa-support · Dawa Committee Supporters | 203 | 77 |
| General Jihad (the exact affiliation is not clear) | friends_in_islam | 1541 | 77 |
| Hezbollah | hezbollah_iran · ya MAHDI adrekni | 854 | 92 |
| General Jihad (the exact affiliation is not clear) | Islamic_Action_Group · Islamic Action Group | 886 | 151 |
| General Jihad (the exact affiliation is not clear) | islamicresistance · Islamic Resistance - Speak out for truth, justice, and Palstine | 2306 | 121 |
| General Jihad (the exact affiliation is not clear) | islamic-union · اإلسالمي اإلتحاد | 1123 | 256 |
| Al-Aqsa Martyrs' Brigades | kataeb • kataeb al aqsa , شهداء كتائباالقصى | 400 | 142 |
| Hamas | kataeb_qassam | 855 | 188 |
| AlQaeda | taybah3 · { الطيبة { طيبة | 406 | 89 |
| AlQaeda | Usama_bin_laden · الدن بن اسامة | 360 | 535 |
| General Jihad (the exact affiliation is not clear) | wa-islamah · _إسالماه وا | 5026 | 5336 |

59

Forum Filetype Analysis (Website Forums)

Website Forums Collection

| File type | Arabic: # of files | Arabic: volume (bytes) | US Domestic: # of files | US Domestic: volume (bytes) |
|---|---|---|---|---|
| Total | 496,186 | 20,658,746,269 | 116,419 | 7,694,035,712 |
| Indexable Files | 208,174 | 12,132,567,109 | 93,655 | 6,511,416,058 |
| HTML Files | 2 | 832,049 | 33 | 570,175 |
| Word Files | 0 | 0 | 0 | 0 |
| PDF Files | 0 | 0 | 0 | 0 |
| Dynamic Files | 208,171 | 12,131,735,060 | 93,620 | 6,510,845,724 |
| Text Files | 98 | 1,054,027,204 | 2 | 20,967,860 |
| Excel Files | 0 | 0 | 0 | 0 |
| Powerpoint Files | 0 | 0 | 0 | 0 |
| XML Files | 0 | 0 | 2 | 229 |
| Multimedia Files | 226,118 | 6,661,118,184 | 21,518 | 1,136,758,815 |
| Image Files | 224,485 | 4,119,229,029 | 21,177 | 373,953,750 |
| Audio Files | 393 | 232,709,714 | 107 | 405,282,727 |
| Video Files | 1,240 | 2,309,179,441 | 234 | 357,522,338 |
| Archive Files | 0 | 0 | 0 | 0 |
| Non-Standard Files | 61,894 | 1,865,060,976 | 1,246 | 45,860,839 |

60

Findings

Forums: 3asfh (www.3asfh.net) and Shawati (www.shawati.com)

Discussions
– 3asfh: (1) Poems praising extremist actions (2) List of clerics with phone numbers and emails (3) Responses to media postings such as the “Desecration of the Qur'an” video
– Shawati: (1) Allegations of abuse of Iraqi children by American soldiers (2) Links to websites of clerics (3) Praise of the late Saudi King (4) Reports from Iraq Jihadists

Images
– 3asfh: (1) Banners of Bin Laden (2) Banners praising Palestinian extremists
– Shawati: (1) Pictures showing cadavers purported to be innocent Iraqis killed by American soldiers (2) Picture of Chechen martyrs

Audio
– 3asfh: (1) Readings from the Qur'an (2) Jihad hymns
– Shawati: (1) Audio recordings of speeches by extremist clerics

Video
– (1) Desecration of the Qur'an video; video showing the shooting of an Iraqi “collaborator”

• 3. Multimedia content heavily used
– Discussion and multimedia file content examples from Middle-Eastern forums

61

Findings

• 3. Multimedia content contains rich messages
– Discussion and multimedia file content examples from Middle-Eastern forums
– On http://www.alm2sda.net/vb/ we found the following:
  • Mentions leaders' names (Bin Laden, Zarqawi, and Sayyid Qutb)
  • Provides information about different kinds of bombs (e.g., how to prepare them, weight of each type)
  • Includes news reporting of operations and events
  • Provides detailed descriptions, with images, of different missiles
  • Some of the members are from Hamas and they are recruiting other members to join
  • Provides information on distributing viruses (under E-Jihad)

62

5. Event Tracking in Extremist Forums: “US Against US”

• Chronology of Events in Iraq in 2003-2004
• Two types of external events
– Attack events carried out by extremists against Western countries
  • Istanbul Attacks: Nov. 15, 2003
  • Madrid Bombings: Mar. 11, 2004
  • Berg Beheading: May 11, 2004
– Attack events carried out by Westerners on extremists’ own land: strong response!
  • Feb. 24, 2003: The United States, Great Britain, and Spain submit a proposed resolution to the UN Security Council stating, “Iraq has failed to take the final opportunity afforded to it in Resolution 1441.” The resolution concludes it is time to authorize use of military force.
  • March 17, 2003: Great Britain's ambassador to the UN says the diplomatic process on Iraq has ended. Arms inspectors evacuate. Pres. George W. Bush gives Saddam Hussein and his sons 48 hours to leave Iraq or face war.
  • March 21, 2003: The major phase of the war begins with heavy aerial attacks on Baghdad and other cities.
  • March 24, 2003: Troops march within sixty miles of Baghdad.
  • April 9, 2003: The fall of Baghdad: U.S. forces advance into central Baghdad.

63

5. Event Tracking in Extremist Forums--Case Analysis on Yahoo Group

64

Analyzing Terror Campaign on the Internet: Technical Sophistication, Media Richness,

and Web Interactivity

65

Existing Studies on Dark Web

Organization | Description | Access

Archive
1. Internet Archive (IA) | 1996-. Spidering (every 2 months) to collect open access HTML pages | Via http://www.archives.org

Research Center
2. Artificial Intelligence (AI) Lab, University of Arizona | 2003-. Spidering (every 2 months) to collect terrorist Web sites. Has 942 Web sites: U.S. Domestic (422), Latin America (188), and Middle Eastern (332) Web sites, 97 gigabytes size, 541,800 multimedia files | Via testbed portal called Dark Web Portal
3. Anti-terrorism Coalition (ATC) | 2003-. Jihad Watch. Has 448 terrorist Web sites & forums | Via http://www.jihadwatch.org
4. Prism (ICT, Israel) | 2002-. Limited # of Web sites. Project for Research of Islamist Movements. | Access reports via Web site http://www.e-prism.org
5. MEMRI | 2003-. Jihad & Terrorism Studies Project. | Access reports via http://www.memri.org
6. Site Institute | 2003-. Manually capture Web sites every 24 hrs. | Access reports & subscribe to fee-based intelligence reports & alerts http://siteinstitute.org
7. Weimann (Univ. Haifa, Israel) | 1998-. Manually capture Web sites daily | Closed collection

Vigilante Community
8. Internet Haganah | 2001-. Spidering. Confronting the Global Jihad Project. Has 100s of links to Web sites. | Provides snapshots of terrorist Web sites http://haganah.us
9. Johnathanrgalt | 2001-. Spidering. Islamic Terror Sites on the Web. Has 60-70 sites. Monitors sites that closed. | Provides snapshots of terrorist Web sites that are closed. http://www.geocities.com/johnathanrgalt

Table 1: Organizations that Capture and Analyze Terrorists’ Web sites

66

Web Usage Analysis in e-Government

• Several large-scale studies have been dedicated to studying governments’ Web usage:
– The Cyberspace Policy Research Group (CyPRG; www.cyprg.arizona.edu)
– United Nations Online Network in Public Administration and Finance (UNPAN; www.unpan.org)
– European Commission's IST program (www.cordis.lu/ist/)

• In addition to technical sophistication and media richness, e-Government research has also studied the interactivity and transparency of government Web sites (Demchak et al., 2001).

67

Dark Web Collection Building Method

Figure 1. The Dark Web Collection Building Procedure

1. Identify Terrorist Groups
– Government Reports (FBI, US State Department, UN Security Council, etc.)
– Research Centers (ATC, MEMRI, Dartmouth, Norwegian Research, etc.)
– Terrorism Lexicon (organization names, leader names, slogans, special keywords, ...)

2. Identify Terrorist Group URLs
– Government Reports (FBI, US State Department, UN Security Council, etc.)
– Research Centers (ATC, MEMRI, Dartmouth, Norwegian Research, etc.)
– Search Engines (Google, Yahoo, etc.)
– Initial Seed URLs

3. Expand Terrorist Group URLs by Link and Forum Analysis
– Back-link Extraction
– Out-link Extraction
– Website Forum Analysis
– Filtering
– Expanded URLs

4. Download Terrorist Site Contents
– Automatic Web Crawler (downloads multilingual, multimedia Web contents)
– Dark Web Testbed
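The out-link extraction and filtering in step 3 can be sketched roughly as follows. This is a minimal illustration and not the crawler used to build the testbed: the names `LinkCollector`, `extract_out_links`, and `expand_seed_urls` are hypothetical, pages are assumed to be already downloaded, and the `is_relevant` callback stands in for whatever filtering (manual or automatic) is applied.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_out_links(page_url, html):
    """Return absolute http(s) links that point to a host other than the page's own."""
    collector = LinkCollector()
    collector.feed(html)
    own_host = urlparse(page_url).netloc
    out_links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc != own_host:
            if absolute not in out_links:  # drop duplicates, keep order
                out_links.append(absolute)
    return out_links


def expand_seed_urls(seed_pages, is_relevant):
    """seed_pages maps seed URL -> HTML; is_relevant plays the role of the filtering step."""
    expanded = set(seed_pages)
    for url, html in seed_pages.items():
        for link in extract_out_links(url, html):
            if is_relevant(link):
                expanded.add(link)
    return sorted(expanded)
```

Back-link extraction would need an external source (for example a search engine query) and is not covered by this sketch.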

68

Dark Web Analysis Framework (DWAF): Technical Sophistication Measures

• Technical sophistication (TS)

– To study the level of advancement of the techniques used by terrorists to establish and maintain their Web presence.

– Table 1 shows the TS measures identified from Palmer and Griffith (1998).

Measures | Weights

Basic HTML Techniques
Use of Lists | 0/1
Use of Tables | 0/2
Use of Frames | 0/2
Use of Forms | 0/1.5

Embedded Multimedia
Use of Background Image | 0/1
Use of Background Music | 0/2
Stream Audio/Video | 0/3.5

Advanced HTML
Use of phtml/shtml | 0/2.5
Use Predefined Functions? | 0/2
Use Self-defined Functions? | 0/4.5

Dynamic Web Programming
Use CGI | 0/2.5
Use PHP | 0/4.5
Use JSP/ASP | 0/5.5

Comments: All attributes can be automatically identified from terrorist Web sites using programs.

Table 1. Technical Sophistication Measures
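As a rough illustration of how the weights in Table 1 could be turned into per-site scores, the sketch below gives each attribute its weight when present and 0 otherwise. The aggregation rule (group score = earned weight divided by the number of attributes in the group) is an assumption made for this example; the slides do not state how the weighted averages were computed. The attribute keys and the function name `technical_sophistication` are likewise hypothetical.

```python
# Weights copied from Table 1; the key names are illustrative.
TS_WEIGHTS = {
    "Basic HTML Techniques": {
        "lists": 1.0, "tables": 2.0, "frames": 2.0, "forms": 1.5,
    },
    "Embedded Multimedia": {
        "background_image": 1.0, "background_music": 2.0, "stream_audio_video": 3.5,
    },
    "Advanced HTML": {
        "phtml_shtml": 2.5, "predefined_functions": 2.0, "self_defined_functions": 4.5,
    },
    "Dynamic Web Programming": {
        "cgi": 2.5, "php": 4.5, "jsp_asp": 5.5,
    },
}


def technical_sophistication(detected):
    """detected: set of attribute keys found on a site.

    Returns a score per group plus an overall average. The averaging
    rule is an assumption, not the method confirmed by the source.
    """
    scores = {}
    for group, weights in TS_WEIGHTS.items():
        earned = sum(w for attr, w in weights.items() if attr in detected)
        scores[group] = earned / len(weights)
    scores["Average"] = sum(scores.values()) / len(TS_WEIGHTS)
    return scores
```

For example, `technical_sophistication({"lists", "tables", "php"})` scores only the Basic HTML and Dynamic Web Programming groups above zero.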

69

Dark Web Analysis Framework: Media Richness Measures

• Media richness (MR)

– To study how effectively the information is disseminated from terrorist Websites to their target audiences (basic non-interaction function of Websites).

– Table 2 shows the MR measures identified from computer-mediated communication literature (Trevino et al., 1987; Palmer & Griffith, 1998 ).

Measures | Scores
Hyperlink | # of hyperlinks
File/Software Download | # of downloads
Animation | # of animations
Image | # of images
Video/Audio File | # of video/audio files

Comments: “Push Media” and “Content Search” may need manual identification. Other measures can be automatically extracted.

Table 2. Media Richness Measures
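The automatically extractable measures in Table 2 are simple counts, which could be gathered along the following lines. This is a hedged sketch rather than the extraction program used in the study: the file-extension lists, the class `MediaCounter`, and the function `media_richness_counts` are assumptions, and animations are left out because the slides do not say how they were recognized.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed extension lists; the study's actual lists are not given.
DOWNLOAD_EXT = {".zip", ".rar", ".exe", ".pdf", ".doc"}
AV_EXT = {".mp3", ".wav", ".ram", ".avi", ".mpg", ".wmv"}


class MediaCounter(HTMLParser):
    """Counts hyperlinks, downloads, images, and audio/video links in one page."""

    def __init__(self):
        super().__init__()
        self.counts = {"hyperlinks": 0, "downloads": 0, "images": 0, "video_audio": 0}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            self.counts["images"] += 1
        elif tag == "a" and attrs.get("href"):
            self.counts["hyperlinks"] += 1
            # Classify the link target by its file extension.
            ext = os.path.splitext(urlparse(attrs["href"]).path)[1].lower()
            if ext in DOWNLOAD_EXT:
                self.counts["downloads"] += 1
            elif ext in AV_EXT:
                self.counts["video_audio"] += 1


def media_richness_counts(html):
    counter = MediaCounter()
    counter.feed(html)
    return counter.counts
```

Per-site figures such as those reported later would then be sums of these page-level counts over all pages of a site.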

70

Dark Web Analysis Framework: Interactivity Measures

• Web interactivity (WI)
– To study how effectively the terrorist Websites facilitate the interactions between the terrorists and their supporters.
– Contains multiple sub-levels:
  • One-to-one interaction
  • Community-level interaction
  • Transaction-level interaction
– Table 3 shows the WI measures identified from literature (Berthon et al., 1999).

Measures | Weights

One-to-one interaction
Email Feedback | 0/1.75
Email List | 0/2.25
Contact Address | 0/1.25
Feedback Form | 0/2.75
Guest Book | 0/1.5

Community-level interaction
Private Messages | 0/4.25
Online Forums | 0/4.25
Chat Rooms | 0/4.75

Transaction-level interaction
Online Shop | 0/4
Online Payment | 0/4
Online Application Form | 0/4

Comments (all three levels): Automatic extraction + manual identification

Table 3. Web Interactivity Measures

71

Middle East Terrorist Web Collection File Type Breakdown

• Dynamic files (e.g., PHP, ASP, JSP, etc.) are widely used in terrorist Web sites, indicating a high level of technical sophistication.

• Multimedia is also heavily used in terrorist Web sites.

Terrorist Collection # of Files Volume(Bytes)

Total 222,687 12,362,050,865

Indexable Files 179,223 4,854,971,043

HTML Files 44,334 1,137,725,685

Word Files 278 16,371,586

PDF Files 3,145 542,061,545

Dynamic Files 130,972 3,106,537,495

Text Files 390 45,982,886

Powerpoint Files 6 6,087,168

XML Files 98 204,678

Multimedia Files 35,164 5,915,442,276

Image Files 31,691 525,986,847

Audio Files 2,554 3,750,390,404

Video Files 919 1,230,046,468

Archive Files 1,281 483,138,149

Non-Standard Files 7,019 1,108,499,397

[Figure: Number of Files Distribution (Arabic, terrorist collection): Indexable Files 80%, Multimedia Files 16%, Archive Files 0%, Non-Standard Files 4%]

[Figure: Volume Distribution (Arabic, terrorist collection): Indexable Files 39%, Multimedia Files 48%, Archive Files 4%, Non-Standard Files 9%]

72

US Government Web Collection File Type Breakdown

US Government Collection # of Files Volume (Bytes)

Total 277,274 19,341,345,384

Indexable Files 221,684 6,502,288,302

HTML Files 71,518 2,632,912,620

Word Files 298 210,906,045

PDF Files 841 663,293,376

Dynamic Files 145,590 2,071,734,849

Text Files 2,878 555,403,447

Excel Files 4 98,560

Powerpoint Files 5 725,017

XML Files 554 367,214,389

Multimedia Files 49,582 10,835,029,216

Image Files 45,707 850,011,712

Audio Files 3,429 8,153,419,931

Video Files 449 1,831,597,573

Archive Files 538 286,312,990

Non-Standard Files 5,471 1,717,714,876

[Figure: Number of Files Distribution (US): Indexable Files 80%, Multimedia Files 18%, Non-Standard Files 2%, Archive Files 0%]

[Figure: Volume Distribution (US): Indexable Files 33%, Multimedia Files 56%, Non-Standard Files 10%, Archive Files 1%]

• Similarly to the terrorist collection, dynamic files and multimedia are also heavily used in government Web sites.

73

Analysis Results: Technical Sophistication

• Overall, the technical sophistication of terrorist Web sites is on par with US government Web sites.

• US government Web sites are better at the use of basic HTML techniques and dynamic Web programming.

• Terrorist Web sites are using more embedded multimedia.

High-level Attributes | Weighted Average Score (US) | Weighted Average Score (Terrorists) | t-Test Result
Basic HTML Techniques | 0.913043 | 0.710526 | p < 0.0001**
Embedded Multimedia | 0.565217 | 0.833333 | p = 0.0027**
Dynamic HTML | 1.789855 | 1.771929 | p = 0.139
Dynamic Web Programming | 2.159420 | 1.407894 | p = 0.0066**
Average | 1.356884 | 1.180921 | p = 0.060

Table 4. Technical Sophistication Comparison Results
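The comparisons in Table 4 are reported as t-test results on per-site scores from the two collections. The slides do not say which variant of the t-test was used, so the sketch below assumes Welch's unequal-variance form and stops at the test statistic and degrees of freedom; converting these to a p-value would need a t-distribution CDF, which the Python standard library does not provide. The function name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    sample_a and sample_b are per-site scores for the two collections;
    each needs at least two values and a non-zero combined variance.
    """
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

A call such as `welch_t(us_scores, terrorist_scores)` would take two lists of per-site weighted scores; those list names are placeholders, since the per-site data are not part of the slides.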

74

Analysis Results: Media Richness

• Overall, terrorist Web sites are not as good as US government Web sites in terms of Media Richness.

• Terrorist Web sites have significantly fewer hyperlinks and less downloadable content.

Attributes | Average Counts per Site (US) | Average Counts per Site (Terrorists) | t-Test Result
Hyperlink | 3513.254654 | 3172.658483 | p < 0.0001**
File/Software Download | 400.9674532 | 151.868427 | p = 0.0103*
Image | 582.352456 | 540.0484563 | p = 0.466
Video/Audio File | 91.55434783 | 50.9736828 | p < 0.0001**
Average | 1154.531598 | 978.8871471 | p < 0.0001**

Table 5. Media Richness Comparison Results

75

Result Discussions: Technical Sophistication and Media Richness

• Terrorist Web sites use more embedded media but achieved lower media richness scores.

– Government sites use many background images to improve the look of their pages.

– The small background images were not counted as “embedded media.”

– Terrorists use fewer background images, but more media with rich contents such as historical pictures, posters, video/audio recordings, etc.

The Arizona state government homepage alone contains 43 images, 42 of which are small background images (less than 4KB).

76

Result Discussions: Sample Media Provided by Terrorists

• Historical pictures or event pictures
• Many movie clips of several “martyrdom operations” in Iraq were posted in http://wwwlb.dm.net.lb/ubb/Forum4/

Flash animation and pictures depicting Marxist symbols, historical locations, and personalities on the Website of the Iranian People’s Fadaee Guerilla. (Source: http://siahkal.com/)

Documentation (with pictures) of an assassination attempt of Libyan president Mu’amar Kdhafi by members of the “Fighting Islamic Group” guerilla. (Source: http://www.almuqatila.com/)

The hero martyr Abdullah Radwan, may God have mercy upon him, hiding in the crowds and awaiting the arrival of the dictator. Shown clearly inside the red circle

77

Result Discussions: Sample Media Provided by Terrorists

• Posters praising terrorist leaders or inviting men to join Jihad.

A Hamas poster inviting men to join the military struggle. (Source: http://www.palestine-info.com)

Have you fought for the sake of God? You say no. Then you should have your mouth shot.

Emir Zarqawi, may God save him. Eagle of Iraq, volcano of Jihad, and the beheader.

Poster depicting terrorist leader in Iraq, Abu Mus’ab Zarqawi. (Source: http://www.islamic-f.net/vb/)

78

Result Discussions: Sample Media Provided by Terrorists

• Audio/video recordings from terrorist leaders as well as other extremist religious teachings.

A list of audio streams from the website of extremist cleric sheikh Hamed Al Ali. The audio files consist of preaching in the Salafi ideology and political issues.(Source: http://www.h-alali.net)

Kashmiri Jihad and the conference for recognizing Al-Taiba Pakistani extremist organization.

Anbaar Iraqi terrorism websites, audio section. Presents holy war songs and hymns.(Source: http://www.anbaar.net/audio/)

79

Analysis Results: Web Interactivity

• At the overall Web interactivity level, terrorist Web sites do not show significant differences from US government Web sites.

• At one-to-one interaction level, the government Web sites are doing significantly better by providing their contact information (e.g., email, mail address, etc.) on their sites.

• However, terrorist Web sites are doing much better in supporting community-based interaction by providing online forums and chat rooms; while few government Web sites do.

• We did not identify transaction-based interaction in terrorist Web sites, although such interaction might be hidden in their sites.

Attributes | Weighted Average Score (US) | Weighted Average Score (Terrorists) | t-Test Result
One-to-one | 0.342857 | 0.292169 | 0.024*
Community | 0.028571 | 0.168675 | 0.0025**
Transaction | 0.3 | Not presented |
Average (Transaction not included) | 0.185714 | 0.230422 | 0.056

Table 6. Web Interactivity Comparison Results

80

Result Discussions: Web Interactivity

• Terrorists use guest books and forums intensively to facilitate the communications among themselves and their supporters.

The Qalaa forum, one of the largest terrorist forums, has tens of thousands of threads and hundreds of thousands of replies. (Source: http://www.qal3ati.net/)

An Al Qaeda guest book with 176 signatures (Source: http://www.alfida.jeeran.com/)

Welcome to the guest book of the Fida’ Website (Website of Sacrifice)

81

Applying Authorship Analysis to Web Forums: Identification and

Authentication

82

Authorship Identification Characteristics

• Features
– Attributes or writing style features that are the most effective discriminators.
– Lexical
  • Word or character-based measures (e.g., sentence length, vocabulary richness, etc.).
– Syntactic
  • Sentence-level writing style (e.g., punctuation, function words).
– Structural
  • Text organization and layout (de Vel et al. 2001)
– Content Specific
  • Keywords on specific topics (Martindale & McKenzie, 1995)

• Techniques
– Analytical methods used to discriminate between authors.
– Machine learning approaches typically outperform statistical methods due to greater computational power and ability to handle noisy data

• Parameters
– Number of categories and number of records per category used in experiments.
– Generally, there will be some degree of drop off in performance as the number of authors increases.
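To make the feature categories above concrete, the sketch below extracts a handful of lexical, syntactic, and structural measures from one message. It is only an illustration: the function name `style_features`, the five-word function-word list, and the choice of measures are assumptions, and the feature sets described in the slides are far larger.

```python
import re
from collections import Counter

FUNCTION_WORDS = ("the", "of", "and", "to", "in")  # tiny illustrative list
PUNCTUATION = ".,;:!?"


def style_features(message):
    """Return a small dictionary of writing-style features for one message."""
    # Words are runs of letters (this also matches non-Latin scripts).
    words = re.findall(r"[^\W\d_]+", message.lower())
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    word_counts = Counter(words)
    n_words = len(words) or 1  # avoid division by zero on empty input

    features = {
        # Lexical: character- and word-based measures
        "total_chars": len(message),
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / (len(sentences) or 1),
        "vocab_richness": len(word_counts) / n_words,  # type/token ratio
    }
    # Syntactic: punctuation counts and function-word frequencies
    for mark in PUNCTUATION:
        features["punct_" + mark] = message.count(mark)
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = word_counts[fw] / n_words
    # Structural: a single layout measure
    features["num_paragraphs"] = len([p for p in message.split("\n\n") if p.strip()])
    return features
```

Each message becomes one feature vector, and vectors labeled by author are what a classifier such as SVM or C4.5 would be trained on.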

83

Online and Multilingual Messages

• Online Messages
– Increasingly popular area due to increased misuse of the Internet (cyber crime).
– Email
  • Objective is to classify a set of emails as belonging to a particular author.
  • de Vel et al. 2000, 2001
– Web Forums
  • Attribute authorship of posted messages in chat groups.
  • Zheng et al. 2005; Li et al., 2005; Abbasi & Chen, 2005.
– Online Newspapers
  • Evaluated online newspaper corpus.
  • Stamatatos et al., 2001

• Multilingual Content
– Applying authorship analysis techniques across different languages.
– Greek newspaper corpus.
  • Stamatatos et al. 2001
– Greek, Chinese, and English novels.
  • Peng et al. 2003
– Chinese and English web forum messages.
  • Zheng et al. 2005
– Russian novels
  • Khemeniv, 2003

84

Arabic Feature Extraction Component

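[Figure: Arabic feature extraction component, with steps numbered 1-4. Labels: Incoming Message → Elongation Filter (Count +1, Degree +5) → Filtered Message → Root Clustering Algorithm (with Root Dictionary) → Similarity Scores (SC) → max(SC)+1 → Feature Set; Generic Feature Extractor → All Remaining Features Values → Feature Set]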

85

Arabic Feature Set

[Figure: Arabic feature set hierarchy, 418 features in total]

Lexical (79)
– Char-Based (48): Char-Level (4), Letter Frequency (35), Special Char. (9)
– Word-Based (31): Word-Level (6), Vocab. Richness (8), Word Length Dist. (15), Elongation (2)

Syntactic (262)
– Punctuation (12), Function Words (200), Word Roots (50)

Structural (62)
– Word Structure and Technical Structure (sub-group counts 48 and 14)
– Leaf features: Message Level, Paragraph Level, Contact Information, Font Color, Font Size, Embedded Images, Hyperlinks (leaf counts shown in the figure: 5, 6, 3, 29, 8, 4, 7)

Content Specific (15)
– Race/Nationality (11), Violence (4)

86

An Authorship Identification Framework

[Figure: authorship identification framework. Collection: The Web, Dark Web; Collect Web Forum Messages (Text Format, HTML Format). Extraction: Extract Features; Elongation Filter, Root Dictionary, Clustering Algorithm, Filtered Words, Roots, Word Root Feature Values; Writing Features, Technical Structure Features, Writing/Technical Feature Values; Extracted Values; Feature Set with feature types Lexical, Syntactic, Structural, Content. Experiment: Experiments 1-4 using the experimental techniques SVM and C4.5; SVM Accuracy and C4.5 Accuracy compared with pair-wise t-tests for Feature Set Relevance, Technique Relevance, and Predictive Ability]

87

Experiment Results

Features | English Dataset (C4.5) | English Dataset (SVM) | Arabic Dataset (C4.5) | Arabic Dataset (SVM)
F1 | 85.76% | 88.0% | 61.27% | 87.77%
F1+F2 | 87.23% | 90.77% | 65.40% | 91.00%
F1+F2+F3 | 88.30% | 96.5% | 71.23% | 94.23%
F1+F2+F3+F4 | 90.10% | 97.00% | 71.93% | 94.83%

[Figure: chart of C4.5 and SVM accuracy (axis from 50.00 to 100.00) for F1, F1+F2, F1+F2+F3, and F1+F2+F3+F4 on the English and Arabic datasets]

88

Group Inferences Based on Writing Style

Summary of Previous Authorship Visualization Studies

• All previous studies used n-grams.
• None of the previous studies used an automated technique for evaluating the visualizations.
• None of the studies were applied to online messages.
• There is no indication of whether the techniques can be successfully applied in a multilingual setting, such as in cyberspace.

Study | Type | Visualization Name | Features | Techniques | Dataset | Evaluation
Kjell et al., 1994 | Authorship Identification | Nebulas, Histograms | N-grams | PCA, Cosine similarity | Federalist papers | Manual
Shaw et al., 1999 | Authorship Identification | SFA | N-grams | PCA | Biblical Texts | Manual
Ribler & Abrams, 2000 | Similarity Detection | Patterngrams | N-grams | Matching Algorithm | Student Programs | Manual

90

Authorship Visualization Process Design

[Figure: authorship visualization process. Steps: Collect Messages → Extract Features → Reduce Dimensionality → Generate Visualization Data → Create Visualizations → Perform Analysis. Labels: The Web; input messages; extracted messages; Feature Set; Feature Extractor; feature vectors; Feature Usage Storage; feature usage values; Principal Component Analysis (eigenvectors); Entropy Based Feature Selection (key features); Dynamic Sliding Window Algorithm (pattern coordinates); Ink Blot Algorithm (blot sizes/colors); Writeprints; Ink Blots; Identification; Authentication. Sample message shown: “This is the first message written in a long long time since the olden days.”]

91

Writeprints Using PCA

• Transform by multiplying feature usage vectors with eigenvectors.
– The sum of the product of the primary eigenvector and the feature vector is the x-coordinate.
– The sum of the product of the secondary eigenvector and the feature vector is the y-coordinate.
• Plot transformations onto 2D/3D plane.
• Sliding Window Algorithm (Kjell et al., 1994)
– An iterative algorithm used to generate more data points in order to create better writing patterns for text documents by capturing usage variations at a finer level of granularity.

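Sliding Window Algorithm Illustration

[Figure: sliding window illustration over the message text]
1. Compute eigenvectors for the 2 principal components of the feature group. Values shown: 0.533, 0.956, -0.541, 0.445, 0.034, 0.089 and 0.653, 0.456, 0.975, -0.085, 0.143, -0.381.
2. Extract feature usage vectors Z from the message text. Vectors shown: (1, 0, 0, 2, 1, 2) and (0, 1, 3, 0, 1, 0).
3. Transform into 2-dimensional space: $x = \sum_i Z_i e^{(1)}_i$ and $y = \sum_i Z_i e^{(2)}_i$, where $e^{(1)}$ and $e^{(2)}$ are the primary and secondary eigenvectors.
Repeat steps 2 and 3.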
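The transformation and the sliding window can be sketched in a few lines. The code below is an illustration under stated assumptions, not the Writeprints implementation: it takes the eigenvectors as given rather than computing them with PCA, it treats the twelve values in the illustration as two six-element eigenvectors, it uses single characters as features, and the window and step sizes are arbitrary. The names `project`, `usage_vector`, and `writeprint_points` are hypothetical.

```python
def project(feature_vector, eigenvector):
    """Sum of the element-wise products of a usage vector and an eigenvector."""
    return sum(f * e for f, e in zip(feature_vector, eigenvector))


def usage_vector(text, features):
    """Count how often each feature (here a single character) occurs in text."""
    return [text.count(f) for f in features]


def writeprint_points(message, features, primary, secondary, window=250, step=50):
    """Slide a fixed-size window over the message and map each window to (x, y)."""
    points = []
    last_start = max(len(message) - window, 0)
    for start in range(0, last_start + 1, step):
        z = usage_vector(message[start:start + window], features)
        points.append((project(z, primary), project(z, secondary)))
    return points


# Values taken from the illustration; the 6 + 6 split is an assumption.
PRIMARY = [0.533, 0.956, -0.541, 0.445, 0.034, 0.089]
SECONDARY = [0.653, 0.456, 0.975, -0.085, 0.143, -0.381]
Z = [1, 0, 0, 2, 1, 2]

x = project(Z, PRIMARY)    # about 1.635
y = project(Z, SECONDARY)  # about -0.136
```

Each window contributes one point, so a longer message yields a denser pattern, which is the stated purpose of the sliding window.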

93

Selected Feature Groups

• Based on these criteria, the categories selected are highlighted.

• Function words were omitted since there were too many.

• Structural features could not be captured using the sliding window, so they were transformed using feature vectors at the message level.

Feature Group English Arabic

Char-Level Lexical 6 4

Letter Usage 26 35

Special Char. 21 15

Word-Level Lexical 6 6

Word Length 20 15

Punctuation 8 12

Function Words 150 250

Structural 14 14

Content Specific 15 15

Vocab. Richness 8 8

94

Example Author Writeprint

Interpreting Writeprint

Feature x y

~ 0 0

@ 0.022814 -0.01491

# 0 0

$ -0.01253 -0.17084

% 0 0

^ -0.01227 -0.01744

& -0.01753 -0.0777

* -0.03017 -0.05931

- -0.12656 0.991784

_ 0.998869 0.047184

= -0.05113 -0.07576

+ 0.142534 0.021726

> -0.1077 0.392182

< -0.10618 0.213193

[ 0 0

] 0 0

{ 0 0

} 0 0

/ -0.05075 -0.09065

\ 0 0

| -0.05965 0.428848

Special Char. Eigenvectors

Author A

Author B

Author C

Author D

Special Char. Writeprints

96

[Figure: Author Writeprints for Author A and Author B (10 messages each) shown alongside Anonymous Messages]

Determining Blot Size and Color

• The size of a blot is proportional to the ratio of entropy reduction to message length (except for structural features).
– Done to compensate for biases in favor of lengthier messages (blot overflow).
– For example, letter/word/punctuation usage is greater in longer messages.

• Color is based on feature usage. Heavy usage is red, low usage is blue, and everything in between is yellow.

• Thus, correct author-message matches should result in predominantly red ink blot patterns (“hot”) and a minimal amount of blue (“cold”).

Size: $d \propto e / c$

d = blot size

c = message length in characters

e = entropy reduction
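The size and color rules can be written down directly. In this sketch the proportionality constant `scale`, the thresholds `q1` and `q3` separating low, medium, and heavy usage, and the use of blot size as a stand-in for blot area are all assumptions made for illustration; the function names are hypothetical.

```python
def blot_size(entropy_reduction, message_length, scale=1.0):
    """Blot size d proportional to e / c; scale is an arbitrary display constant."""
    if message_length <= 0:
        raise ValueError("message_length must be positive")
    return scale * entropy_reduction / message_length


def blot_color(usage, q1, q3):
    """Low usage is blue, heavy usage is red, everything in between is yellow."""
    if usage < q1:
        return "blue"
    if usage > q3:
        return "red"
    return "yellow"


def red_to_blue_ratio(blots):
    """blots: iterable of (size, color) pairs. Returns total red size over total blue size."""
    blots = list(blots)
    red = sum(size for size, color in blots if color == "red")
    blue = sum(size for size, color in blots if color == "blue")
    if blue == 0:
        return float("inf") if red else 0.0
    return red / blue
```

Under this reading, comparing one message against several candidate authors means computing the ratio once per candidate and preferring the candidate with the largest value.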

Tuning Blot Colors

• The color settings for each feature are “tuned” on the training set (by optimizing settings of q1 and q3).
– This is done by maximizing the ratio of red to blue area in correct messages and maximizing the ratio of blue to red in incorrect messages.
– The terms “low”, “medium” and “high” are defined based on usage rank thresholds set by q1 and q3.
– Since decision trees tend to pick outliers (in order to maximize entropy reduction/info. gain) this approach works well.

[Figure: usage scale from min to max divided by q1 and q3 into Low, Medium, and High ranges, shown for a feature's initial setting and its tuned setting]

99

Ink Blot Example: Software Dataset

[Figure: ink blot patterns for Message 1 and Message 2 shown for Author A, Author B, Author C, and Author D]

100

Ink Blots: Al-Aqsa Martyr Dataset

This image shows 10 potential authors for a single message. Using Ink Blots, we can easily identify the correct author (the one with the greatest ratio of red/blue blots).

101

Evaluating Visualization Techniques

[Figure: the authorship visualization process diagram, repeated. Steps: Collect Messages → Extract Features → Dimensionality Reduction → Generate Visualization Data → Create Visualizations → Perform Analysis, with Writeprints (Principal Component Analysis, Dynamic Sliding Window Algorithm) and Ink Blots (Entropy Based Feature Selection, Ink Blot Algorithm) applied to Identification and Authentication]

102

Writeprint Results

Forum | 10-message groups (Writeprint) | 10-message groups (SVM) | 5-message groups (Writeprint) | 5-message groups (SVM) | 1-message groups* (Writeprint) | 1-message groups* (SVM)
USENET Software | 100.00% | 50.00% | 95.00% | 55.00% | 76.19% | 93.00%
White Knights of KKK | 100.00% | 60.00% | 100.00% | 65.00% | 85.14% | 94.00%
Al-Aqsa Martyrs | 100.00% | 50.00% | 90.00% | 60.00% | 68.89% | 87.00%

It should be noted that for individual messages, Writeprint was not able to perform on messages shorter than 250 characters (approximately 35 words) due to the need to maintain a minimum sliding window size and gather sufficient data points for the evaluation algorithm. The table below shows the number of single messages classified out of the testing set of 100 per forum.

Forum Messages Classifiable

USENET Software 53

White Knights of KKK 60

Al-Aqsa Martyrs 74

103

Ink Blot Results

Forum | Shorter Messages, < 200 characters (Ink Blots) | Shorter Messages, < 200 characters (SVM) | All Test Messages (Ink Blots) | All Test Messages (SVM)
USENET Software | 97.87% | 94.59% | 95.00% | 93.00%
WK of the KKK | 97.50% | 92.31% | 88.00% | 94.00%
Al-Aqsa Martyrs | 69.23% | 84.62% | 75.00% | 87.00%

• In comparing the Ink Blots to SVM, the Ink Blots technique outperformed SVM on the USENET dataset but was outperformed overall when testing all messages.

• When evaluating the shorter test messages of length less than 200 characters (the messages unclassifiable by Writeprints), the Ink Blots tended to outperform SVM.

• Overall, the Ink Blot technique did not work as well on the Arabic messages.
– This could be attributable to the inability of the entropy-based feature selection technique to identify features that were clear cut enough to distinguish authors within the Al-Aqsa Martyrs forum.

104

The Partnership and Collaboration Framework

105

• Ensuring Data Security and Confidentiality

• Reaching Agreements among Partners

• The COPLINK Chronicle

• Future Directions

The Partnership and Collaboration Framework


106

• The Department of Homeland Security has proposed to establish a network of research centers across the nation
– To create a multidisciplinary environment for developing technologies to counter various threats to homeland security

• A variety of barriers need to be addressed, including:
– Security and confidentiality
  • Data regarding crimes, criminals, terrorist organizations, and potential terrorist attacks may be highly sensitive and confidential
  • Improper use of data could lead to fatal consequences
– Trust and willingness to share information
  • Different agencies may not be motivated to share information and collaborate if there is no immediate gain
  • Fear that shared information would be misused, resulting in legal liabilities
– Data ownership and access control
  • Who owns a particular data set? Who is allowed to access, aggregate, or input data?
  • Who owns the derivative data (knowledge)?

Introduction

107

• The NSF COPLINK Center at the Artificial Intelligence (AI) Lab of the University of Arizona is intended to become a part of the national network of ISI research laboratories.

– The COPLINK Center is a leading NSF research center for law enforcement and intelligence information and knowledge management

– The COPLINK Center has encountered many of these non-technical challenges in its partnerships with various law enforcement and federal agencies such as:
  • Tucson Police Department (TPD)
  • Phoenix Police Department (PPD)
  • Tucson Customs and Border Patrol (CBP)

• We present some of our experiences and lessons learned in this section.

The NSF COPLINK Center

108

• At the COPLINK Center, we have taken the necessary measures to ensure data privacy, security, and confidentiality

– Only law enforcement data are shared between agencies

– All personnel who have access to law enforcement data are screened
  • Background information and fingerprints are checked by TPD investigators
  • All personnel sign a non-disclosure agreement (NDA) provided by TPD and take the Terminal Operator Certificate (TOC) test every year
  • Requirements are similar to those imposed upon non-commissioned civilian personnel in a police department

– All law enforcement data reside behind a firewall and in a secure room accessible only by activated cards

– When an employee stops working on projects involving these data:
  • Their card is de-activated
  • The NDA is perpetual and remains in effect

Ensuring Data Security and Confidentiality

109

• A sample individual user data license agreement was developed by university contracting officers and lawyers in several institutions and government agencies.

• Most of the terms and conditions are applicable to national security projects that demand confidentiality.

• It consists of the following sections:

– Permitted Uses
– Access to the Information
– Indemnification
– Delivery and Acceptance

A Sample Individual User Data License

110

• Agreements between agencies within their respective jurisdictions are required to receive advance approval from their governing hierarchy
– This precludes informal information sharing agreements.

• Requirements varied from agency to agency according to the statutes by which they were governed.
– The ordinances governing information sharing by the city of Tucson varied somewhat from those governing the city of Phoenix.

• Similar language existed in the ordinances and statutes governing this exchange but the process varied significantly

• It appears as though the size of the jurisdiction is proportional to the level of bureaucracy required.
– Negotiating a contract between University of Arizona and ARJIS (Automated Regional Justice Information System) of Southern California required six to nine months of discussion between legal staff, contract specialists, and agency officials.

Reaching Agreements among Partners

111

• TPD has recently developed a generic Inter-Governmental Agreement (IGA) that could be adopted between different law enforcement agencies.
– The IGA was condensed from MOUs (Memoranda of Understanding), policies, and agreements that previously existed
– The IGA was drafted in a generic manner, including language from those laws, but excluding reference to any particular chapter or section.

• Sharing of information between agencies with disparate information systems has also led to bridging boundaries between software vendors and agencies (their customers).
– We ensured that non-disclosure agreements existed
– We ensured that contract language assured compliance with the vendors’ licensing policies.

• We believe the MOU and IGA can be used as templates for information sharing agreements and contracts and serve as a component of an ISI partnership framework.

Inter-Governmental Agreement (IGA)

112

• Many agencies, partners, and individuals have contributed significantly to the success of this program

• The COPLINK System
– Has been cited as a national model for public safety information sharing and analysis
– Has been adopted in more than 150 law enforcement and intelligence agencies
– Has been featured in New York Times, Newsweek, Los Angeles Times, Washington Post, and Boston Globe, among others
– Was selected as a finalist for the prestigious International Association of Chiefs of Police (IACP)/Motorola 2003 Webber Seavey Award for Quality in Law Enforcement

• The research has recently been expanded to border protection (BorderSafe), disease and bioagent surveillance (BioPortal), and terrorism informatics research (Dark Web), funded by NSF, CIA, and DHS

The COPLINK System

113

• September 1994-August 1998, NSF/ARPA/NASA, Digital Library Initiative (DLI) funding: Selected concept association and data mining techniques developed under the DLI program.

• July 1997-January 2000, DOJ, National Institute of Justice (NIJ) funding: Initial COPLINK research -- database integration and access for a law enforcement Intranet.

• January 2000, first COPLINK prototype: Developed and tested in Tucson Police Department.

• May 2000, Knowledge Computing Corporation (KCC) founded: KCC received venture capital funding and licensed COPLINK technology.

• November 2, 2002, DC Sniper investigation, New York Times: “An electronic cop that plays hunches.”

• April 15, 2003, Newsweek and ABC News: “Google for cops.”

• September 2003-August 2005, NSF, DHS, CNRI funding for BorderSafe project: Cross-jurisdictional information sharing and criminal network analysis.

• September 2003-August 2006, NSF, Digital Government Program funding for Dark Web project: Social network analysis and identity deception detection for law enforcement and homeland security.

• August 2004-July 2008, NSF, Information Technology Research (ITR) Program funding for BioPortal project: A national center of excellence for infectious disease informatics.

The COPLINK Chronicle

114

• The BorderSafe project
– Continue to contribute to border safety and cross-jurisdictional criminal network analysis research

• The Dark Web project
– Help create an invaluable terrorism research testbed
– Develop advanced terrorism analysis methods

• The BioPortal project
– Contribute to the development of a national or even international infectious disease and bioagent information sharing and analysis system

Future Directions

115

• New technologies should be developed in a legal and ethical framework without compromising privacy or civil liberties of private citizens.

• Large-scale, non-sensitive data testbeds consisting of data from diverse, authoritative, and open sources and in different formats should be created and made available to the ISI research community.

• The ultimate goal of ISI research is to enhance our national security. However, the question of how this type of research has impacted and will impact society, organizations, and the general public remains unanswered.

• Active ISI research will help improve knowledge discovery and dissemination and enhance information sharing and collaboration among academics, local, state, and federal agencies, and industry, thereby bringing positive impacts to all aspects of our society.

Conclusions and Future Directions

116

• Tucson Police Department
• Phoenix Police Department
• Pima County Sheriff Department
• Tucson Customs and Border Protection
• San Diego, Automated Regional Justice Information Systems (ARJIS)
• Corporation for National Research Initiatives (CNRI)
• California Department of Health Services
• New York State Department of Health
• United States Geological Survey
• Library of Congress
• San Diego Supercomputer Center (SDSC)
• National Center for Supercomputing Applications (NCSA)

Acknowledgements

117

For more information:

AI Lab web site: http://ai.arizona.edu

[email protected]