
NIMD

Exploring Massive Structured Data with ARGUS

PI Meeting, November 29, 2005

Main contacts: Prof. Jaime Carbonell – [email protected]

Dr. Santosh Ananthraman – [email protected]


Project ARGUS Objectives

1. Novelty detection in structured databases or data streams
• Detect and track situation-specific “alert-watch” patterns
• Cluster analysis to establish background (normal) models
• Cluster density and locus analysis for early detection of new pattern onset, or meaningful change to established pattern

2. Data Explorer - analyst interface
• Framework for intensive, analyst-directed data exploration
• Applications
  • MED: Massachusetts hospital admission database to detect attacks by biological agents
  • NED: Network anomaly/attack detection with CERT®, the federally funded computer security incident response center at CMU

3. Fast multi-dimensional structured-data matching
• Exact and approximate matching
• Scalable: O(10^6) to O(10^12) records
• Profile matching for streaming data: O(1) to O(10^6) profiles


Role of ARGUS in Hypothetical End-to-End Multifunctional Architecture

[Architecture diagram; labels recovered from the figure]

• Analysts
• Raw data sources: Net Traffic, Annotations, RSS, Financial, Reports, Materials, Customs, News, Biometric, Other
• Data Normalization & Modeling
• Structured data: Banking Transactions, Network Traffic, Hospital Admissions, Extracted News Archives, Extracted Agency Reports
• Distributed Structured Search Engines
• Structured Data Search: Exact, Approximate, Massive data, Streaming data
• Analysis Subsystem – Text & Data
• Novelty Detection, Text Extraction, Mobile Agents
• Analyst Workstation, Analyst Interface, Exploration
• Query Generation, Massive Search Control, Data Source Prioritization, Active Context Control, Hypothesis Management
• Situation Assessment Validation, Analyst Collaboration, Hypothesis Evaluation
• Profile Queries, Matched Events, Events and Alerts


Novelty Detection

• Objective:
– Detect the onset of novel events in incoming data streams
– Generate alert for analyst (with justification)
– If judged significant, track developments; else discard

• Properties
– Need a model of “business as usual” to detect divergences therefrom (done by clustering recent history)
– Control points (tradeoff in precision-recall)
  • Degree of deviation from normalcy required
  • Amount of data support (e.g. # of observations) before alerting
  • Statistical model of normal “noise” in data streams
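The three control points can be illustrated with a minimal Python sketch. It assumes a one-dimensional stream and a Gaussian noise model estimated from recent history. The function and parameter names (`novelty_alert`, `min_deviation`, `min_support`) are hypothetical and not taken from ARGUS.

```python
from statistics import mean, stdev

def novelty_alert(history, incoming, min_deviation=3.0, min_support=5):
    """Return (alert, deviants) for a batch of incoming observations.

    history       -- recent "business as usual" values (background model)
    incoming      -- new observations from the stream
    min_deviation -- required deviation from normal, in standard deviations
    min_support   -- number of deviating observations needed before alerting
    """
    mu = mean(history)
    sigma = stdev(history)  # assumed statistical model of normal "noise"
    deviants = [x for x in incoming if abs(x - mu) > min_deviation * sigma]
    return len(deviants) >= min_support, deviants

# Illustrative data only: three of the five new values are far from normal.
history = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]
alert, deviants = novelty_alert(history, [10, 25, 26, 11, 27], min_support=3)
```

Raising `min_deviation` or `min_support` trades recall for precision: fewer alerts are raised, but each has stronger support.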


Cluster Evolution and Density Change Detection

[Figure panels: Constant Event, New Unobfuscated Event, New Obfuscated Event, Growing Event]


Visualizations in Display Area


Sample Application: Monitoring for Bioterrorism

• Database of all Massachusetts hospital stays with discharge dates between 10/2000 and 9/2001 (835,895 records)

• 18 fields per record, including:
– provider (hospital)
– patient (gender, age, birthdate, race, ZIP)
– timing (admit date, length of stay)
– diagnoses (up to 8 with one primary)
– payment source

• Cluster to form background models
• Inject new streaming data that may include potential threats (e.g. SARS, Anthrax, toxin-based attack, …)

• New Mini-Cluster Analysis reveals outbreaks of:
– Tularemia
– Dengue Fever
– Myiasis
– Chagas Disease

• SARS Outbreak simulation
– Added new records for patients from a small geographical region diagnosed with influenza in 9/2001
– Graph shows resulting secondary peak in the pulmonary disease density function
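The slide does not say how the density function is computed. As a rough sketch, assume it is a histogram of Euclidean distances from a cluster centroid, and treat any well-supported local maximum other than the main one as a secondary peak. The function names, bin width, and sample points below are illustrative assumptions.

```python
import math

def radial_density(points, centroid, bin_width, n_bins):
    """Histogram of point distances from the cluster centroid."""
    counts = [0] * n_bins
    for p in points:
        b = int(math.dist(p, centroid) // bin_width)
        if b < n_bins:  # points beyond the last bin are ignored
            counts[b] += 1
    return counts

def secondary_peaks(counts, min_count=3):
    """Indices of local maxima other than the single highest bin."""
    main = max(range(len(counts)), key=counts.__getitem__)
    peaks = []
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i + 1 < len(counts) else 0
        if i != main and c >= min_count and c > left and c >= right:
            peaks.append(i)
    return peaks

# Background records sit near the centroid; injected records form a ring farther out.
background = [(0.1 * k, 0.0) for k in range(10)] + [(1.2, 0.0), (1.5, 0.0)]
injected = [(5.1, 0.0), (5.2, 0.0), (0.0, 5.3), (0.0, 5.4)]

before = radial_density(background, (0.0, 0.0), 1.0, 8)
after = radial_density(background + injected, (0.0, 0.0), 1.0, 8)
new_peaks = secondary_peaks(after)
```

With this toy data the injected records produce one secondary peak in the bin covering distances 5 to 6, while the background alone produces none.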


CERT Collaboration

• Working with CERT on NetFlow data for scalable detection of network attack patterns (viruses, denial of service, unauthorized entry attempts, etc.)


CERT: Preliminary Data Analysis

• Principal component analysis is used for data reduction: the 11 input features are reduced to 3 principal components (PC1, PC2 and PC3 below), which capture 54%, 25% and 13%, respectively, of the variance in the original 11 features

• For example, PC2 is composed mainly of DST FLOWS, PKTS, and BYTES, and PC3 mainly of UNIQ_PORTS, SUBNETS and DSTPORT

• Clustering in principal-component space is used to explore automatically generated aggregations and abstractions of the data for meaningful matching and pattern detection
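As a check on the variance figures, the share of variance captured by component i in standard PCA is its eigenvalue divided by the sum of all eigenvalues of the feature covariance matrix. The symbols below are notation introduced here, not taken from the slide. The three retained components together account for 92% of the variance:

```latex
\[
\mathrm{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{11} \lambda_j},
\qquad
\mathrm{EVR}_1 + \mathrm{EVR}_2 + \mathrm{EVR}_3 = 0.54 + 0.25 + 0.13 = 0.92
\]
```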


Scalable Matcher

• In-memory matchers: faster than 1/100th of a second per match

• Disk matcher: faster than 1/10th of a second per match until the disk-access barrier is reached; 1 second per match above 10^8 records (on a 2-year-old processor)

Matcher Version   Record Volume     Time Complexity         Status
In-memory         10^6 to 10^8      Logarithmic             Mature
Disk-based        10^7 to 10^10     Low power-law           Algorithmically stable
Distributed       10^9 to 10^12     As underlying matcher   Initial prototype only
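The slides do not describe the matcher's data structures. To show how logarithmic lookup time can arise, here is a stand-in sketch using binary search over a sorted in-memory index on a single numeric field. The class name, record layout, and tolerance-based notion of approximate matching are assumptions for illustration, not the ARGUS multi-dimensional matcher.

```python
import bisect

class SortedIndex:
    """In-memory index over one numeric field with O(log n) lookups."""

    def __init__(self, records, field):
        # Sort once up front; every lookup afterwards is a binary search.
        self._rows = sorted(records, key=lambda r: r[field])
        self._keys = [r[field] for r in self._rows]

    def exact(self, value):
        """All records whose field equals value."""
        lo = bisect.bisect_left(self._keys, value)
        hi = bisect.bisect_right(self._keys, value)
        return self._rows[lo:hi]

    def approximate(self, value, tolerance):
        """All records whose field lies within value +/- tolerance."""
        lo = bisect.bisect_left(self._keys, value - tolerance)
        hi = bisect.bisect_right(self._keys, value + tolerance)
        return self._rows[lo:hi]

# Illustrative records only.
records = [
    {"id": 1, "amount": 500},
    {"id": 2, "amount": 1000},
    {"id": 3, "amount": 1010},
    {"id": 4, "amount": 2000},
]
index = SortedIndex(records, "amount")
exact_ids = [r["id"] for r in index.exact(1000)]
near_ids = [r["id"] for r in index.approximate(1000, 15)]
```

Locating the range boundaries takes two binary searches regardless of how many records match; returning the matches themselves costs time proportional to their number.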


Matching Data Streams to Profiles

[Diagram labels: Data Streams, Novelty Detection, Analyst, Matcher, Profiles, Novel Events, Alerts, New Profiles]

Profile = “alert-watch” pattern

• Generated by analyst

• Generated by novelty detection & vetted

• Need rapid matching for 10^5+ simultaneously active profiles
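One common way to keep per-record cost low with many active profiles is to index the profiles by an equality condition, so each incoming record is checked only against profiles that could match. The sketch below assumes that approach. The class, the profile names, and the record fields are hypothetical and do not describe the ARGUS matcher.

```python
from collections import defaultdict

class ProfileMatcher:
    """Index profiles by a (field, value) key plus an optional extra test."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def add_profile(self, name, field, value, predicate=lambda record: True):
        # value must be hashable because it is used as part of a dict key.
        self._by_key[(field, value)].append((name, predicate))

    def match(self, record):
        """Names of all profiles triggered by one stream record."""
        hits = []
        for field, value in record.items():
            for name, predicate in self._by_key.get((field, value), ()):
                if predicate(record):
                    hits.append(name)
        return hits

matcher = ProfileMatcher()
matcher.add_profile("large-wire", "type_code", 1000,
                    lambda r: r["amount"] > 1000000)
matcher.add_profile("zip-watch", "zip", "02139")

hits_large = matcher.match({"type_code": 1000, "amount": 2500000})
hits_small = matcher.match({"type_code": 1000, "amount": 10})
hits_zip = matcher.match({"zip": "02139", "type_code": 2})
```

The work per record depends on the number of fields in the record and on the candidate profiles sharing a key with it, not on the total number of registered profiles.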


Profile Sharing Framework

[Diagram labels: ARGUS Query Network Manager, Query, Data Tables, Analyst, Identified Threats, Data Streams, Query Network, System Catalog, DynamixMatcher]


Evaluation

MED: Bio-surveillance

FED: Fedwire suspicious transaction pattern tracking

[Two charts; x-axis: # of queries (100 to 565 in the first chart, 100 to 768 in the second); series: NonJoinS, MatchPlan+NCanon, AllSharing]

Average time per query (seconds):

                    NonJoinS   MatchPlan+NCanon   AllSharing
With 565 queries    0.20       0.12               0.11
With 768 queries    0.25       0.12               0.04


ARGUS Achievements: Summary

• Solid scientific underpinnings
– Efficient algorithms for approximate search and exploration
– Efficient matching of complex patterns on streaming data
– Novelty detection via radial cluster-density function analysis

• Prototype development
– User validation of utility of techniques (at NIST)
– Analyst GUI - Data Explorer (under development)
– Applications
  • MED: Massachusetts hospital admission database for detection of attacks by biological agents
  • FED: Fedwire Money Transfer database for suspicious transaction pattern tracking
  • NED: NetFlow database from CERT® for scalable detection of network attack patterns

• Sufficient progress to interest operational IC
– Exploring collaboration with GDAIS (their client has >10^8 transactions daily, >10^10 records total)
– Getting ready for stage 1 RDEC insertion


Additional Slides for Q & A Session


Cluster Evolution

[Figure panels: Constant Event, New Unobfuscated Event, New Obfuscated Event, Growing Event]


Novelty Detection

Functionality

• Build background model
– Expected Events (clusters)

• Find divergences
– Individual outliers (but many false positives)
– New Mini-clusters (more reliable, unobfuscated new-event detection)
– Detect when a novel event is masked by ordinary happenings or intentionally obfuscated

• Trigger Alerts
– Route & Prioritize
– Formulate hypotheses for Analyst

Technology

• Modeling methods
– (Hierarchical) k-means

• Divergence metrics
– Radial density gradients from cluster centroid
– Temporally-adaptive distance measures
– Secondary peaks in density function

• Create analyst profiles
– RETE-based SAMs methods (last PI-meeting ARGUS paper)


ARGUS Query Network Manager

[Diagram labels: Query, Query Network, ARGUS Query Network Manager, System Catalog, Coordinator, Common Computation Identifier, Sharing Optimizer, Projection Manager, Network Topology & Operation Manager, Query Rewriter, Query Optimizer, Code Assembler]


Recording & Identifying Common Comps

Example query predicates (first query):

  r2.type_code = 1000
  r3.type_code = 1000
  r1.type_code = 1000
  r1.amount > 1000000
  r1.rbank_aba = r2.sbank_aba
  r1.benef_account = r2.orig_account
  r2.amount * 2 > r1.amount
  r1.tran_date <= r2.tran_date
  r2.tran_date <= r1.tran_date + 10
  r2.rbank_aba = r3.sbank_aba
  r2.benef_account = r3.orig_account
  r2.amount = r3.amount
  r2.tran_date <= r3.tran_date
  r3.tran_date <= r2.tran_date + 10

Example query predicates (second query):

  r1.type_code = 1000
  r1.amount > 1000000
  r2.type_code = 1000
  r2.amount > 500000
  r3.type_code = 1000
  r3.amount > 500000
  r1.rbank_aba = r2.sbank_aba
  r1.benef_account = r2.orig_account
  r2.amount * 2 > r1.amount
  r1.tran_date <= r2.tran_date
  r2.tran_date <= r1.tran_date + 10
  r2.rbank_aba = r3.sbank_aba
  r2.benef_account = r3.orig_account
  r2.amount = r3.amount
  r2.tran_date <= r3.tran_date
  r3.tran_date <= r2.tran_date + 10

Canonical forms (examples):

  r1.amount - r2.amount * 2 < 0
  r3.tran_date - r2.tran_date <= 10

System Catalog

  PredicateIndex:     PredID, CanonicalForm, …
  PredicateSetIndex:  PredSetID, PredID, …
  TopologyIndex:      NodeID, PredSetID, …

Processing steps: Canonicalization; Inference & Classification; Common Computation Identification
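The idea of recording canonical forms so that equivalent predicates share one catalog entry can be sketched as follows. This is a deliberately simplified stand-in: it only flips `>` and `>=` into `<` and `<=` and orders the operands of `=`, whereas the canonical forms shown on the slide move all terms to one side. The class, the tuple representation, and the reversed predicates in the second query are assumptions made for the example.

```python
FLIP = {">": "<", ">=": "<="}

def canonicalize(pred):
    """Rewrite (lhs, op, rhs) so equivalent comparisons compare equal."""
    lhs, op, rhs = pred
    if op in FLIP:
        # a > b is stored as b < a
        lhs, rhs, op = rhs, lhs, FLIP[op]
    elif op == "=" and rhs < lhs:
        # equality is symmetric, so keep operands in sorted order
        lhs, rhs = rhs, lhs
    return (lhs, op, rhs)

class PredicateIndex:
    """Maps each canonical form to a stable PredID."""

    def __init__(self):
        self._ids = {}

    def pred_id(self, pred):
        canon = canonicalize(pred)
        return self._ids.setdefault(canon, len(self._ids) + 1)

def shared_predicates(index, query_a, query_b):
    """PredIDs that both queries compute, candidates for sharing."""
    ids_a = {index.pred_id(p) for p in query_a}
    ids_b = {index.pred_id(p) for p in query_b}
    return ids_a & ids_b

query_a = [
    ("r1.type_code", "=", "1000"),
    ("r1.amount", ">", "1000000"),
    ("r2.amount * 2", ">", "r1.amount"),
]
# Same first two conditions written the other way round, plus one new one.
query_b = [
    ("1000", "=", "r1.type_code"),
    ("1000000", "<", "r1.amount"),
    ("r2.amount", ">", "500000"),
]
index = PredicateIndex()
shared = shared_predicates(index, query_a, query_b)
```

Here two of the three predicates in each query resolve to the same PredIDs, so their results would only need to be computed once in a shared query network.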


CERT: Preliminary Data Analysis - The Data

• Exploratory data for this exercise comprised a matrix of 65k rows and 24 columns, aggregated as follows:

• For every SCAN_HOUR, for every unique SCAN_ID:

record the {independent, input features - time element}

TIME DATETIME - FIRST TIME THIS (SCAN, PORT, HOST) WAS SEEN THIS HOUR

STIME DATETIME - START TIME OF THE FIRST FLOW IN THE SCAN

ETIME DATETIME - START TIME OF THE LAST FLOW IN THE SCAN

record the {independent, input features - Source details}

SRCADDR ADDRESS - SOURCE IP ADDRESS

COUNTRY CHAR - TWO-LETTER COUNTRY CODE OF THE SRC (FROM GEOIP)

UNIQ_DSTS INTEGER - NUMBER OF UNIQUE (PORT, ADDR) PAIRS SCANNED

FLOWS INTEGER - TOTAL NUMBER OF FLOWS IN THE SCAN

PKTS INTEGER - TOTAL NUMBER OF PACKETS IN THE SCAN

BYTES INTEGER - TOTAL NUMBER OF BYTES IN THE SCAN

UNIQ_PORTS INTEGER - NUMBER OF UNIQUE PORTS SCANNED

UNIQ_HOSTS INTEGER - NUMBER OF UNIQUE HOSTS SCANNED

SUBNETS INTEGER - NUMBER OF UNIQUE CLASS /24 PREFIXES SCANNED

HAS_EXPLOIT INTEGER - 1 IF ANY OF THE TARGETS "TALKED BACK"

record the {independent, input features - Destination details}

DSTPORT INTEGER - DESTINATION PORT

FLOWS INTEGER - NUMBER OF FLOWS FOR THIS (SCAN, HOUR, PORT)

PKTS INTEGER - " PACKETS "

BYTES INTEGER - " BYTES "

DSTADDR ADDRESS - DESTINATION IP ADDRESS

EXPLOIT INTEGER - 1 IF THE DESTINATION HOST "TALKED BACK" TO THE SOURCE

record the {dependent, output features - SCAN classification labels based on CERT expert heuristics}

SCAN_PROB FLOAT - PROBABILITY THAT THIS EVENT REPRESENTS A SCAN

SCAN_FP INTEGER - 0: UNKNOWN, 1: HORIZONTAL, 2: VERTICAL

SCAN_TYPE INTEGER - 0: NOT A SCAN, 1: SYN SCAN, 2: SYN-FIN SCAN, 3: NULL SCAN, 4: XMAS SCAN, 5: FIN SCAN, 6: UNIDENTIFIED SCAN

HAS_TROJAN_PORT INTEGER - 1 IF ANY DSTPORT IS USED BY A KNOWN TROJAN

IS_WORM INTEGER - 1 IF THE SCAN APPEARS TO BE A WORM
