
Scalable Data Exploration and Novelty Detection

NIMD Grand Finale PI Meeting, April 18, 2006

Main contacts:
Prof. Jaime Carbonell, Carnegie Mellon University
Dr. Santosh Ananthraman, DYNAMiX Technologies

Project ARGUS Progression

1. Fast multi-dimensional matching of structured data
  • Exact and approximate matching
  • Scalable: up to a trillion (10^12) records
  • Profile matching for streaming data: up to a million (10^6) profiles

2. Novelty detection in structured databases or data streams
  • Detection and tracking of situation-specific “alert-watch” patterns
  • Cluster analysis to establish background (normal) models
  • Cluster density and locus analysis for early detection of new pattern onset, or meaningful changes to established patterns

3. Test Applications
  • FED: Fedwire money-transfer database (simulated), which allows tracking of suspicious transaction patterns
  • MED: Massachusetts hospital admission database to detect attacks by biological agents
  • NED: Network-flow databases from two sources (CERT® at CMU and the MIT Lincoln Labs), which allow detection of network attacks such as denial of service

4. ARGUS Data Explorer – a prototype analyst interface
  • v0.8 = evaluation at SAIC RDEC PP and MITRE/NIST
  • Framework for intensive, analyst-directed data exploration
  • Challenge of harnessing the technologies into a robust, user-friendly package that helps analysts significantly reduce the size of the proverbial “haystack”
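The exact and approximate matching of structured records against stored profiles (item 1 above) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the record fields, the relative-tolerance rule for numeric fields, and the function names are hypothetical and do not describe the ARGUS matching engine, which targets far larger record and profile counts than this naive loop could handle.

```python
from typing import Dict, List, Tuple, Union

Value = Union[int, float, str]
Record = Dict[str, Value]


def exact_match(record: Record, profile: Record) -> bool:
    """True when every field named in the profile has exactly the same value."""
    return all(record.get(field) == wanted for field, wanted in profile.items())


def approximate_match(record: Record, profile: Record, tolerance: float = 0.1) -> bool:
    """Like exact_match, but numeric fields may deviate by a relative tolerance."""
    for field, wanted in profile.items():
        actual = record.get(field)
        if isinstance(wanted, (int, float)) and isinstance(actual, (int, float)):
            # Hypothetical rule: accept values within `tolerance` of the profile value.
            if abs(actual - wanted) > tolerance * max(abs(wanted), 1.0):
                return False
        elif actual != wanted:
            return False
    return True


def match_stream(records: List[Record], profiles: Dict[str, Record]) -> List[Tuple[int, str]]:
    """Return (record index, profile name) pairs for every approximate hit."""
    hits = []
    for i, record in enumerate(records):
        for name, profile in profiles.items():
            if approximate_match(record, profile):
                hits.append((i, name))
    return hits


# Illustrative data only; not taken from any ARGUS test application.
profiles = {"large_wire": {"type": "wire", "amount": 10000.0}}
stream = [
    {"type": "wire", "amount": 10000.0},   # exact hit
    {"type": "wire", "amount": 9500.0},    # within 10% of the profile amount
    {"type": "wire", "amount": 2000.0},    # too far from the profile amount
    {"type": "check", "amount": 10000.0},  # wrong transaction type
]
hits = match_stream(stream, profiles)
```

In this sketch the first two records match the profile approximately, while only the first matches it exactly.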

Role of ARGUS in a Hypothetical End-to-End Multifunctional Architecture

[Figure: architecture diagram. Labels recovered from the figure:]
• Raw data sources: Net Traffic, Annotations, RSS, Financial, Reports, Materials, Customs, News, Biometric, Other
• Structured data: Banking Transactions, Network Traffic, Hospital Admissions, Extracted News Archives, Extracted Agency Reports
• Structured Data Search: Exact, Approximate, Massive data, Streaming data
• Other labels: Data Normalization & Modeling, Text Extraction, Mobile Agents, Distributed Structured Search Engines, Novelty Detection, Profile Queries, Matched Events, Events and Alerts, Analysis Subsystem – Text & Data, Situation Assessment Validation, Analyst Collaboration, Hypothesis Evaluation, Analyst Workstation, Analyst Interface, Exploration, Query Generation, Massive Search Control, Data Source Prioritization, Active Context Control, Hypothesis Management, Analysts

Information Flow

[Figure: information-flow diagram. Labels recovered from the figure:]
• Processing steps: Create Background Model, Detect Novel Events, Select, Re-cluster, Generate Profiles, Update Profiles, Match
• Data items: Historical Data, Background Model, New Data, Novel Events, Novel Clusters, Tracked Events, New Profiles, Profiles, Alerts
• Actor: Analyst

Novelty Detection

• Objective
  – Detect the onset of novel events in incoming data streams
  – Generate alerts for the analyst (with justifications, priorities, etc.)
  – If judged significant, then track developments, else discard

• Properties
  – Need a model of “business as usual” to detect divergences from it, which is done by clustering recent history
  – Control points (tradeoff in precision-recall)
    • Degree of deviation from normalcy required (radial density functions)
    • Amount of data support (e.g. number of observations) before alerting
    • Statistical model of normal “noise” in data streams
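The first two control points above can be made concrete with a small sketch. It is only an illustration under stated assumptions: plain k-means stands in for the clustering of recent history, the root-mean-square distance of members from their centroid stands in for a radial density function, and the names `max_deviation` and `min_support` are hypothetical. The third control point (a statistical noise model) is not modeled here.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def distance(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_index(p: Point, centroids: List[List[float]]) -> int:
    return min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))


def kmeans(points: List[Point], k: int, iterations: int = 20) -> List[List[float]]:
    """Plain k-means with a deterministic start (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            groups[nearest_index(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster empties out
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def build_background(points: List[Point], k: int) -> List[Tuple[List[float], float]]:
    """Background model: one (centroid, radial spread) pair per cluster."""
    centroids = kmeans(points, k)
    model = []
    for i, c in enumerate(centroids):
        radii = [distance(p, c) for p in points if nearest_index(p, centroids) == i]
        spread = math.sqrt(sum(r * r for r in radii) / len(radii)) if radii else 0.0
        model.append((c, spread))
    return model


def detect_novel(model, new_points, max_deviation=3.0, min_support=3):
    """Return (alert, novel_points); alert only with enough supporting observations."""
    novel = []
    for p in new_points:
        centroid, spread = min(model, key=lambda m: distance(p, m[0]))
        if distance(p, centroid) > max_deviation * max(spread, 1e-9):
            novel.append(p)
    return len(novel) >= min_support, novel


# Illustrative data: two "business as usual" clusters, then a batch with a new group.
history = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10), (1, 1), (11, 11)]
model = build_background(history, k=2)
batch = [(0.5, 0.8), (5, 5), (5.2, 4.9), (4.8, 5.1)]
alert, novel = detect_novel(model, batch)
```

Raising `max_deviation` or `min_support` trades recall for precision: fewer alerts fire, but each one rests on a larger departure from the background model or on more observations.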

Cluster Evolution and Density Change Detection

[Figure: four panels titled Constant Event, New Unobfuscated Event, New Obfuscated Event, and Growing Event]
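One way a density change might be quantified is sketched below. It rests on assumptions: density is taken as the number of points per unit volume inside a fixed ball around a cluster center, and the change is measured relative to an earlier time window. The slide text does not define the four event types, so the sketch does not attempt to classify them, and the function names are hypothetical.

```python
import math


def density(points, center, radius):
    """Points per unit volume inside a d-dimensional ball around `center`."""
    d = len(center)
    inside = sum(1 for p in points if math.dist(p, center) <= radius)
    volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d
    return inside / volume


def density_change(before, after, center, radius):
    """Relative density change between two windows; inf if the region was empty."""
    old = density(before, center, radius)
    new = density(after, center, radius)
    if old == 0:
        return math.inf if new > 0 else 0.0
    return (new - old) / old


# Illustrative windows: the region around (0, 0) doubles in population,
# while the region around (5, 5) stays constant.
window_t0 = [(0, 0), (0.5, 0), (0, 0.5), (5, 5)]
window_t1 = [(0, 0), (0.5, 0), (0, 0.5), (0.2, 0.2), (0.1, 0.4), (0.3, 0.1), (5, 5)]
growing = density_change(window_t0, window_t1, center=(0, 0), radius=1.0)
constant = density_change(window_t0, window_t1, center=(5, 5), radius=1.0)
```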

ARGUS Data Explorer Implementation

• A Client-Server System
  – Data Explorer Client
    • GUI components, i.e., DYNAMiX’s 2-D GUI embedded with ManTech’s 3-D GUI module
    • Interface connecting the GUI components to the Web Service API
  – ARGUS Server
    • Web Service API that delegates tasks to the application layer
    • Application Layer, which encompasses the core application functionality including clustering, novelty detection, re-clustering, exact and approximate matching
    • Data Access Layer, which includes application functionality such as set operations and data exchanges between the application layer and the database
    • DYNAMiX iX server used as a component for matching structured data
    • Data store (database)

ARGUS Data Explorer

• Typical Hardware
  – Server
    • Processor: Intel® Xeon™, 3.0 GHz, 2 MB Cache
    • Memory: 8 GB DDR2 400 MHz (4 x 2 GB), Dual Ranked DIMMs
    • Disk Space: 300 GB
  – Client
    • Processor: Pentium 3 or higher
    • Memory: 512 MB
    • Disk Space: 100 MB
    • Graphics subsystem: High-performance, supporting 1280 x 1024 resolution

• Typical Software
  – Server
    • Operating System: Red Hat Enterprise Linux ES v4
    • Other: Oracle 9i; JSDK V1.4.2 (Tomcat); DYNAMiX iX
  – Client
    • Operating System: Windows 2000; Mac OS 10.4
    • Other: Java3D; JRE V1.4.2_11 (Swing Application)

ARGUS Data Explorer

[Two slides with this title; no slide text recovered]

3-D Data Cluster Display

3-D data and cluster display region. In this region the display will also show the X, Y, and Z axes along with axis labels for the quantitative data dimensions chosen by the user.

Slider to control rotation about the vertical (Y) axis. This allows the user to rotate the data set to obtain a favorable view of the clusters of greatest interest.

Slider to control rotation about the horizontal (X) axis. This allows the user to rotate the data set to obtain a favorable view of the clusters of greatest interest.
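The effect of the two rotation sliders can be sketched with standard rotation matrices. This is a minimal illustration, not the Data Explorer's rendering code: the right-handed axis convention, the order in which the two rotations are applied, and the function names are assumptions.

```python
import math


def rotate_y(point, degrees):
    """Rotate (x, y, z) about the vertical Y axis (right-handed convention)."""
    x, y, z = point
    a = math.radians(degrees)
    return (x * math.cos(a) + z * math.sin(a), y, -x * math.sin(a) + z * math.cos(a))


def rotate_x(point, degrees):
    """Rotate (x, y, z) about the horizontal X axis (right-handed convention)."""
    x, y, z = point
    a = math.radians(degrees)
    return (x, y * math.cos(a) - z * math.sin(a), y * math.sin(a) + z * math.cos(a))


def apply_view(points, y_slider_deg, x_slider_deg):
    """Apply the Y-axis rotation first, then the X-axis rotation (assumed order)."""
    return [rotate_x(rotate_y(p, y_slider_deg), x_slider_deg) for p in points]


# Illustrative points on the X and Y axes, viewed after turning both sliders to 90°.
view = apply_view([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], y_slider_deg=90, x_slider_deg=90)
```

Rotations preserve distances, so turning the sliders changes only the viewpoint, never the shape or spacing of the clusters.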

Cluster Control

Cluster selection controls allow the user to select any cluster and then hide non-selected clusters to focus on relevant data.

Cluster size control allows the user to adjust the radius of the transparent sphere to be one or three standard deviations.
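A minimal sketch of the two controls follows. The slide does not say how the standard deviation is computed, so the sketch assumes it is the root-mean-square distance of a cluster's members from their centroid; the function names and the sample clusters are hypothetical.

```python
import math


def cluster_sphere(points, n_sigma=1):
    """Return (centroid, radius) of the sphere drawn around a cluster.

    Assumption: one standard deviation is the RMS distance of the
    members from the centroid.
    """
    if n_sigma not in (1, 3):
        raise ValueError("the control offers one or three standard deviations")
    n = len(points)
    centroid = tuple(sum(coord) / n for coord in zip(*points))
    variance = sum(math.dist(p, centroid) ** 2 for p in points) / n
    return centroid, n_sigma * math.sqrt(variance)


def visible_clusters(clusters, selected):
    """Hide every cluster whose name is not in the selection."""
    return {name: pts for name, pts in clusters.items() if name in selected}


# Illustrative clusters only.
clusters = {
    "A": [(0, 0, 0), (2, 0, 0), (0, 2, 0), (2, 2, 0)],
    "B": [(9, 9, 9), (10, 10, 10)],
}
shown = visible_clusters(clusters, selected={"A"})
center, radius_1 = cluster_sphere(shown["A"], n_sigma=1)
_, radius_3 = cluster_sphere(shown["A"], n_sigma=3)
```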

Novelty Detection

[Figure: three panels labeled T0 (0-15°), T1 (16-30°), and T2 (31-45°)]


ARGUS Data Explorer: Menu Items

DEMO PROBLEM: The MIT LL Test Dataset

• Network-flow dataset from the MIT Lincoln Labs, used to experiment with the detection of network attacks

• 805,049 records x 42 independent fields (derived features of raw data) x 1 dependent field (post hoc classification label)

• The dependent field is a discrete value representing Connection Type
  – 0 is normal
  – 1, 2, 3 and 4 are malicious (1:probe, 2:denial_of_service, 3:user_to_root, and 4:remote_to_local)

• The goal is to use the ARGUS Data Explorer to learn a “normal” background model and then detect the onset of “malicious” attacks
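The label coding above can be written down directly. The sketch below encodes the five Connection Type values and splits records into normal and malicious sets, for example to build the background model from normal traffic only. The field names `RECID` and `CONNECTION_TYPE` follow the dataset's field list; the sample records and helper names are made up for illustration.

```python
from enum import IntEnum


class ConnectionType(IntEnum):
    NORMAL = 0
    PROBE = 1
    DENIAL_OF_SERVICE = 2
    USER_TO_ROOT = 3
    REMOTE_TO_LOCAL = 4


def is_malicious(code: int) -> bool:
    """Every label other than 0 (normal) denotes an attack."""
    return ConnectionType(code) is not ConnectionType.NORMAL


def split_by_label(records, label_field="CONNECTION_TYPE"):
    """Separate records into (normal, malicious) lists by the dependent field."""
    normal, malicious = [], []
    for record in records:
        (malicious if is_malicious(record[label_field]) else normal).append(record)
    return normal, malicious


# Made-up records carrying only the id and the label.
records = [
    {"RECID": 1, "CONNECTION_TYPE": 0},
    {"RECID": 2, "CONNECTION_TYPE": 2},
    {"RECID": 3, "CONNECTION_TYPE": 4},
]
normal, malicious = split_by_label(records)
```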

DEMO PROBLEM: The MIT LL Dataset

MITLLT0 = 494,020 records x 43 fields
MITLLT1 = 311,029 records x 43 fields

| # | Name | Ex Recd | Ex Recd | Description | Type |
|---|------|---------|---------|-------------|------|
| 1 | RECID | 1 | 2 | id | discrete |
| 2 | DURATION | 0 | 0 | length (number of seconds) of the connection | continuous |
| 3 | PROTOCOL_TYPE | 2 | 2 | type of the protocol, e.g. tcp, udp, etc. | discrete |
| 4 | SERVICE | 23 | 23 | network service on the destination, e.g., http, telnet, etc. | discrete |
| 5 | FLAG | 10 | 10 | normal or error status of the connection | discrete |
| 6 | SRC_BYTES | 181 | 239 | number of data bytes from source to destination | continuous |
| 7 | DST_BYTES | 5450 | 486 | number of data bytes from destination to source | continuous |
| 8 | LAND | 0 | 0 | 1 if connection is from/to the same host/port; 0 otherwise | discrete |
| 9 | WRONG_FRAGMENT | 0 | 0 | number of "wrong" fragments | continuous |
| 10 | URGENT | 0 | 0 | number of urgent packets | continuous |
| 11 | HOT | 0 | 0 | number of "hot" indicators | continuous |
| 12 | NUM_FAILED_LOGINS | 0 | 0 | number of failed login attempts | continuous |
| 13 | LOGGED_IN | 1 | 1 | 1 if successfully logged in; 0 otherwise | discrete |
| 14 | NUM_COMPROMISED | 0 | 0 | number of "compromised" conditions | continuous |
| 15 | ROOT_SHELL | 0 | 0 | 1 if root shell is obtained; 0 otherwise | discrete |
| 16 | SU_ATTEMPTED | 0 | 0 | 1 if "su root" command attempted; 0 otherwise | discrete |
| 17 | NUM_ROOT | 0 | 0 | number of "root" accesses | continuous |
| 18 | NUM_FILE_CREATIONS | 0 | 0 | number of file creation operations | continuous |
| 19 | NUM_SHELLS | 0 | 0 | number of shell prompts | continuous |
| 20 | NUM_ACCESS_FILES | 0 | 0 | number of operations on access control files | continuous |
| 21 | NUM_OUTBOUND_CMDS | 0 | 0 | number of outbound commands in an ftp session | continuous |
| 22 | IS_HOT_LOGIN | 0 | 0 | 1 if the login belongs to the "hot" list; 0 otherwise | discrete |
| 23 | IS_GUEST_LOGIN | 0 | 0 | 1 if the login is a "guest" login; 0 otherwise | discrete |
| 24 | COUNT | 8 | 8 | number of connections to the same host as the current connection in the past two seconds | continuous |
| 25 | SRV_COUNT | 8 | 8 | number of connections to the same service as the current connection in the past two seconds | continuous |
| 26 | SERROR_RATE | 0 | 0 | % of connections that have "SYN" errors | continuous |
| 27 | SRV_SERROR_RATE | 0 | 0 | % of connections that have "SYN" errors | continuous |
| 28 | RERROR_RATE | 0 | 0 | % of connections that have "REJ" errors | continuous |
| 29 | SRV_RERROR_RATE | 0 | 0 | % of connections that have "REJ" errors | continuous |
| 30 | SAME_SRV_RATE | 1 | 1 | % of connections to the same service | continuous |
| 31 | DIFF_SRV_RATE | 0 | 0 | % of connections to different services | continuous |
| 32 | SRV_DIFF_HOST_RATE | 0 | 0 | % of connections to different hosts | continuous |
| 33 | DST_HOST_COUNT | 9 | 19 | | continuous |
| 34 | DST_HOST_SRV_COUNT | 9 | 19 | | continuous |
| 35 | DST_HOST_SAME_SRV_RATE | 1 | 1 | | continuous |
| 36 | DST_HOST_DIFF_SRV_RATE | 0 | 0 | | continuous |
| 37 | DST_HOST_SAME_SRC_PORT_RATE | 0 | 0 | | continuous |
| 38 | DST_HOST_SRV_DIFF_HOST_RATE | 0 | 0 | | continuous |
| 39 | DST_HOST_SERROR_RATE | 0 | 0 | | continuous |
| 40 | DST_HOST_SRV_SERROR_RATE | 0 | 0 | | continuous |
| 41 | DST_HOST_RERROR_RATE | 0 | 0 | | continuous |
| 42 | DST_HOST_SRV_RERROR_RATE | 0 | 0 | | continuous |
| 43 | CONNECTION_TYPE | 0 | 0 | 0:normal, 1:probe, 2:denial_of_service, 3:user_to_root, 4:remote_to_local | discrete |

DEMO PROBLEM: The MIT LL Test Dataset

• Recipe
  – Learn the baseline “normal” set of clusters
  – Cluster the incremental data, including “malicious” records, into this baseline
  – Assess quantitative and qualitative changes in clusters and test the capability of the system to detect novelties and alert the user
  – Iterate through this “learn ↔ test” cycle dynamically over time, progressively building domain knowledge based on data empirics

• Results
  – Please visit us at our DEMO & POSTER session on April 19th between 0800-1300
  – We will demonstrate successful alerting, by cluster density changes (as in the adjacent graph), on denial of service and unauthorized entry attempts
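The “learn ↔ test” recipe can be sketched as a loop over incoming batches. This is a simplified illustration under stated assumptions: the baseline centroids are taken as given and stay fixed (no re-clustering), a shift in each cluster's share of records stands in for a cluster density change, and `max_shift` and the function names are hypothetical.

```python
import math
from collections import Counter


def assign(points, centroids):
    """Count how many points fall nearest to each baseline centroid."""
    counts = Counter({i: 0 for i in range(len(centroids))})
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        counts[nearest] += 1
    return counts


def shares(counts):
    """Fraction of all records that belongs to each cluster."""
    total = sum(counts.values())
    return {i: (n / total if total else 0.0) for i, n in counts.items()}


def learn_test_cycle(baseline_points, centroids, batches, max_shift=0.2):
    """Yield (batch index, clusters whose record share shifted by more than max_shift)."""
    learned = assign(baseline_points, centroids)           # learn the baseline
    for t, batch in enumerate(batches):
        before = shares(learned)
        after = shares(assign(batch, centroids))           # cluster the new data
        changed = sorted(i for i in before if abs(after[i] - before[i]) > max_shift)
        yield t, changed                                    # assess changes and alert
        learned.update(assign(batch, centroids))            # fold the batch into the baseline


# Illustrative data: the second batch is dominated by one cluster.
centroids = [(0.0, 0.0), (10.0, 10.0)]
baseline = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
batches = [
    [(0, 0), (1, 1), (10, 10), (11, 11)],
    [(0, 1)] + [(10, 10)] * 9,
]
alerts = list(learn_test_cycle(baseline, centroids, batches))
```

Each pass first tests the new batch against what has been learned so far and only then absorbs it, so an alert always reflects a departure from earlier data.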

ARGUS Data Explorer: Challenges

• Goal
  – The Data Explorer should be a robust, intuitive, user-friendly system that provides decision support for an analyst who is an expert in the problem domain, but not an expert in advanced statistics and pattern recognition technologies

• Challenges
  – Human-in-the-loop automation of the underlying algorithms
  – Ensuring the transfer of maximally relevant data through the “data pipe” connecting the server to the client
  – Balancing act: iterative historical batch and near-real-time processing
  – Iterative scaling of the application: increase the amount of data handled; ensure user expectation is still maintained; repeat cycle
  – Setting the right expectation: a “work in progress” prototype

ARGUS Data Explorer: Where do we go from here?

• Combining the “top-down” and “bottom-up” analysis approaches in ARGUS II

• “Top-down” – Hypothesis creation using first principles and process heuristics

• “Bottom-up” – Probabilistic hypothesis validation and refutation using data empirics
