NIMD
Exploring Massive Structured Data with ARGUS
PI Meeting, November 29, 2005
Main contacts: Prof. Jaime Carbonell – [email protected]
Dr. Santosh Ananthraman – [email protected]
Project ARGUS Objectives
1. Novelty detection in structured databases or data streams
• Detect and track situation-specific “alert-watch” patterns
• Cluster analysis to establish background (normal) models
• Cluster density and locus analysis for early detection of new pattern onset, or meaningful change to an established pattern
2. Data Explorer - analyst interface
• Framework for intensive, analyst-directed data exploration
• Applications
  • MED: Massachusetts hospital admission database to detect attacks by biological agents
  • NED: Network anomaly/attack detection with CERT®, the federally funded computer security incident response center at CMU
3. Fast multi-dimensional structured-data matching
• Exact and approximate matching
• Scalable: O(10^6) to O(10^12) records
• Profile matching for streaming data: O(1) to O(10^6) profiles
Role of ARGUS in Hypothetical End-to-End Multifunctional Architecture
[Figure: block diagram of the hypothetical end-to-end architecture. Labeled elements: Analysts; Analyst Workstation; Analyst Interface; Exploration; Query Generation; Massive Search Control; Data Source Prioritization; Active Context Control; Hypothesis Management; Profile Queries; Matched Events; Events and Alerts; Analysis Subsystem – Text & Data; Situation Assessment Validation; Analyst Collaboration; Hypothesis Evaluation; Distributed Structured Search Engines; Structured Data Search (exact, approximate, massive data, streaming data); Novelty Detection; Text Extraction; Mobile Agents; Data Normalization & Modeling; Structured Data (Banking Transactions, Network Traffic, Hospital Admissions, Extracted News Archives, Extracted Agency Reports); Raw Data (Net Traffic, Annotations, RSS, Financial, Reports, Materials, Customs, News, Biometric, Other)]
Novelty Detection
• Objective:
– Detect the onset of novel events in incoming data streams
– Generate an alert for the analyst (with justification)
– If judged significant, track developments; otherwise discard
• Properties
– Need a model of “business as usual” to detect divergences from it (built by clustering recent history)
– Control points (precision-recall tradeoff)
  • Degree of deviation from normalcy required
  • Amount of data support (e.g. # of observations) before alerting
  • Statistical model of normal “noise” in data streams
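The three control points above can be illustrated with a minimal sketch. It assumes each new observation has already been reduced to a distance from the background model, and that "noise" is summarized by a single standard deviation. The names `ControlPoints`, `min_deviation`, `min_support` and `should_alert` are hypothetical and do not come from ARGUS.

```python
from dataclasses import dataclass


@dataclass
class ControlPoints:
    # Hypothetical knobs mirroring the slide's control points.
    min_deviation: float  # required deviation from normal, in noise std-devs
    min_support: int      # number of deviating observations needed to alert


def should_alert(distances, noise_std, cp):
    """Return True when enough observations deviate far enough from normal.

    distances: distance of each new observation from the background model.
    noise_std: assumed standard deviation of normal "noise" (an assumption).
    """
    deviating = [d for d in distances if d / noise_std >= cp.min_deviation]
    return len(deviating) >= cp.min_support


# Strict settings favor precision; loose settings favor recall.
strict = ControlPoints(min_deviation=3.0, min_support=5)
loose = ControlPoints(min_deviation=2.0, min_support=2)
observations = [0.5, 2.4, 2.6, 3.5, 0.9]

print(should_alert(observations, 1.0, strict))
print(should_alert(observations, 1.0, loose))
```

Raising either threshold reduces false alerts at the cost of later or missed detections, which is the precision-recall tradeoff the slide names.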
Cluster Evolution and Density Change Detection
[Figure: four panels illustrating cluster evolution: Constant Event, New Unobfuscated Event, New Obfuscated Event, Growing Event]
Sample Application: Monitoring for Bioterrorism
• Database of all Massachusetts hospital stays discharged between 10/2000 and 9/2001 (835,895 records)
• 18 fields per record, including:
– provider (hospital)
– patient (gender, age, birthdate, race, ZIP)
– timing (admit date, length of stay)
– diagnoses (up to 8, with one primary)
– payment source
• Cluster to form background models
• Inject new streaming data that may include potential threats (e.g. SARS, Anthrax, toxin-based attack, …)
• New Mini-Cluster Analysis reveals outbreaks of:
– Tularemia
– Dengue Fever
– Myiasis
– Chagas Disease
• SARS Outbreak simulation
– Added new records for patients from a small geographical region diagnosed with influenza in 9/2001
– Graph shows the resulting secondary peak in the pulmonary disease density function
CERT Collaboration
• Working with CERT on NetFlow data for scalable detection of network attack patterns (viruses, denial of service, unauthorized entry attempts, etc.)
CERT: Preliminary Data Analysis
• Principal component analysis is used for data reduction: the 11 input features are reduced to 3 principal components (PC1, PC2 and PC3), which capture 54%, 25% and 13% of the variance in the original 11 features, respectively (92% in total)
• For example, PC2 is composed mainly of DST FLOWS, PKTS, and BYTES, and PC3 mainly of UNIQ_PORTS, SUBNETS and DSTPORT
• Clustering in principal-component space is used to explore automatically generated aggregations and abstractions of the data for meaningful matching and pattern detection
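The reduction from 11 features to 3 principal components can be sketched as follows. This is an illustration on synthetic data, not the CERT data, and it assumes plain mean-centering with no rescaling, which the slide does not specify. The function name `pca_explained_variance` is hypothetical.

```python
import numpy as np


def pca_explained_variance(X, n_components):
    """PCA via SVD of the mean-centered data matrix.

    Returns (scores, variance_ratio, loadings) for the leading components.
    """
    Xc = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    variance = s ** 2 / (X.shape[0] - 1)
    ratio = variance / variance.sum()
    scores = Xc @ vt[:n_components].T
    return scores, ratio[:n_components], vt[:n_components]


# Synthetic stand-in: 200 rows, 11 features, three of them dominant.
rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 2.0] + [0.3] * 8)
X = rng.normal(size=(200, 11)) * scales

scores, ratio, loadings = pca_explained_variance(X, n_components=3)
print("variance captured per component:", np.round(ratio, 2))
# The largest absolute loadings show which features dominate a component.
print("dominant feature index for PC2:", int(np.argmax(np.abs(loadings[1]))))
```

Inspecting the rows of `loadings` is how one would arrive at statements such as "PC2 is composed mainly of ..." for a real data set.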
Scalable Matcher
• In-memory matchers: faster than 1/100th of a second per match
• Disk matcher: faster than 1/10th of a second per match until the disk-access barrier, then 1 second per match above 10^8 records (on a 2-year-old processor)
Matcher Version | Record Volume | Time Complexity       | Status
In-memory       | 10^6 to 10^8  | Logarithmic           | Mature
Disk-based      | 10^7 to 10^10 | Low power-law         | Algorithmically stable
Distributed     | 10^9 to 10^12 | As underlying matcher | Initial prototype only
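The slides do not describe the matcher's algorithm. As a generic illustration of how exact and approximate lookups can run in logarithmic time, the sketch below uses binary search over a sorted in-memory index on a single numeric field. The class `SortedIndex` and its methods are hypothetical, and the real ARGUS matcher is multi-dimensional.

```python
import bisect


class SortedIndex:
    """Hypothetical in-memory index over one numeric field of each record."""

    def __init__(self, records, key):
        self._rows = sorted(records, key=key)
        self._keys = [key(r) for r in self._rows]

    def exact(self, value):
        # Two binary searches bound the run of equal keys: O(log n).
        lo = bisect.bisect_left(self._keys, value)
        hi = bisect.bisect_right(self._keys, value)
        return self._rows[lo:hi]

    def approximate(self, value, tolerance):
        # Approximate match = all keys within +/- tolerance of the value.
        lo = bisect.bisect_left(self._keys, value - tolerance)
        hi = bisect.bisect_right(self._keys, value + tolerance)
        return self._rows[lo:hi]


records = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 250},
    {"id": 3, "amount": 260},
    {"id": 4, "amount": 900},
]
index = SortedIndex(records, key=lambda r: r["amount"])
print([r["id"] for r in index.exact(250)])
print([r["id"] for r in index.approximate(255, tolerance=10)])
```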
Matching Data Streams to Profiles
[Figure: data-flow diagram. Labeled elements: Data Streams, Novelty Detection, Analyst, Matcher, Profiles, Novel Events, Alerts, New Profiles]
Profile = “alert-watch” pattern
• Generated by the analyst, or
• Produced by novelty detection and then vetted
• Need rapid matching for 10^5+ simultaneously active profiles
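With very many active profiles, testing every profile against every incoming record is too slow. One common approach, shown here only as an assumed illustration and not as the ARGUS design, is to index profiles by an equality condition so each record is checked against a small candidate set. `ProfileIndex` and the profile names are hypothetical; the field names `type_code` and `amount` are borrowed from the Fedwire predicates later in the deck.

```python
from collections import defaultdict


class ProfileIndex:
    """Hypothetical index of profiles keyed on one equality field."""

    def __init__(self, key_field):
        self.key_field = key_field
        self._by_key = defaultdict(list)

    def add(self, profile_id, key_value, predicate):
        # predicate: callable(event) -> bool for the remaining conditions.
        self._by_key[key_value].append((profile_id, predicate))

    def match(self, event):
        # Only profiles sharing the event's key value are evaluated.
        candidates = self._by_key.get(event[self.key_field], [])
        return [pid for pid, predicate in candidates if predicate(event)]


index = ProfileIndex(key_field="type_code")
index.add("large-transfer", 1000, lambda e: e["amount"] > 1_000_000)
index.add("mid-transfer", 1000, lambda e: e["amount"] > 500_000)
index.add("other-type", 2000, lambda e: True)

stream = [
    {"type_code": 1000, "amount": 750_000},
    {"type_code": 1000, "amount": 2_000_000},
    {"type_code": 3000, "amount": 10},
]
alerts = [index.match(event) for event in stream]
print(alerts)
```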
Profile Sharing Framework
[Figure: profile sharing framework diagram. Labeled elements: Analyst, Query, ARGUS Query Network Manager, System Catalog, Query Network, Data Streams, Data Tables, DynamixMatcher, Identified Threats]
Evaluation
MED: Bio-surveillance
FED: Fedwire suspicious transaction pattern tracking
[Figure: two charts plotted against # of queries (100 to 565 and 100 to 768), each comparing NonJoinS, MatchPlan+NCanon and AllSharing]
AvgTime/Query with 565 queries, in seconds: NonJoinS: 0.20, MatchPlan+NCanon: 0.12, AllSharing: 0.11
AvgTime/Query with 768 queries, in seconds: NonJoinS: 0.25, MatchPlan+NCanon: 0.12, AllSharing: 0.04
ARGUS Achievements: Summary
• Solid scientific underpinnings
– Efficient algorithms for approximate search and exploration
– Efficient matching of complex patterns on streaming data
– Novelty detection via radial cluster-density function analysis
• Prototype development
– User validation of utility of techniques (at NIST)
– Analyst GUI - Data Explorer (under development)
– Applications
  • MED: Massachusetts hospital admission database for detection of attacks by biological agents
  • FED: Fedwire Money Transfer database for suspicious transaction pattern tracking
  • NED: NetFlow database from CERT® for scalable detection of network attack patterns
• Sufficient progress to interest operational IC
– Exploring collaboration with GDAIS (their client has >10^8 transactions daily, >10^10 records total)
– Getting ready for stage 1 RDEC insertion
Novelty Detection
Functionality
• Build background model
– Expected Events (clusters)
• Find divergences
– Individual outliers (but many false positives)
– New Mini-clusters (more reliable, unobfuscated new-event detection)
– Detect when a novel event is masked by ordinary happenings or intentionally obfuscated
• Trigger Alerts
– Route & Prioritize
– Formulate hypotheses for Analyst
Technology
• Modeling methods
– (Hierarchical) k-means
• Divergence metrics
– Radial density gradients from cluster centroid
– Temporally-adaptive distance measures
– Secondary peaks in density function
• Create analyst profiles
– RETE-based SAMs methods (last PI-meeting ARGUS paper)
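The divergence metrics listed above can be illustrated with a small sketch. It assumes a cluster centroid is already available (for example from k-means), estimates density in concentric shells around it, and reports local maxima other than the main peak. The function names, the shell binning and the toy data are assumptions for illustration, not the ARGUS implementation.

```python
import numpy as np


def radial_density(points, centroid, bin_edges):
    """Points per unit shell volume as a function of distance from centroid."""
    radii = np.linalg.norm(points - centroid, axis=1)
    counts, _ = np.histogram(radii, bins=bin_edges)
    dim = points.shape[1]
    # Shell volume is proportional to r_outer^dim - r_inner^dim.
    shell_volume = bin_edges[1:] ** dim - bin_edges[:-1] ** dim
    return counts / shell_volume


def secondary_peaks(density):
    """Indices of interior local maxima other than the global peak."""
    main = int(np.argmax(density))
    return [
        i for i in range(1, len(density) - 1)
        if i != main and density[i] > density[i - 1] and density[i] > density[i + 1]
    ]


def ring(radius, n, angle_lo=0.0, angle_hi=2 * np.pi):
    """Toy helper: n points at a fixed radius within an angular range."""
    theta = np.linspace(angle_lo, angle_hi, n, endpoint=False)
    return np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])


# Background cluster thinning out from the centroid, plus a compact
# mini-cluster injected farther away (a stand-in for a new event).
background = np.vstack([ring(0.5, 40), ring(1.5, 30), ring(2.5, 10)])
mini_cluster = ring(5.5, 30, angle_lo=0.0, angle_hi=0.3)
points = np.vstack([background, mini_cluster])

edges = np.arange(0.0, 8.0, 1.0)
density = radial_density(points, centroid=np.zeros(2), bin_edges=edges)
print(np.round(density, 2))
print("secondary peaks at shells:", secondary_peaks(density))
```

A secondary peak appearing where the background density had fallen to near zero is the kind of signal the slide associates with a new mini-cluster.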
ARGUS Query Network Manager
[Figure: component diagram of the ARGUS Query Network Manager. Labeled elements: Query, Query Network, Coordinator, System Catalog, Common Computation Identifier, Sharing Optimizer, Projection Manager, Network Topology & Operation Manager, Query Rewriter, Query Optimizer, Code Assembler]
Recording & Identifying Common Comps
Predicate set 1:
r2.type_code = 1000
r3.type_code = 1000
r1.type_code = 1000
r1.amount > 1000000
r1.rbank_aba = r2.sbank_aba
r1.benef_account = r2.orig_account
r2.amount * 2 > r1.amount
r1.tran_date <= r2.tran_date
r2.tran_date <= r1.tran_date + 10
r2.rbank_aba = r3.sbank_aba
r2.benef_account = r3.orig_account
r2.amount = r3.amount
r2.tran_date <= r3.tran_date
r3.tran_date <= r2.tran_date + 10
Predicate set 2:
r1.type_code = 1000
r1.amount > 1000000
r2.type_code = 1000
r2.amount > 500000
r3.type_code = 1000
r3.amount > 500000
r1.rbank_aba = r2.sbank_aba
r1.benef_account = r2.orig_account
r2.amount * 2 > r1.amount
r1.tran_date <= r2.tran_date
r2.tran_date <= r1.tran_date + 10
r2.rbank_aba = r3.sbank_aba
r2.benef_account = r3.orig_account
r2.amount = r3.amount
r2.tran_date <= r3.tran_date
r3.tran_date <= r2.tran_date + 10
Canonicalized predicates (examples):
r1.amount - r2.amount * 2 < 0
r3.tran_date - r2.tran_date <= 10
System Catalog
PredicateIndex: PredID, CanonicalForm, …
PredicateSetIndex: PredSetID, PredID, …
TopologyIndex: NodeID, PredSetID, …
Canonicalization
Inference & Classification
Common Computation Identification
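The slide gives two canonical forms but not the rewriting rules. The sketch below shows one plausible canonicalization for linear predicates, under stated assumptions: predicates are supplied as coefficient dictionaries rather than parsed from text, all terms are moved to the left-hand side, and `>` / `>=` are flipped to `<` / `<=`. The function `canonicalize` and the `CONST` key are hypothetical, and scaling (e.g. `2a < 4b` vs `a < 2b`) is not handled.

```python
CONST = "1"  # hypothetical key used for a constant term


def canonicalize(lhs, op, rhs):
    """Rewrite `lhs op rhs` as (sorted terms, op, bound) with op in <, <=, =.

    lhs, rhs: dicts mapping a column name (or CONST) to its coefficient.
    """
    coeffs = {}
    for term, c in lhs.items():
        coeffs[term] = coeffs.get(term, 0) + c
    for term, c in rhs.items():
        coeffs[term] = coeffs.get(term, 0) - c
    bound = -coeffs.pop(CONST, 0)  # constant moves to the right-hand side
    if op in (">", ">="):
        coeffs = {t: -c for t, c in coeffs.items()}
        bound = -bound
        op = "<" if op == ">" else "<="
    terms = tuple(sorted((t, c) for t, c in coeffs.items() if c != 0))
    if op == "=" and terms and terms[0][1] < 0:
        # Fix the sign so that `a = b` and `b = a` coincide.
        terms = tuple((t, -c) for t, c in terms)
        bound = -bound
    return terms, op, bound


# Two queries that write the same condition differently.
query_a = {
    canonicalize({"r2.amount": 2}, ">", {"r1.amount": 1}),
    canonicalize({"r3.tran_date": 1}, "<=", {"r2.tran_date": 1, CONST: 10}),
}
query_b = {
    canonicalize({"r1.amount": 1}, "<", {"r2.amount": 2}),
    canonicalize({"r1.amount": 1}, ">", {CONST: 1000000}),
}
shared = query_a & query_b  # candidates for a shared computation node
print(shared)
```

Once predicates are in a canonical form, identifying common computations across queries reduces to looking up equal keys, which is the role the catalog's predicate index would play.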
CERT: Preliminary Data Analysis - The Data
• The exploratory data for this exercise was a matrix of 65k rows and 24 columns, aggregated as follows:
• For every SCAN_HOUR, for every unique SCAN_ID:
record the {independent, input features - time element}
TIME DATETIME - FIRST TIME THIS (SCAN, PORT, HOST) WAS SEEN THIS HOUR
STIME DATETIME - START TIME OF THE FIRST FLOW IN THE SCAN
ETIME DATETIME - START TIME OF THE LAST FLOW IN THE SCAN
record the {independent, input features - Source details}
SRCADDR ADDRESS - SOURCE IP ADDRESS
COUNTRY CHAR - TWO-LETTER COUNTRY CODE OF THE SRC (FROM GEOIP)
UNIQ_DSTS INTEGER - NUMBER OF UNIQUE (PORT, ADDR) PAIRS SCANNED
FLOWS INTEGER - TOTAL NUMBER OF FLOWS IN THE SCAN
PKTS INTEGER - TOTAL NUMBER OF PACKETS IN THE SCAN
BYTES INTEGER - TOTAL NUMBER OF BYTES IN THE SCAN
UNIQ_PORTS INTEGER - NUMBER OF UNIQUE PORTS SCANNED
UNIQ_HOSTS INTEGER - NUMBER OF UNIQUE HOSTS SCANNED
SUBNETS INTEGER - NUMBER OF UNIQUE CLASS /24 PREFIXES SCANNED
HAS_EXPLOIT INTEGER - 1 IF ANY OF THE TARGETS "TALKED BACK"
record the {independent, input features - Destination details}
DSTPORT INTEGER - DESTINATION PORT
FLOWS INTEGER - NUMBER OF FLOWS FOR THIS (SCAN, HOUR, PORT)
PKTS INTEGER - NUMBER OF PACKETS FOR THIS (SCAN, HOUR, PORT)
BYTES INTEGER - NUMBER OF BYTES FOR THIS (SCAN, HOUR, PORT)
DSTADDR ADDRESS - DESTINATION IP ADDRESS
EXPLOIT INTEGER - 1 IF THE DESTINATION HOST "TALKED BACK" TO THE SOURCE
record the {dependent, output features - SCAN classification labels based on CERT expert heuristics}
SCAN_PROB FLOAT - PROBABILITY THAT THIS EVENT REPRESENTS A SCAN
SCAN_FP INTEGER - 0: UNKNOWN, 1: HORIZONTAL, 2: VERTICAL
SCAN_TYPE INTEGER - 0: NOT A SCAN, 1: SYN SCAN, 2: SYN-FIN SCAN,
3: NULL SCAN, 4: XMAS SCAN, 5: FIN SCAN,
6: UNIDENTIFIED SCAN
HAS_TROJAN_PORT INTEGER - 1 IF ANY DSTPORT IS USED BY A KNOWN TROJAN
IS_WORM INTEGER - 1 IF THE SCAN APPEARS TO BE A WORM
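The per-scan count fields above can be derived from individual flow records roughly as follows. This is a sketch only: the input record layout (`scan_hour`, `scan_id`, `dstaddr`, `dstport`, `pkts`, `bytes`) is an assumption, and the function `aggregate_scans` is hypothetical rather than CERT's actual aggregation code.

```python
import ipaddress
from collections import defaultdict


def aggregate_scans(flows):
    """Group flows by (scan_hour, scan_id) and compute count features."""
    groups = defaultdict(list)
    for flow in flows:
        groups[(flow["scan_hour"], flow["scan_id"])].append(flow)

    rows = {}
    for key, fs in groups.items():
        hosts = {f["dstaddr"] for f in fs}
        rows[key] = {
            "FLOWS": len(fs),
            "PKTS": sum(f["pkts"] for f in fs),
            "BYTES": sum(f["bytes"] for f in fs),
            "UNIQ_PORTS": len({f["dstport"] for f in fs}),
            "UNIQ_HOSTS": len(hosts),
            "UNIQ_DSTS": len({(f["dstport"], f["dstaddr"]) for f in fs}),
            # Unique /24 prefixes among the scanned hosts.
            "SUBNETS": len(
                {ipaddress.ip_network(h + "/24", strict=False) for h in hosts}
            ),
        }
    return rows


flows = [
    {"scan_hour": 0, "scan_id": "s1", "dstaddr": "10.0.0.5", "dstport": 22, "pkts": 1, "bytes": 40},
    {"scan_hour": 0, "scan_id": "s1", "dstaddr": "10.0.0.6", "dstport": 22, "pkts": 2, "bytes": 80},
    {"scan_hour": 0, "scan_id": "s1", "dstaddr": "10.0.1.7", "dstport": 80, "pkts": 3, "bytes": 120},
]
rows = aggregate_scans(flows)
print(rows[(0, "s1")])
```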