This Conference brought to you by www.ttcus.com
@Techtrain
LinkedIn group: Technology Training Corporation
Neal Ziring
Technical Director, Capabilities
National Security Agency
U/OO/800671-17
Big Data Analytics and Mission – A view from NSA
4 April 2017
Structure
0. Introduction – importance of and taxonomy for analytics
1. Some conceptual models for analytics
2. Analytics Integration (w/ tool examples)
3. Some NSA lessons about analytics
Part 0 – Introduction
Why are Analytics So Important?
“We are drowning in data, but starving for knowledge!” – John Naisbitt, 1982
• Naisbitt’s sentiment is still valid today: modern IT allows easy collection and storage of data, but gaining knowledge and answering analytic questions is still hard.
• Analytic computations and processes are essential for extracting useful, actionable knowledge from volumes of data.
• “Big data” makes new forms of analysis possible, but collecting the right data is more important than just collecting lots of data.
• Technologies for building and running analytic processing have improved immensely since the 1990s, but utilizing them for effective analysis still requires care, foundational skills, and understanding of the subject area.
A Simple Taxonomy for Analytics
L1 Basic
• Summary & simple statistics
• Includes selection, counting, mean, range, variance, etc.
L2 Behavioral
• Extract relationships from multi-dimensional data, complicated statistics
• Identify simple groupings, norms, outliers
L3 Predictive
• Extraction of trends, correlations, models
• Generate new knowledge about datasets and their relationships
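As a minimal sketch of the first two levels, assuming Python and a made-up list of daily login counts: L1 reduces to summary statistics, and a simple L2 step flags values far from the norm. The names `basic_summary` and `outliers` and the threshold of 2 standard deviations are illustrative assumptions, not part of the slides.

```python
import statistics

def basic_summary(values):
    """L1 Basic: counting, mean, range, variance."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),
    }

def outliers(values, threshold=2.0):
    """L2 Behavioral (very simplified): values far from the norm.

    A value is flagged when it lies more than `threshold` population
    standard deviations from the mean. The threshold is an assumption.
    """
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical data: logins per day for one account.
daily_logins = [10, 12, 11, 13, 12, 11, 60]
summary = basic_summary(daily_logins)
flagged = outliers(daily_logins)
print(summary)
print(flagged)
```

L3 Predictive analytics would build on these by fitting a model to such data rather than only describing it.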
Further Divisions about Prediction
• What are the subjects of the prediction?
  • Natural activities – tend to be driven by randomness and cause-and-effect, e.g., machine failures, signal propagation, weather systems, disease progression
  • Social activities – driven by human motivations/reactions, plus randomness, e.g., shopping habits, stock prices, traffic congestion, pandemic spread
  • Covert activities – driven by human motivations and desire to evade prediction, e.g., terrorist attacks, military operations, money laundering
• How fast do you need a prediction?
  • Real-time – need a prediction within a fixed time interval
  • Active-time – need a prediction before immediate impacts of an activity occur (amount of time involved varies for different domains)
  • Non-real-time – need a prediction but time-frame is more flexible
• What certainty is necessary for predictions?
  • Certain – sufficient surety to take irrevocable action
  • Legal – sufficient to satisfy a legal, regulatory, or compliance test
  • Best-effort – sufficient to take a low-risk action
Part 1 – Conceptual Models for Analytics
1 – Series of Observations over Time
• Usual goals:
  • Simple: predictions about future observations
  • Complex: detecting and characterizing patterns in observations over time
• Key concerns:
  • Wide variety of techniques may apply
  • What features/properties of the observations are most important?
  • What aspects of future observations do you need to predict?
[Figure: observations along time axis t, with the future observation marked "?"]
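One of the many techniques that may apply to the "simple" goal is exponential smoothing, sketched here in Python. The function name, the smoothing factor `alpha`, and the sample readings are assumptions made for illustration; the slides do not prescribe a particular method.

```python
def exponential_smoothing_forecast(observations, alpha=0.5):
    """Predict the next observation from a series ordered by time.

    Each new observation pulls the running level toward itself by a
    factor of `alpha` (0 < alpha <= 1); the final level is the forecast.
    """
    if not observations:
        raise ValueError("need at least one observation")
    level = observations[0]
    for value in observations[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical series of four readings taken at equal time steps.
forecast = exponential_smoothing_forecast([12, 15, 14, 18], alpha=0.5)
print(forecast)
```

A larger `alpha` weights recent observations more heavily, which is one way of deciding which aspects of the history matter most for the prediction.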
2 – Analyzing Entity Behavior
• Focus on entities (people, hosts, networks, services, etc.)
• Usual goals:
  • Define or learn clusters/bins for entities based on behavior
  • Build up models of entities (e.g., state transition based) to project future behavior
  • Be able to predict behavior of new entities based on similarity to known entities
• Key concerns:
  • Identifying best features to use for the model
  • Dealing with missing data and noisy data
[Figure: behavior of entities e1, e2, e3 along time axis t]
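A state-transition model of a single entity can be sketched as a table of observed transition counts. This is a minimal Python illustration under assumed state names and an assumed history; `build_transition_model` and `predict_next` are hypothetical helpers, not tools named in the slides.

```python
from collections import Counter, defaultdict

def build_transition_model(states):
    """Count how often each state is followed by each other state."""
    counts = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    return counts

def predict_next(model, current_state):
    """Project the most frequently observed successor, or None if unseen."""
    if current_state not in model:
        return None
    return model[current_state].most_common(1)[0][0]

# Hypothetical observed behavior of one entity (e.g., a user account).
history = ["idle", "login", "browse", "logout",
           "idle", "login", "browse", "browse", "logout"]
model = build_transition_model(history)
print(predict_next(model, "browse"))
```

Returning `None` for an unseen state is one simple way to surface the missing-data concern instead of guessing.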
3 – Pattern mining and Sequence mining
• Focus on sequences of events (connections, transactions, start/stop, etc.)
• Usual goals:
  • Infer or learn sub-sequences that appear often or have properties of interest
  • Identify missing or anomalous events
  • Predict future events and their times
• Key concerns:
  • Preventing ‘state-space explosion’ in the model
  • Distinguishing meaningful sequences from noise
  • Determining optimal features of events for extracting patterns
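The first goal, finding sub-sequences that appear often, can be sketched by counting fixed-length windows over an event stream. The Python below is an illustrative assumption, not a method from the slides: the event names are invented, and capping the window `length` is only one simple way to keep the number of candidate patterns bounded.

```python
from collections import Counter

def frequent_subsequences(events, length=2, min_count=2):
    """Return contiguous sub-sequences of `length` events seen at least
    `min_count` times, mapped to their counts."""
    windows = Counter(
        tuple(events[i:i + length])
        for i in range(len(events) - length + 1)
    )
    return {seq: n for seq, n in windows.items() if n >= min_count}

# Hypothetical event log for one service.
events = ["connect", "auth", "read", "close",
          "connect", "auth", "write", "close",
          "connect", "auth", "read", "close"]
common = frequent_subsequences(events, length=2, min_count=3)
print(common)
```

Raising `min_count` is a crude filter for separating meaningful sequences from noise; real pattern-mining algorithms use more careful support and significance measures.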
4 – Measurements on Objects
• Focus on objects (files, programs, sessions, messages, transactions)
• Usual goals:
  • Group objects into classes or categories (e.g., malicious v. non-malicious)
  • Associate classes with features of interest, sources, or lineage
  • Given a new unknown object, determine the class into which it best fits
  • Identify outliers and anomalous objects
• Key concerns:
  • Identifying the most meaningful features to use to build models
  • Finding and maintaining a good, diverse training set of objects
  • Applying the model to new environments
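Fitting a new object to its best class can be illustrated with a nearest-centroid classifier, sketched below in Python. The two features (file size in KB and byte entropy), the training values, and the function names are all hypothetical; the features are left unscaled for brevity, which a real model would need to address.

```python
import math

def centroid(vectors):
    """Average each feature across a list of feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def train(labeled_vectors):
    """Build one centroid per class from {label: [feature vectors]}."""
    return {label: centroid(vecs) for label, vecs in labeled_vectors.items()}

def classify(model, features):
    """Assign the class whose centroid is closest (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Hypothetical training set: [size_kb, entropy] measurements per file.
training = {
    "benign": [[120, 4.1], [200, 4.5]],
    "malicious": [[40, 7.6], [60, 7.9]],
}
model = train(training)
print(classify(model, [50, 7.7]))
```

The quality of such a model depends directly on the concerns listed above: the features chosen and the diversity of the training set.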
Part 2 – Data Analytic Integration
(Realizing value depends on integrating analytics into mission flow)
Rough Integration Model
• Exploration identifies new useful analytic techniques or mission applications, and those are moved to Operation. Operation identifies new mission needs to drive Exploration. Both are used to drive Acquisition.
[Figure: stages grouped by phase]
ACQUISITION: Data Collection; Data Transport & Staging; QA, Transform & Ingest
EXPLORATION: Exploration/Characterization; Technique Development
OPERATION: Model Building & Sustainment; Production Analysis; Visualization, Presentation, Action
Basic Acquisition Stage Attributes
• Primary task: gather data necessary/useful for analysis, move it into the analytic platform(s).
Stage: Data Collection
• Basic requirements: Sense data from target environment; extract useful components
• Example tool(s): OpenDataKit, RedHawk
Stage: Transport & Staging
• Basic requirements: Package data into aggregates; assured transfer from point of collection to enterprise platforms
• Example tool(s): Google® QUIC, Tsunami
Stage: QA, Transform, & Ingest
• Basic requirements: Clean up and filter data; transform data into consumable format and add to repository
• Example tool(s): OpenRefine, Apache NiFi™
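The three acquisition stages can be pictured as a chain of small functions. The Python sketch below uses only the standard library and invented record fields (`host`, `event`, `bytes`); it does not use or imitate the example tools listed above, and a plain list stands in for the repository.

```python
import json

def collect(raw_lines):
    """Data Collection: extract the useful components from sensed data."""
    for line in raw_lines:
        parts = line.strip().split(",")
        if len(parts) == 3:  # skip lines that do not have the assumed shape
            yield {"host": parts[0], "event": parts[1], "bytes": parts[2]}

def stage(records, batch_size=2):
    """Transport & Staging: package records into aggregates (batches)."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def qa_transform_ingest(batches, repository):
    """QA, Transform & Ingest: filter, normalize, and store as JSON."""
    for batch in batches:
        for record in batch:
            if not record["bytes"].strip().isdigit():
                continue  # drop records that fail the quality check
            cleaned = {
                "host": record["host"].strip().lower(),
                "event": record["event"].strip(),
                "bytes": int(record["bytes"]),
            }
            repository.append(json.dumps(cleaned, sort_keys=True))

# Hypothetical raw input; the list `repository` stands in for a data store.
raw = ["HostA,login,120", "hostb,upload,n/a",
       "malformed line", "HostC,download,4096"]
repository = []
qa_transform_ingest(stage(collect(raw)), repository)
print(repository)
```

Because each stage is a generator, records flow through one at a time, which loosely mirrors how staging and ingest decouple collection from the analytic platform.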
Basic Exploration Stage Attributes
• Primary task: understand your data and develop the means for extracting mission value from it.
Stage: Exploration/Characterization
• Basic requirements: Support exploring/viewing data from multiple perspectives; sampling, filtering, rough display
• Example tool(s): OpenRefine, Divvy
Stage: Technique Development
• Basic requirements: Application of multiple strategies, model types, algorithms; support collaborative work
• Example tool(s): Jupyter Notebooks, R language
Basic Operation Stage Attributes
• Primary task: execute data analysis in a scalable & managed way to drive mission execution.
Stage: Model Building & Sustainment
• Basic requirements: Create & update analysis foundational assets (e.g., models)
• Example tool(s): TensorFlow™, MLlib, Oryx
Stage: Production Analysis
• Basic requirements: Perform analysis on incoming data; create result sets to drive activities; manage resources, prioritize
• Example tool(s): Apache Spark™, Apache Mesos™, Apache Apex™
Stage: Present, Visualize, Act
• Basic requirements: Present analytic results to users; drive mission actions from results; push expert feedback into models
• Example tool(s): iWeave, Apache ODE™, PredictionIO™
Part 3 – Some NSA Views on Analytics
Some Key Areas for Advanced Analytics
• Cyber defense
  • Compromise detection
  • File triage and malware characterization
  • Tradecraft analysis (see next slide)
• Empowering human analysts and operators
  • Analyst assistance – predict analyst needs and offer information proactively
  • Create complex analytic queries from natural language text
  • Language modeling
• Recommender systems
  • Suggest source material for analysts, reporters, operators
  • Suggest jobs of interest for individuals
• Intelligence collection
  • Intelligence Value Estimation
  • Optimize value derived from limited collection capacity
Some NSA Views on Analytics Development and Data Science
1. Data science is necessarily multi-disciplinary
  • To build up our data science cadre, NSA found success in establishing mission-focused rotations through a multi-disciplinary dedicated lab (iCafe).
  • Collaboration between programmers, mathematicians, platform experts, and domain experts is very powerful – each learns from the others.
  • Drive analytics work with access to real data and real mission problems.
2. Data volume is often less of a problem than data speed.
  • Defense and intelligence often require analytic answers fast.
  • Use deep, batch analysis to build models; use streaming analysis to apply them to incoming data (see next slide)
3. Effective data science requires strong foundations.
  • Basic statistics and data mining
  • Core computer science
  • Understanding of the problem domain
Example: Hybrid batch/streaming analytics
[Figure: hybrid batch/streaming architecture. Labels: events, event data store, streaming analytic platform, machine-learning analytic platform, model, results, analyst, feedback, action]
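The split between batch model building and streaming application can be sketched in a few lines of Python. Here the "model" is just a mean and standard deviation learned from a stored history, and the streaming step flags incoming values that deviate strongly from it. The data, the threshold of 3, and the function names are assumptions for illustration; the analyst feedback loop in the figure is omitted.

```python
import statistics

def build_model(event_store):
    """Batch side: learn a baseline from the stored history of events."""
    return {
        "mean": statistics.mean(event_store),
        "stdev": statistics.pstdev(event_store),
    }

def stream_scores(events, model, threshold=3.0):
    """Streaming side: apply the model to each event as it arrives.

    Yields (value, is_anomalous) pairs, one event at a time.
    """
    for value in events:
        if model["stdev"] == 0:
            z = 0.0
        else:
            z = abs(value - model["mean"]) / model["stdev"]
        yield value, z > threshold

# Hypothetical stored history and a short burst of incoming events.
event_store = [100, 102, 98, 101, 99]
model = build_model(event_store)
incoming = [100, 130, 97]
alerts = [value for value, flagged in stream_scores(incoming, model) if flagged]
print(alerts)
```

In a fuller version, analyst-confirmed results would be appended to the event store and `build_model` rerun periodically, so the deep batch analysis keeps the streaming model current.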
Wrap-up
Conclusions
• Big data analytics allow us to answer analytic questions and guide operations in new ways.
  • Finding subtle, unexpected, actionable relationships
  • Extracting very-deeply-buried knowledge
  • Modeling very complex behaviors
• There are many ways to view analytics
  • Begin with exploratory analysis, try & compare different approaches
  • Understand the mission need that the analytic will address
  • Make your analytics only as complex as they need to be
• To drive mission value, analytic strategy must span your enterprise
  • Every stage matters, from initial collection to final presentation and action
  • Many tools and products are available for every step; choose ones that fit your situation
• Powerful analytics can commit powerful mistakes if misapplied
  • Don’t apply analytic techniques/algorithms/packages blindly.
  • Build multi-disciplinary teams with math, computing, and domain expertise
  • Validate analytics before promoting them to production status
Backup Slides
End Notes
• Google® is a registered trademark of Google Inc.
• Apache NiFi™ is a trademark of The Apache Software Foundation
• TensorFlow™ is a trademark of Google Inc.
• Apache Spark™, Mesos™, Apex™, ODE™, and PredictionIO™ are trademarks of The Apache Software Foundation
• iWeave™ is a trademark of Campbell, Steven I.