DETECTING ANOMALIES IN STREAMING DATA
Data By The Bay, May 19, 2016
Subutai Ahmad, @SubutaiAhmad, [email protected]
OUTLINE
• Real-time streaming analytics
• Anomaly detection with Hierarchical Temporal Memory
• Benchmarking real-time anomaly detection
• Summary
• Monitoring IT infrastructure
• Uncovering fraudulent transactions
• Tracking vehicles
• Real-time health monitoring
• Monitoring energy consumption
Detection is necessary, but prevention is often the goal
REAL-TIME ANOMALY DETECTION
• Exponential growth in IoT, sensors and real-time data collection is driving an explosion of streaming data
• The biggest application for machine learning is anomaly detection
EXAMPLE: PREVENTIVE MAINTENANCE
Planned shutdown
Behavioral change preceding failure
Catastrophic failure
THE STREAMING ANALYTICS PROBLEM
Given all past input and current input, decide whether the system behavior is anomalous right now.
Must report decision, perform any retraining, bookkeeping, etc. before next input arrives.
• No look-ahead
• No training/test set split: everything must be done online
• System must be automated, and customized to each stream
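The constraints above can be sketched as a processing loop. This is a minimal illustration only: `RunningMeanDetector` and `run_stream` are hypothetical names, and the running-mean rule is a simple stand-in for a real detector, not HTM. The point is the ordering: each input is judged using only past data, the model is then updated, and both happen before the next input is read.

```python
from typing import Iterable, Iterator, Tuple


class RunningMeanDetector:
    """Hypothetical stand-in detector: flags values far from a running mean."""

    def __init__(self, threshold: float = 3.0) -> None:
        self.n = 0          # number of values seen so far
        self.mean = 0.0     # running mean
        self.m2 = 0.0       # running sum of squared deviations (Welford)
        self.threshold = threshold

    def handle(self, value: float) -> bool:
        # 1. Decide, using only what was learned from past inputs.
        anomalous = False
        if self.n >= 2:
            std = (self.m2 / (self.n - 1)) ** 0.5
            anomalous = std > 0 and abs(value - self.mean) > self.threshold * std
        # 2. Learn from the current input before the next one arrives.
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
        return anomalous


def run_stream(detector: RunningMeanDetector,
               stream: Iterable[float]) -> Iterator[Tuple[float, bool]]:
    """Single pass, no look-ahead, no separate training phase."""
    for value in stream:
        yield value, detector.handle(value)
```

For example, `list(run_stream(RunningMeanDetector(), readings))` yields one `(value, flag)` pair per reading, in arrival order.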
HIERARCHICAL TEMPORAL MEMORY (HTM)
• Powerful sequence memory derived from recent findings in experimental neuroscience
• High-capacity, memory-based system
• Models temporal sequences in data
• Inherently streaming
• Continuously learning and predicting
• No need to tune hyper-parameters
• Open source: github.com/numenta
HTM PREDICTS FUTURE INPUT
• Input to the system is a stream of data
• Encoded into a sparse high dimensional vector
• Learns temporal sequences in input stream and makes a prediction in the form of a sparse vector
• The output sparse vector represents a prediction for upcoming input
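To make the encoding step concrete, here is a minimal sketch of one way to turn a scalar into a sparse, high-dimensional binary vector. The function name `encode_scalar` and the parameters `n` (vector width) and `w` (number of active bits) are assumptions chosen for illustration; this is not the NuPIC encoder API. The property worth noticing is that nearby values share active bits while distant values do not.

```python
from typing import Set


def encode_scalar(value: float, min_val: float, max_val: float,
                  n: int = 400, w: int = 21) -> Set[int]:
    """Return the indices of the w active bits out of n (hypothetical encoder)."""
    # Clip so that out-of-range inputs still map to a valid position.
    value = max(min_val, min(max_val, value))
    buckets = n - w + 1  # number of distinct start positions
    fraction = (value - min_val) / (max_val - min_val)
    start = int(round(fraction * (buckets - 1)))
    # A contiguous run of w bits; neighbouring values overlap heavily.
    return set(range(start, start + w))
```

With the defaults, about 5% of the bits are active, and the overlap between two encodings shrinks as the values move apart.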
ANOMALY DETECTION WITH HTM
Pipeline: HTM → raw anomaly score → anomaly likelihood
The raw anomaly score is an instantaneous measure of prediction error:
• 0 if the input was perfectly predicted
• 1 if it was completely unpredicted
• Could threshold it directly to report anomalies, but in very noisy environments we can do better
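One way to obtain a score with the 0-to-1 behavior described above is to measure how much of the current input's sparse representation was covered by the previous prediction. The sketch below assumes both vectors are given as sets of active bit indices; the function name `raw_anomaly_score` is illustrative and not taken from the slides.

```python
from typing import Set


def raw_anomaly_score(predicted: Set[int], active: Set[int]) -> float:
    """Fraction of currently active bits that were NOT predicted."""
    if not active:
        # Assumption: an empty input carries no evidence of an anomaly.
        return 0.0
    overlap = len(predicted & active)
    return 1.0 - overlap / len(active)
```

A fully predicted input scores 0.0, a fully unexpected one scores 1.0, and partial matches fall in between.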
ANOMALY LIKELIHOOD
• Second order measure: did the predictability of the metric change?
1. Estimate historical distribution of anomaly scores
2. Check if recent scores are very different
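The two steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the historical distribution of raw scores is modeled as a Gaussian over a sliding window, "recent" means a short moving average, and the class name, window sizes, and the neutral value 0.5 returned during warm-up are all choices made for this example rather than details from the talk.

```python
import math
from collections import deque


class AnomalyLikelihood:
    """Hypothetical sketch: how unusual is the recent average raw score?"""

    def __init__(self, history_size: int = 1000, recent_size: int = 10) -> None:
        self.history = deque(maxlen=history_size)  # long window of raw scores
        self.recent = deque(maxlen=recent_size)    # short window of raw scores

    def update(self, raw_score: float) -> float:
        self.history.append(raw_score)
        self.recent.append(raw_score)
        n = len(self.history)
        if n < 2:
            return 0.5  # not enough data to estimate a distribution

        # Step 1: estimate the historical distribution (mean and std).
        mean = sum(self.history) / n
        var = sum((s - mean) ** 2 for s in self.history) / (n - 1)
        std = max(math.sqrt(var), 1e-6)  # guard against zero variance

        # Step 2: compare the recent average against that distribution.
        recent_mean = sum(self.recent) / len(self.recent)
        z = (recent_mean - mean) / std
        # 1 - Q(z), where Q is the Gaussian tail probability.
        return 1.0 - 0.5 * math.erfc(z / math.sqrt(2))
```

Values near 1.0 indicate that recent prediction errors are much higher than usual for this stream, so a noisy but stable metric stays near 0.5 while a genuine change in predictability pushes the value toward 1.0.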
ANOMALY DETECTION WITH HTM
• HTM: learns temporal sequences, continuously makes predictions, continuously learning
• Raw anomaly score: was the current input predicted?
• Anomaly likelihood: has the level of predictability changed significantly?
ANOMALIES IN IT INFRASTRUCTURE
• Grok
  • Commercial server-based product that detects anomalies in IT infrastructure
  • Runs thousands of HTM anomaly detectors in real time
  • 10 milliseconds per input per metric, including continuous learning
  • No parameter tuning required
  • http://grokstream.com
ANOMALIES IN FINANCIAL DATA
• HTM for Stocks
  • Real-time free demo application
  • Continuously monitors top 200 stocks
  • Available on iOS App Store or Google Play Store
  • Open source application: github.com/numenta/numenta-apps
OUTLINE
• Real-time streaming analytics
• Anomaly detection with Hierarchical Temporal Memory
• Benchmarking real-time anomaly detection
• Summary
EVALUATING STREAMING ANOMALY DETECTION
• Most existing benchmarks are designed for batch data, not streaming data
• Hard to find benchmarks containing real world data labeled with anomalies
• There is a need for an open benchmark designed to test real-time anomaly detection
• A standard community benchmark could spur innovation in streaming anomaly detection algorithms
NUMENTA ANOMALY BENCHMARK (NAB)
• NAB: a rigorous benchmark for anomaly detection in streaming applications
• Real-world benchmark data set
  • 58 labeled data streams (47 real-world, 11 artificial streams)
  • Total of 365,551 data points
• Scoring mechanism
  • Rewards early detection
  • Different “application profiles”
• Open resource
  • AGPL repository contains data, source code, and documentation
  • github.com/numenta/NAB
  • Ongoing competition to expand NAB
HOW SHOULD WE SCORE ANOMALIES?
• The perfect detector
  • Detects anomalies as soon as possible
  • Provides detections in real time
  • Triggers no false alarms
  • Requires no parameter tuning
  • Automatically adapts to changing statistics
• Scoring methods in traditional benchmarks are insufficient
  • Precision/recall does not incorporate importance of early detection
  • Artificial separation into training and test sets does not handle continuous learning
  • Batch data files allow look-ahead and multiple passes through the data
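To illustrate how a score can reward early detection where plain precision/recall cannot, here is a sketch of a position-dependent weighting based on a scaled sigmoid. The position convention (-1.0 at the start of a labeled anomaly window, 0.0 at its end, positive values after it) and the steepness constant 5 are assumptions for this example; the authoritative scoring definition is the one in the NAB repository.

```python
import math


def scaled_sigmoid(rel_pos: float) -> float:
    """Weight for a detection at a relative position in an anomaly window.

    rel_pos = -1.0 is the start of the window, 0.0 is its end, and
    positive values lie after the window (assumed convention).
    """
    return 2.0 / (1.0 + math.exp(5.0 * rel_pos)) - 1.0
```

Under these assumptions a detection at the start of the window earns close to +1, one at the very end earns 0, and one well after the window approaches -1, so two detectors with identical precision and recall can still receive different scores.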
OTHER DETAILS
• Application profiles
  • Three application profiles assign different weightings based on the tradeoff between false positives and false negatives: standard, favor low false positives, favor low false negatives.
  • EKG monitoring of a cardiac patient tolerates false positives, because a missed anomaly is far more costly.
  • IT / DevOps professionals hate false positives.
• NAB emulates practical real-time scenarios
  • Look-ahead is not allowed for algorithms. Detections must be made on the fly.
  • No separation between training and test files. Invoke model, start streaming, and go.
  • No batch parameter tuning. Must be fully automated with a single set of parameters across data streams. Any further parameter tuning must be done on the fly.
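The effect of application profiles can be sketched with a simple weighted count. The weights below are illustrative values chosen for this example, and `Profile`, `PROFILES`, and `weighted_score` are hypothetical names; the actual profile weights are defined in the NAB repository. The sketch also ignores detection timing to keep the idea visible: the same detector output can rank differently depending on how the application prices false positives and false negatives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    tp: float  # reward per true positive
    fp: float  # penalty per false positive (negative)
    fn: float  # penalty per missed anomaly (negative)


# Illustrative weights only, not taken from the talk.
PROFILES = {
    "standard": Profile(tp=1.0, fp=-0.11, fn=-1.0),
    "reward_low_fp": Profile(tp=1.0, fp=-0.22, fn=-1.0),
    "reward_low_fn": Profile(tp=1.0, fp=-0.11, fn=-2.0),
}


def weighted_score(profile: Profile, tp: int, fp: int, fn: int) -> float:
    """Combine detection counts using the profile's weights."""
    return profile.tp * tp + profile.fp * fp + profile.fn * fn
```

For instance, a detector that catches every anomaly but raises ten false alarms beats a quieter detector that misses one anomaly under the low-false-negative profile, and loses to it under the low-false-positive profile.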
TESTING ALGORITHMS WITH NAB
• NAB is designed to easily plug in and test new algorithms
• Results with several algorithms:
  • Hierarchical Temporal Memory
  • Etsy Skyline
    • Popular open source anomaly detection technique
    • Mixture of statistical experts, continuously learning
  • Twitter ADVec
    • Open source anomaly detection released last year
    • Robust outlier statistics + piecewise approximation
  • Bayesian Online Change Point Detection
    • Formal Bayesian method for detecting anomalies in time series
DETECTION RESULTS: CPU USAGE ON PRODUCTION SERVER
[Figure: detections by Etsy Skyline, Numenta HTM, and Twitter ADVec; red denotes a false positive. Annotations: simple spike, all 3 algorithms detect; shift in usage.]
DETECTION RESULTS: MACHINE TEMPERATURE READINGS
[Figure: detections by Etsy Skyline, Numenta HTM, and Twitter ADVec; red denotes a false positive. Annotations: HTM detects a purely temporal anomaly; all 3 detect the catastrophic failure.]
DETECTION RESULTS: TEMPORAL CHANGES IN BEHAVIOR OFTEN PRECEDE A LARGER SHIFT
[Figure: detections by Etsy Skyline, Numenta HTM, and Twitter ADVec; red denotes a false positive. Annotation: HTM detects the anomaly 3 hours earlier.]
NAB COMPETITION!!
• NAB is a resource for the streaming analytics community
  • Need additional real-world data files and more algorithms tested
• NAB Competition offers cash prizes for:
  • Additional anomaly detection algorithms tested on NAB
  • Submission of real-world data files with labeled real anomalies
• Cash prizes of $2,500 each for algorithms and data
  • Easy to enter, high likelihood of winning!
• Go to http://numenta.org/nab for details
SUMMARY
• Anomaly detection for streaming data imposes unique challenges
  • Stringent real-time constraints and automation requirements
  • Typical batch methodologies do not work well
• HTM learning algorithms
  • Can be used to create a streaming anomaly detection system
  • Perform very well across a wide range of datasets
  • Open source, commercially deployable
• NAB is an open source benchmark for streaming anomaly detection
  • Includes a labeled dataset with real-world data
  • Scoring methodology designed for practical real-time applications
  • NAB competition!
RESOURCES
Grok (anomalies in IT infrastructure): http://grokstream.com
HTM Studio (desktop app for easy experimentation): contact me
Open Source Repositories:
Algorithm code: https://github.com/numenta/nupic
HTM Stocks demo: https://github.com/numenta/numenta-apps
NAB code + paper: https://github.com/numenta/nab
Apache Flink: https://github.com/nupic-community/flink-htm
Contact info:
Subutai Ahmad [email protected], @SubutaiAhmad
Alex Lavin [email protected], @theAlexLavin