Application Identification in information-poor environments
Charalampos Rotsos
02/02/2010 1
• What is application identification
• Current status
• My work
• Future plans
• Open questions
Taxonomy of Application identification techniques
• Deep Packet Inspection
– Match payload against well-known protocol signatures
• Statistical Analysis
– Extract network measurements (packet size, packet inter-arrival time) and search for patterns (ML, statistical analysis, etc.)
• Behavioral/Graph Analysis
– Find connection patterns
– Create features based on the connection graph
Statistical Analysis
Focused on flow features
• Which features are high-quality?
• Which features are computationally simple?
• Candidates: packet size, inter-packet rate, TCP header information, flow duration?
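To make the candidate features concrete, the following is a minimal sketch of how they could be computed for a single flow. The `Packet` record and the `flow_features` helper are hypothetical names introduced for illustration; they are not part of the original work.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Packet:
    """Hypothetical per-packet record; field names are assumptions."""
    timestamp: float   # capture time in seconds
    size: int          # packet size in bytes
    tcp_flags: int = 0 # raw TCP flag bits from the header


def flow_features(packets):
    """Summarise one flow with the candidate features listed on the slide."""
    if not packets:
        raise ValueError("flow has no packets")
    pkts = sorted(packets, key=lambda p: p.timestamp)
    sizes = [p.size for p in pkts]
    # Inter-arrival gaps between consecutive packets of the flow.
    gaps = [b.timestamp - a.timestamp for a, b in zip(pkts, pkts[1:])]
    flags = 0
    for p in pkts:
        flags |= p.tcp_flags  # union of all TCP flags seen in the flow
    return {
        "packets": len(pkts),
        "mean_size": mean(sizes),
        "std_size": pstdev(sizes),
        "mean_gap": mean(gaps) if gaps else 0.0,
        "duration": pkts[-1].timestamp - pkts[0].timestamp,
        "tcp_flags_or": flags,
    }
```

All of these are cheap to compute in one pass, but they do require per-packet visibility, which is what makes them hard to extract in practice.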
Progress so far
• The problem is solved
– 5 packets are sufficient to classify a flow
– Achieve at least 90% accuracy on all classes
• But not really…
– Difficult to extract required features
– Identification accuracy
– Temporal stability is awful
– Technical issues:
Can we do better?
• Restate the problem.
• Use information that can be extracted from current networks (e.g. SNMP, NetFlow).
• Use better machine learning.
• Define models that bridge the gap between statistical and behavioral properties.
Better ML on NetFlow
• Semi-supervised learning on NetFlow data using Bayesian data analysis.
– Better performance than the Bayes classifier in Weka
– Bayesian modeling provides good parameterization
– Efficient reduction of the effect of time dependence of the feature set
– Temporal and spatial decay
– Difficult to balance between a model that is both accurate and flexible
– NetFlow doesn’t provide clean separation of classes
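As an illustration of what semi-supervised learning on flow features can look like, here is a minimal sketch of one common scheme: a Gaussian naive Bayes model refined with EM over unlabeled flows. This is a swapped-in textbook technique, not necessarily the Bayesian model used in this work. All function names are hypothetical, and the sketch assumes every class has at least one labeled flow.

```python
import math


def _log_gauss(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def _fit(X, R, n_classes, min_var=1e-3):
    """Weighted ML fit; R[i][k] is the weight of sample i for class k."""
    n_feat = len(X[0])
    total = sum(sum(r) for r in R)
    model = []
    for k in range(n_classes):
        w = sum(r[k] for r in R)  # assumed > 0: each class has a labeled flow
        mus, variances = [], []
        for j in range(n_feat):
            mu = sum(r[k] * x[j] for x, r in zip(X, R)) / w
            var = sum(r[k] * (x[j] - mu) ** 2 for x, r in zip(X, R)) / w
            mus.append(mu)
            variances.append(max(var, min_var))  # floor avoids zero variance
        model.append((w / total, mus, variances))
    return model


def posterior(model, x):
    """Class posterior for one feature vector, computed in log space."""
    logs = [
        math.log(prior)
        + sum(_log_gauss(xj, m, v) for xj, m, v in zip(x, mus, variances))
        for prior, mus, variances in model
    ]
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    z = sum(exps)
    return [e / z for e in exps]


def fit_semi_supervised(X_lab, y_lab, X_unlab, n_classes, iterations=20):
    # Labeled flows keep hard (one-hot) responsibilities throughout.
    hard = [[1.0 if y == k else 0.0 for k in range(n_classes)] for y in y_lab]
    model = _fit(X_lab, hard, n_classes)
    for _ in range(iterations):
        soft = [posterior(model, x) for x in X_unlab]          # E-step
        model = _fit(X_lab + X_unlab, hard + soft, n_classes)  # M-step
    return model


def predict(model, x):
    p = posterior(model, x)
    return max(range(len(p)), key=p.__getitem__)
```

The unlabeled flows widen the class variances beyond what a handful of labeled examples would give, which is the main benefit when labeled NetFlow data is scarce.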
What is next?
• Richer dataset
– Aggregate flows for ports/hosts/networks
– Increase dimensions by simple feature engineering.
• Better mathematical models
– Incorporate domain-specific knowledge.
– Inference diagram defined by the connection graph.
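The "richer dataset" idea can be sketched as follows: group flow records by a chosen key (here a server endpoint) and derive a few extra dimensions from the totals. The `FlowRecord` fields are a simplified, hypothetical subset of what a NetFlow export carries, and `aggregate` is an illustrative helper rather than the code used in this work.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowRecord(NamedTuple):
    """Simplified flow record; field names are assumptions."""
    src: str
    dst: str
    dst_port: int
    packets: int
    octets: int


def aggregate(records, key=lambda r: (r.dst, r.dst_port)):
    """Aggregate flow records per key and derive simple engineered features."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    out = {}
    for k, rs in groups.items():
        packets = sum(r.packets for r in rs)
        octets = sum(r.octets for r in rs)
        out[k] = {
            "flows": len(rs),
            "packets": packets,
            "octets": octets,
            "peers": len({r.src for r in rs}),  # distinct clients seen
            # Engineered features: ratios of the raw counters.
            "octets_per_packet": octets / packets if packets else 0.0,
            "packets_per_flow": packets / len(rs),
        }
    return out
```

Passing a different `key`, for example `lambda r: r.dst` for hosts or a prefix of the address for networks, gives the other aggregation levels mentioned above.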
Inference Diagram
Diagram labels: Alice (Web-browser); Web Server; Bob (Web-browser)
• The flows between Alice and the web server are correlated and correspond to the same application.
• The flows Alice - web server and Bob - web server also correspond to the same application.
• Research on application identification hasn’t found a framework to accommodate these observations.
Inference Diagram – more difficult
Diagram labels: Alice (uses random ports); Bob (uses random ports); FTP Server – port 22; Web Server – port 80; Database Server – port 1680
• Computers will run multiple applications in parallel.
• BUT, applications on a particular server will always use a specific port.
A first approach!
• A similar problem arises in node labeling
– Aggregate flow records over some defined period
– Use a Markov Random Field model for inference propagation
– Apply approximate inference methods (Gibbs sampling, Message Passing)
– In the end, apply some engineering ideas to refine results
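The steps above can be sketched with a small pairwise Markov Random Field over the connection graph, sampled with Gibbs sampling. This is a minimal illustration under stated assumptions: a Potts-style coupling that rewards connected nodes sharing a label, per-node log-scores (`unary`) coming from some earlier classifier, and every node on an edge also present in `unary`. The function name and parameters are hypothetical.

```python
import math
import random
from collections import defaultdict


def gibbs_labels(unary, edges, coupling=1.0, sweeps=500, burn_in=100, seed=0):
    """Estimate per-node label marginals of a Potts-style MRF.

    unary: {node: [log-score per label]}; edges: iterable of (node, node).
    """
    rng = random.Random(seed)
    nodes = list(unary)
    n_labels = len(next(iter(unary.values())))
    neighbours = defaultdict(list)
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    # Start from the best label under the unary term alone.
    state = {n: max(range(n_labels), key=unary[n].__getitem__) for n in nodes}
    counts = {n: [0] * n_labels for n in nodes}
    for sweep in range(sweeps):
        for n in nodes:
            # Conditional log-score: own evidence plus agreeing neighbours.
            scores = [
                unary[n][k]
                + coupling * sum(state[m] == k for m in neighbours[n])
                for k in range(n_labels)
            ]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            state[n] = rng.choices(range(n_labels), weights=weights)[0]
        if sweep >= burn_in:
            for n in nodes:
                counts[n][state[n]] += 1
    kept = sweeps - burn_in
    return {n: [c / kept for c in counts[n]] for n in nodes}
```

For example, with nodes for Alice, Bob and the web server's port 80, confident evidence on the server node alone pulls the uninformative client nodes towards the same application label. Message passing would replace the sampling loop with deterministic updates over the same graph.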
Open problems
• Is the model a good approximation?
• What am I classifying and for how long?
• Ports, hosts or networks? Is it possible to do multi-layer analysis?
• Do the approximation techniques converge?
Turning the difficulty to “Eleven”…
• Compute the performance of individual traffic within a VPN… by monitoring alone.
Thank you!!!!