Five-sigmaNetworkEvents(andhowtofindthem)
JohnO'NeilEdgewiseNetworksHalloween,2018
Networks are Complex
�2
• Nooneknowswhat'sgoingon
Finding the Strange & Unusual
�3
• Orthenew&unexpected🎃
• …andifit’sdifferent,itmightbebad.☠
• Outlier—Improbabledatapointintheexpecteddistribution🦔
• Anomaly—Datapointgeneratedbyadifferentdistribution👽
Mr. Splanky
�4
Anomaly & Outlier Tools
�5
“Ifyouwantsomethingdoneright,doityourself.”—Charles-GuillaumeÉtienne
Using Python
�6
• Interpretablepseudocode
• Maturelibraries.
• Easytoinstall
• Fastenough😀
Creating Tools for Outlier Detection
�7
• IntroducingafewtoolswritteninPython
• Intendedtoanswerinterestingquestionsandscalewell
• Easytomodify/improvetosatisfyyourcuriosity
• Astartingpointforyourowntools
• Codeisavailableat:
http://github.com/EdgewiseNetworks/five-sigma
Discover Bad Things Before Big Problems
�8
• Keeptrackofnetflowsacrossmachinesandacrosstime
• Wellenoughtorecognizeunusualthings
• Buttoomuchinformation
• Andmakeittunable
Standard Deviation
�9
5σ ≈ 10−6
3.29σ ≈ 10−3
Theamountof“spread”ina(usuallyGaussian)distribution.
Project Overview
�10
1. Createafeedoftypicalnetflows• Basedonrealnetflowsbutanonymized• Format:timestamp,src_ip,src_port,dest_ip,dest_port,flow_count
2. Createaconsumerforthesenetflows.
3. Createanumberofconsumertoolstotrackinterestingstatistics.• Standarddeviations• Updateperiod
Examples Of Useful Information
�11
1. DoesanIPaddresskeepscanningfornewopenports?
2. DidanIPaddresssuddenlygetalotbusierthanit’severbeeninthepast?
3. DidanIPaddresssuddenlygetalotbusierthananyotherIPaddress?
4. Shouldn’tthisIPaddresshavestoppeddoingnewthingsbynow?
Tools To Use: Sketching & Streaming
�12
• Lotsofdatatokeeptrackof
• Butwe'reonlyinterestedincertainaspectsofit- Setcardinality—HyperLogLog- Incrementalmeans&standarddeviations- Onlinelinearregression
• Makebigdataintosmalldata
Other Examples of Approximate Probabilistic Sketches
�13
• BloomFilter(setmembership)
• Count-MinSketch(countingitems)
• MinHash(setintersection)
• Locality-SensitiveHashing(LSH:nearestneighbors)
• Q-digest/T-digest(quantiledistribution—MOREABOUTTHISLATER)
IpPortScanDetector
�14
Q:DoesanIPaddresskeepscanningfornewopenports?
Contains:{IP_address:HyperLogLog}mapEachHLLcountsdistinctIP:portdestinations.
At each period:mean, sigma = Stdev(hll.cardinality() for every HLL)For each IP_address & HLL :if HLL.cardinality() > N sigmas above the mean:report it.
GrowthDetector
�15
Q:DidanIPaddresssuddenlygetalotbusierthanit’severbeeninthepast?
Contains:{IP_address:HyperLogLog}—periodCardinalityMapEachHLLcountsdistinctIP:portdestinationsoveralltime.
{IP_address:StdDev}—periodStatisticsMapEachStdDevincrementallycalculatesmeansandstdevs.
At each period:For each IP_address & HLL & StdDev:currCount = HLL.cardinality()mean, sigma = StdDev.getMeanAndStdev()if currCount > N sigmas above its mean:report it.
HLL.clear()StdDev.add(currCount, current_period)
ExplosionDetector
�16
Q:DidanIPaddresssuddenlygetalotbusierthananyotherIPaddress?
Contains:{IP_address:HyperLogLog}—periodCardinalityMapEachHLLcountsdistinctIP:portdestinationsincurrentperiod.
At each period:mean, sigma = Stdev(hll.cardinality() for every HLL)For each IP_address, hll:curr = hll.cardinality()if curr > N sigmas above the mean:report it.
hll.clear()
Host Stabilization
�17
• Assumean“exponentialdecay”ofnewIP:portcontactsovertime
• Weknowhowmanywe’veseen,butnothowmanyareleft.
• CanweestimateN_remgivenN_obs?Somecalculuslater…whyyes,wecan.
Nrem ≈ − slope(xi) × avg(xi)
y ∼ e−ax
HostStabilizationDetector
�18
Q:Shouldn’tthisIPaddresshavestoppeddoingnewthingsbynow?
Contains:{IP_address:HyperLogLog}—cardinalityMapEachHLLcountsdistinctIP:portdestinationsoveralltime.
{IP_address:StdDev}—periodAverageMapEachStdDevincrementallycalculatesmeansandstdevs.
{IP_address:IncrLinReg}—IncrementalLinearRegressionMapEachStdDevincrementallycalculatesmeansandstdevs.
At each period:For each IP_address, HLL, StdDev, IncrLinReg:N_obs = HLL.cardinality()slope, intercept = IncrLinReg.estimate()avg = StdDev.getMean()N_rem = -slope * avgreportIfDisagree(N_rem < tol, IP_address.frozen)IP_address.setFrozen(N_rem < tol){IncrLinReg, StdDev}.update(N_obs, current_period)
Demo Time!
�19
But is it Gaussian?
�20
• “Long-tail”or“fat-tail”distributions?
• Trypowerlaworlog-linearfitting• Andmanyothers?• Butthiscangetcomplicated….
• ReplaceStdDevwithtdigest.TDigest
Conclusions
�21
• Withouttheagonizingpain
• PythondatasciencetoolsFTW
• Coolsketching&streamingdatastructures
• “Alittlelearningisadangerousthing”…andalittlestatisticsisevenbetter!
• Onlythebeginning—lotsofroomforimprovement
TheEndThanksforattending!
Suggestedquestions
1. HowdoIinstallPython,again?2. WhatcanIdowithflow_countsinmynetflows?3. ShowmethecalculusforestimatingNrem!4. So,whatistherealstatisticaldistributionofthatdata?5. HowdoesHyperLogLogwork?
http://github.com/EdgewiseNetworks/five-sigma