Hailuoto Workshop: A Statistician's Adventures in Internetland. J. S. Marron, Department of Statistics and Operations Research, University of North Carolina

1 Hailuoto Workshop: A Statistician's Adventures in Internetland
  J. S. Marron, Department of Statistics and Operations Research, University of North Carolina
  February 9, 2016

2 A Menu of Interesting Issues
  Bin Count Time Series: Long Range Dependence?
  Point Process of Flow Start Times
  Duration Distributions (heavy tails)
  Heavy tail Durations => LRD
  Relationship between Size and Duration?
  Time series of packets within flows?

3 Long Range Dependence Controversy
  Initial Models: Queuing Theory, Short Range Dependence
  Aggregation of Mice and Elephants: Heavy tail Durations => LRD
    Mandelbrot, Taqqu, Paxson, Willinger, ...
  More recently: Aggregation of Point Processes => Poisson
    Cleveland, et al.

4 Explanation of Controversy
  Zooming Autocorrelation: Depends on Scale (i.e. binwidth, m)
  Fine Scales (< 1 ms): ~ White Noise (Poisson)
  Medium Scales (~ 10 ms): Dependence lifts up
  Coarse Scales (> 1 sec): Consistent with LRD

5 Long Range Dependence Theory
  Self-Similar:
  H: Hurst parameter
  Increments:
  LRD: If

6 Drawbacks of Conventional Time Series Methods
  Clumsy at Modeling LRD
    E.g. ARMA etc. are all SRD
  Assumption of Stationarity
    Really need local stationarity
  Assumption of Linear Processes
    Doesn't make physical sense
    Instead have aggregation of flows
    Correlation for heavy tailed distributions?

7 An H estimation approach: Wavelets
  : wavelet coefficients

8 Estimation of H, Based on Wavelet Spectrum
  Properties
  Weighted linear regression on
  Estimation of H: Abry & Veitch (1998)
  Robust to nonstationarities (linear trend)
  : uncorrelated

9 Example Wavelet Spectrum (FGN, H=0.9)

10 Experience with Hurst parameter estimation
  Toy Data: Excellent (Poisson data is flat, FGN linear)
  Real Data: More challenging
    Studied ~30 two-hour time blocks, 2002
    H estimation makes sense (~ 0.8-0.9) for many cases, i.e. FGN is a reasonable model
    But there were some very strange cases (H >> 1)

11 Real Data (nice): 2002 Apr 13 Sat 19:30-21:30

12 Real Data (ugly): 2002 Apr 13 Sat 1 pm-3 pm

13 Explanatory Tool: SiZer
  SIgnificance of ZERo crossings of the derivative of the smooths in scale space: Chaudhuri and Marron (1999)
  Exploratory smoothing method: Are bumps really there?
  Consider all smoothing levels
  Study (simultaneous) C.I.s for slope (derivative) of smooth
  Combine with statistical inference and visualization
    Blue: slope significantly upwards
    Red: slope significantly downwards
    Purple: insignificant slope

14 SiZer Example
  British Incomes Data
    Kernel Density Estimation
    Two modes really there!
  Bralower's Fossil Data
    Local Linear Regression
    Smaller valley not there

15 Dependent SiZer
  Park, Marron, and Rondonotti (2004)
  SiZer compares data with white noise
    Inappropriate in time series
  Dependent SiZer compares data with an assumed model
    Goodness of fit test

16 Dependent SiZer: 2002 Apr 13 Sat 1 pm-3 pm

17 Zoomed view (to red region, i.e. flat top)

18 Further Zoom: finds very periodic behavior!

19 Revisit: 2002 Apr 13 Sat 1 pm-3 pm

20 Quick Check: Delete periodic time block

21 Possible Physical Explanation
  IP Port Scan
    Common device of hackers
    Searching for break-in points
    Send query to every possible (within UNC domain):
      IP address
      Port Number
    Replies can indicate system weaknesses
  Internet Traffic is hard to model

22 Experience with Hurst parameter estimation
  Studied ~30 two-hour time blocks, 2002
    H estimation makes sense (~ 0.8-0.9) for many cases, i.e. FGN is a reasonable model
    But there were some very strange cases (H >> 1)
  Studied ~30 two-hour time blocks, 2003
    Traffic appears similar, using e.g. Dependent SiZer
    But H estimates much smaller (~ 0.7), across all time blocks
    Why???

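
The wavelet-spectrum approach to estimating H (slides 7, 8, and 10) can be illustrated with a rough sketch. This is not the Abry & Veitch estimator itself: it swaps in the Haar wavelet for simplicity, uses the number of coefficients at each scale as an assumed regression weight, and relies on the usual relation that the log2 energy of a stationary long-range-dependent series grows with slope 2H - 1 across scales. The function names are hypothetical. For white noise the spectrum should be roughly flat, matching the "Poisson data is flat" remark, so the estimate should land near H = 0.5.

```python
import math
import random


def haar_log_spectrum(x, max_level):
    """Return (scale j, coefficient count n_j, log2 mean squared Haar detail) per level."""
    approx = list(x)
    spectrum = []
    for j in range(1, max_level + 1):
        pairs = len(approx) // 2
        if pairs < 2:
            break
        # Orthonormal Haar step: differences give details, sums give the next approximation.
        detail = [(approx[2 * k] - approx[2 * k + 1]) / math.sqrt(2) for k in range(pairs)]
        approx = [(approx[2 * k] + approx[2 * k + 1]) / math.sqrt(2) for k in range(pairs)]
        energy = sum(d * d for d in detail) / pairs
        if energy > 0:  # skip degenerate levels where log2 is undefined
            spectrum.append((j, pairs, math.log2(energy)))
    return spectrum


def hurst_from_spectrum(spectrum):
    """Weighted least-squares slope of log2 energy on scale j, mapped to H = (slope + 1) / 2."""
    total = sum(n for _, n, _ in spectrum)
    mean_j = sum(n * j for j, n, _ in spectrum) / total
    mean_y = sum(n * y for _, n, y in spectrum) / total
    sxy = sum(n * (j - mean_j) * (y - mean_y) for j, n, y in spectrum)
    sxx = sum(n * (j - mean_j) ** 2 for j, n, _ in spectrum)
    return (sxy / sxx + 1) / 2


# Toy check on simulated white noise (an assumption standing in for bin counts).
rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(2 ** 14)]
h_white = hurst_from_spectrum(haar_log_spectrum(white_noise, max_level=8))
print(f"Estimated H for white noise: {h_white:.2f}")
```

Haar has only one vanishing moment, so this sketch would not inherit the robustness to linear trends noted on slide 8; a wavelet with more vanishing moments would be needed for that.
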
23 Wavelet Spectrum: 2003 Sat 9:30-11:30 pm

24 Explanation of Shoulder: different protocols
  Major Components of Traffic:
  Transmission Control Protocol (TCP), often ~80%
    Acknowledges packets for sure transfer
    Web browsing (HTTP), FTP, ...
  User Datagram Protocol (UDP), often ~15%
    Unacknowledged, for data streaming
    Video, music, ...

25 Wavelet Spectra: all 2003 packet, TCP vs. UDP
  Overlay all time blocks, and sub-spectra for TCP and UDP
  In 2002 TCP dominated
  Now UDP creates major hump at medium scales
    Scale ~ 1 sec

26 Explanation of UDP Bump
  Blubster - File Sharing Application
    A replacement for Napster
    Transfers big files by TCP
    Does handshaking by UDP
    Workaround for server (could be shut down)
    Huge fraction of traffic (just to stay in touch)?!?

27 Blubster sub-spectrum: 2003 Sat 9:30

28 Zoomed (conventional) SiZer View of Blubster

29 Final Blubster Oddity
  Effect shows up for packet counts
    Not for byte counts
  Reason: Blubster handshake packets are small
    Thus not significant fraction of total bytes
  Violation of conventional wisdom
    Usually byte behavior ~ packet behavior

30 Wavelet Spectrum: packet vs. byte

31 A deeper look at sampling
  Revisit Mice-Elephant Sampling
  Over wide range of scales: Random Sampling
    But not representative
  Artifact of:
    Huge Sample Size
    Very Heavy Tails
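
One possible reading of the sampling remark on slide 31 can be sketched with simulated data: when flow sizes are very heavy tailed, a modest random sample of flows tends to miss the rare elephants that carry much of the traffic. Everything here is an assumption made for illustration, including the Pareto model, the tail index, the population and sample sizes, and the helper name `top_share`; none of it comes from the UNC traces.

```python
import random

rng = random.Random(1)

ALPHA = 1.2          # hypothetical Pareto tail index (heavier tail as it approaches 1)
N_FLOWS = 200_000    # hypothetical number of flows in the full trace
SAMPLE_SIZE = 1_000  # hypothetical size of the random sample that gets plotted

# Simulated flow sizes and a simple random sample drawn without replacement.
sizes = [rng.paretovariate(ALPHA) for _ in range(N_FLOWS)]
sample = rng.sample(sizes, SAMPLE_SIZE)


def top_share(values, fraction):
    """Share of the total carried by the largest `fraction` of the values."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * fraction))
    return sum(ordered[:k]) / sum(ordered)


population_top = top_share(sizes, 0.01)
sample_top = top_share(sample, 0.01)

print(f"largest flow, population: {max(sizes):.1f}")
print(f"largest flow, sample:     {max(sample):.1f}")
print(f"share of total in top 1%, population: {population_top:.2f}")
print(f"share of total in top 1%, sample:     {sample_top:.2f}")
```

Comparing the two largest-flow values and the two top-1% shares gives a quick sense of how much of the tail a random sample of this size captures under these assumptions.
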