Study of Internet Traffic to Analyze and Predict Traffic

SITAPT: STUDY OF INTERNET TRAFFIC TO ANALYZE AND PREDICT TRAFFIC

Amit [email protected]

www.linkedin.com/in/amit-arora-539120a


WHAT DOES INTERNET TRAFFIC LOOK LIKE?

Word cloud of the mean percentage of packets contributed by various applications from 2008 to 2015

BACKGROUND

With the pervasive growth of the Internet, as Internet access becomes faster and applications move to the cloud, the profile of Internet traffic continues to change.

Peer-to-peer traffic, video sharing and OTT services, coupled with almost ubiquitous access to high-speed Internet, pose new challenges:

• Service providers: how to better utilize bandwidth?

• OEMs: how to increase bits per second and packets per second through the equipment?

A key to understanding and solving these challenges is to

• understand what constitutes Internet traffic,

• understand what Internet traffic will look like in the coming years, and

• optimize networks and infrastructure to better utilize available resources.

This is what this project aims to address: understanding Internet traffic from various perspectives (application, protocol, packet size and others).

This understanding can then feed into network and infrastructure design.

A data product named SITAPT (Study of Internet Traffic to Analyze and Predict Traffic) is built which addresses the above requirements.

TARGET AUDIENCE

• Network OEMs

• Service providers

SCOPE OF SITAPT

Visualization of Traffic Data

• Hundreds of applications (web, secure web, file transfer, etc.)

• Tens of protocols (TCP, UDP, ESP, GRE etc).

• Various packet sizes

Time Series Analysis for Traffic Prediction

• Multivariate time series analysis to predict traffic for various applications and protocols over the next 12 months.

• Identify trends in key and upcoming applications/protocols

Clustering to Explore Similarity

• Use machine learning algorithms to identify possible clusters of similarity between traffic patterns across multiple years.

Relationship between traffic types

• Identify and model the relationship between key protocols

DATA SCIENCE PIPELINE

SITAPT itself is implemented entirely in Python (version 2.7.11), although it relies heavily on Python packages such as NumPy that may be written in other programming languages for speed. The entire code for SITAPT is available in the SITAPT repo on GitHub.

Obtain anonymized internet traces from CAIDA. The traces are available in pcap format and contain IP and Transport layer headers only.

Read the packet trace and convert information from each packet into a JSON which is then stored in MongoDB. Remove all data that is not IP.

Find out the traffic mix by protocol, application, packet size and other criteria. Convert the results into a pandas DataFrame, which is then stored back in MongoDB.

· Model traffic as a multivariate time series. Use time series analysis techniques to forecast traffic for various applications and protocols.

· Use unsupervised machine learning (clustering) to identify similarity in traffic across the dataset.

· Identify and model relationship between key protocols

Create visualizations for understanding the data: represent the time series, describe trends, and explore clustering and regression.

https://github.com/georgetown-analytics/sitapt
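The "convert information from each packet into a JSON" step of the pipeline can be illustrated with a minimal sketch. It is not the SITAPT code: the function name, the field names and the hand-built sample header are assumptions made for illustration, only IPv4 is handled, and the MongoDB insert is left out. It uses only the Python standard library and is written for Python 3.

```python
import json
import socket
import struct

# IANA protocol numbers for the protocols named in this deck.
PROTOCOL_NAMES = {6: "TCP", 17: "UDP", 47: "GRE", 50: "ESP"}


def ipv4_header_to_doc(raw):
    """Turn the first 20 bytes of an IPv4 header into a JSON-ready dict."""
    (ver_ihl, _tos, total_len, _ident, _flags_frag,
     ttl, proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    if ver_ihl >> 4 != 4:
        return None  # not IPv4: skipped in this sketch
    return {
        "length": total_len,
        "ttl": ttl,
        "protocol": PROTOCOL_NAMES.get(proto, str(proto)),
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }


# Hand-built header for illustration: a 1500-byte TCP packet, 10.0.0.1 -> 10.0.0.2.
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 1500, 0, 0, 64, 6, 0,
                     socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
doc = ipv4_header_to_doc(sample)
print(json.dumps(doc, sort_keys=True))
```

A dict of this shape is what a document database client would accept as one record per packet.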


ARCHITECTURE

Application Layer

Anonymized Internet traces

• Internet trace from CAIDA

• Internet trace from a Service Provider

Ingestion

• Ingestion module (BeautifulSoup4, wget, gzip, pycapfile)

• Offline (non-realtime), authenticated ingestion

WORM / ETL

• NoSQL - MongoDB

• Immutable data store

• ETL = Extract only IP packets, Transform to JSON, Load in data store

Computation & Modeling

• Models: AR(I)MA, PCA, KMeans, Linear Regression

• Packages: Statsmodels, SciKit Learn

• Train model on available data sets

Wall of interpretation

• Data from models made available in various formats

A WORD ABOUT DATA USED IN SITAPT

CAIDA (Center for Applied Internet Data Analysis, http://www.caida.org/) maintains a large amount of data that can be used for analyzing and understanding Internet traffic. Anonymized Internet traces from 2008 to 2015 are available upon request from CAIDA (see http://www.caida.org/data/passive/passive_2015_dataset.xml). These traces form the dataset used by SITAPT.

All traces in this dataset are anonymized (by CAIDA itself) with the same key. In addition, the payload has been removed from all packets (again by CAIDA itself).


A WORD ABOUT DATA USED IN SITAPT (CONTD..)

How much data does SITAPT use?

CAIDA provides around 13,000 files (in compressed format) for the period from 2008 to 2015. The combined size of the uncompressed files, stored in a database, would run into several tens of terabytes. The current version of SITAPT is not a *big data* product.

Clearly, analyzing this amount of data requires horizontal scaling which is outside the scope of the current project. To reduce the problem to a more manageable level, SITAPT works with one file for every month of every year from 2008 to 2015 (for 2014 and 2015 CAIDA provides one file per quarter).

In total, SITAPT analyses 73 packet trace files from 2008 to 2015. Each trace contains millions of packets. The size of the database that stores the JSON representation of the trace files is more than 1TB.

DATA TRANSFORMATION FOR COMPUTATION

Ingested data is stored in Mongo collections (one for each year). This data needs to be transformed into matrix form to make it amenable to computation.

Once created, the three collections (applications, protocols and packet size distribution) are also stored as CSV files, so that the modeling phase does not have to interact with the database at all and can read the data it needs directly from the CSV files.

DATA TRANSFORMATION FOR COMPUTATION (CONTD..)

These files represent time series data such that each row is a parameterized representation of traffic, expressed as a combination of applications, protocols or packet size intervals.

Date       sun-sr-iiop  transmit-port  ieee-mms-ssl  passgo    joaJewelSuite  ovhpas    sdo-tls   interwise  lm-instmgr
3/19/2008  4.36E-06     3.49E-05       0             8.72E-06  0              0.000798  0.000959  0.001051   0.000606
4/30/2008  9.00E-06     9.00E-06       9.00E-06      9.00E-06  9.00E-06       9.00E-06  9.00E-06  9.00E-06   9.00E-06
5/15/2008  0            0              0             0         0              3.96E-05  0.001187  0.004413   0
6/19/2008  0            0              0             0         0              0.000368  0.000403  0.000266   0
7/17/2008  0.000744     6.20E-05       0             0         0              0.003299  0.002828  1.24E-05   2.48E-05

Rows: time. Columns: applications.

Read row-wise: the composition of Internet traffic, as the percentage of packets contributed by each application in a trace captured on a particular day. Read column-wise: the packet percentage of each application as a time series.
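A minimal sketch of how such a matrix could be built, assuming each packet has already been labeled with an application name. The function name, the toy labels and the in-memory CSV are illustrative assumptions, not the SITAPT code (which works from MongoDB collections and pandas).

```python
import csv
import io
from collections import Counter


def traffic_mix_rows(packets_by_date):
    """Build one row per trace: the date plus each application's packet percentage."""
    apps = sorted({app for pkts in packets_by_date.values() for app in pkts})
    rows = []
    for date, pkts in packets_by_date.items():
        counts = Counter(pkts)
        total = float(len(pkts))
        row = {"Date": date}
        for app in apps:
            row[app] = 100.0 * counts[app] / total
        rows.append(row)
    return ["Date"] + apps, rows


# Made-up application labels per packet for two traces (illustration only).
toy = {
    "3/19/2008": ["http", "http", "https", "dns"],
    "4/30/2008": ["http", "https", "https", "https"],
}
header, rows = traffic_mix_rows(toy)

# Write the matrix as CSV so a modeling step could read it without a database.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=header)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Each row sums to 100, and an application that is absent from a trace gets a 0 in that row, which matches the many zero entries in the excerpt.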

VISUALIZATIONS

Four different types of visualizations are explored:

• Word cloud (for applications and protocols)

• Stacked chart (for applications, protocols and packet size distribution)

• Heat map (for packet size distribution)

• Parallel coordinates (for protocols)

VISUALIZATIONS: STACKED CHARTS

Observations:

• Almost exponential increase in HTTPS traffic.

• Decrease in unclassified (unknown) traffic.

• Logarithmic decay in applications that contribute less than 0.5% of packets individually.

VISUALIZATIONS: HEAT MAP

Observations: Packet size distribution is almost entirely bimodal, with only the 1 to 100 byte range and the 1400 to 1500 byte range showing packet percentages of any significance. Only two rows of the heat map show dark colors (which represent a significant packet percentage): the 1 to 100 row and the 1400 to 1500 row.
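The rows of such a heat map come from binning packet lengths into fixed-width intervals. The following is a minimal sketch under assumptions: the function name, the exact interval boundaries (1-100, 101-200, ...) and the sample lengths are illustrative and not taken from SITAPT.

```python
from collections import Counter


def size_distribution(sizes, width=100):
    """Percentage of packets per size interval: 1-100, 101-200, and so on."""
    counts = Counter((s - 1) // width for s in sizes)
    total = float(len(sizes))
    return {"%d-%d" % (b * width + 1, (b + 1) * width): 100.0 * n / total
            for b, n in sorted(counts.items())}


# Made-up packet lengths in bytes (illustration only).
dist = size_distribution([40, 52, 64, 1500, 1500, 1480, 576, 40])
print(dist)
```

With these toy lengths most of the mass falls in the smallest and the largest interval, which is the bimodal shape described above.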

VISUALIZATIONS: PARALLEL COORDINATES

Parallel coordinates chart for protocols contributing more than 0.01% of overall traffic. This chart clearly shows a negative correlation between TCP and UDP protocol traffic.


TIME SERIES ANALYSIS

Each time series is studied and analyzed individually. The following operations are performed on each time series:

• Plot the first-difference series to identify trends.

• Evaluate the ACF and PACF to identify dependencies of the series on previous time samples.

• Perform seasonal decomposition to identify trend, seasonality and residuals.

• Fit ARMA and ARIMA models to the time series.

Modeling is done using the “statsmodels” package. The output of each of the above steps is available as part of the SITAPT analysis.
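To make the first two operations concrete, here is a dependency-free sketch of first differencing and of the usual sample autocorrelation estimator. It is illustrative only: SITAPT uses the statsmodels package for this, and the function names and the toy series are assumptions.

```python
def first_difference(series):
    """Return x[t] - x[t-1] for every consecutive pair of samples."""
    return [b - a for a, b in zip(series, series[1:])]


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / float(n)
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


# Made-up monthly percentages with an accelerating upward trend (illustration only).
toy = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
diff = first_difference(toy)
print(diff)
print(acf(diff, 2))
```

The differenced toy series is still increasing, so a second difference would be needed to make it flat; this is the kind of observation the first-difference plot is meant to support.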

TIME SERIES MODELING FOR TCP PROTOCOL


TIME SERIES MODELING FOR MULTIPLE PROTOCOLS AND APPLICATIONS

Forecasted traffic mix

TRENDS IN SOME IMPORTANT APPLICATIONS

• Almost exponential increase in HTTPS traffic.

• HTTP traffic is decreasing but still contributes a significant percentage.

CLUSTERING

To explore whether there are any patterns hidden in the Internet traffic data, a clustering technique is employed.

• Each protocol, application or packet size interval is treated as a feature and each trace is treated as an instance.

• Clustering is done in two steps:

  1. Dimensionality reduction via PCA (Principal Component Analysis). For applications, PCA reduces 5000+ dimensions to 10.

  2. Clustering via KMeans, with K = 4.

• PCA and KMeans are both done using the scikit-learn API.
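The KMeans step can be illustrated with a small, dependency-free version of Lloyd's algorithm. This is a sketch, not the scikit-learn implementation that SITAPT uses: the PCA step is omitted, centroids are seeded from the first k points rather than chosen at random, k is 2 instead of 4, and the trace values are made up.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / float(len(members))
                                for col in zip(*members)]
    return labels, centroids


# Rows are traces, columns are (TCP %, UDP %); values are made up for illustration.
traces = [(91.0, 8.2), (85.0, 14.0), (95.2, 3.8), (84.1, 15.2), (93.1, 6.1)]
labels, centroids = kmeans(traces, k=2)
print(labels)
```

Each trace is one instance and each column one feature, as described above; the returned label per trace plays the role of the cluster field added to the generated CSV files.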

CLUSTERING (CONTD..)

Some clustering is present in the applications and protocols data, but not so much in the packet size distribution data (which may need a higher K).

CLUSTERING

The following is an excerpt from the generated CSV file for protocols showing the additional fields added, including the label field provided by the clustering algorithm.

The table is filtered on cluster (label) type 2, and it can be seen that traces that have a higher than usual TCP traffic percentage (90 to 95%) are clustered together.

Date        Year  Half  Quarter  Fortnight  DayOfTheWeek  cluster  TCP          ESP          UDP
5/17/2012   2012  1     2        2          Thursday      2        91.01197292  0.190882127  8.219679951
7/19/2012   2012  2     3        2          Thursday      2        91.75760765  0.084797929  7.588290657
9/20/2012   2012  2     3        2          Thursday      2        93.10399212  0.024899575  6.125210312
10/18/2012  2012  2     4        2          Thursday      2        90.43387514  0.492682745  8.564568542
11/15/2012  2012  2     4        1          Thursday      2        95.2341703   0.262422627  3.843731031
12/20/2012  2012  2     4        2          Thursday      2        91.12739062  0.035976644  8.60410183
3/21/2013   2013  1     1        2          Thursday      2        94.65616355  0.019147528  5.206672769
6/20/2013   2013  1     2        2          Thursday      2        90.5349465   0.061648601  9.195006053
9/18/2014   2014  2     3        2          Thursday      2        90.60646827  0.020378309  8.885648934

It is a matter of further analysis to figure out what event or phenomenon caused the Internet traffic at different times between 2008 and 2015 to be similar.

If this study were being done on traffic from a closed network (such as a single ISP) then it would be much easier to attribute this clustering to real-world events (for example, an OS update for mobile phones).

LINEAR REGRESSION

Parallel coordinates showed a negative correlation between the percentage of TCP traffic and the percentage of UDP traffic. A scatter plot of TCP vs. UDP is created and a linear regression model is used to fit a straight line through it. The coefficients vector is [ -1.00805723] and the variance score is 0.96.
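The fit can be sketched without any library by writing out ordinary least squares for one predictor. The sketch below assumes UDP is regressed on TCP and that the variance score refers to R²; it uses only the nine cluster-2 traces from the protocols excerpt, so the numbers it prints are not expected to match the values above, which come from the full dataset.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept, plus R^2."""
    n = float(len(x))
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# TCP and UDP percentages from the nine cluster-2 traces in the protocols excerpt.
tcp = [91.01197292, 91.75760765, 93.10399212, 90.43387514, 95.2341703,
       91.12739062, 94.65616355, 90.5349465, 90.60646827]
udp = [8.219679951, 7.588290657, 6.125210312, 8.564568542, 3.843731031,
       8.60410183, 5.206672769, 9.195006053, 8.885648934]
slope, intercept, r2 = fit_line(tcp, udp)
print(slope, intercept, r2)
```

A slope near -1 means that each percentage point gained by TCP is lost almost entirely by UDP, which is what one would expect when the two protocols together account for nearly all packets.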

WHAT WORKED?

The fact that all the packet traces are now available in a document database means that the data is in a consumable format. This opens up avenues for further analysis and for asking different types of questions of the data.

The time series analysis revealed interesting trends in the data, such as an almost exponential increase in secure HTTP traffic, which was expected. At the same time, there is no large decrease in non-secure HTTP traffic, which was somewhat unexpected.

Various types of visualization techniques (like parallel coordinates) and tools like Bokeh provide a really good insight into the data.

WHAT DID NOT WORK?

With the amount of data involved, this is clearly a Big Data project. Since that could not be completed in a short time, the alternative was to use one trace file for each month, which reduced the number of data points available for analysis (only 73 data points). This limited the predictive ability of the time series models: not all applications and protocols could be modeled within the 95% confidence interval and with a MAPE of < 5%.

This data would provide many more insights if it corresponded to traffic from a closed network rather than the Internet, for example an ISP’s network limited to certain geographical areas, because the data would then have less variability and the clustering would be easier to explain.

For the time series models, only the MAPE was considered when choosing between the AR(I)MA models. Other criteria, such as the Durbin-Watson statistic, the BIC and the HQIC, should have been explored but were not.
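For reference, MAPE (mean absolute percentage error) is the average of |actual - forecast| / |actual|, expressed in percent. A minimal sketch follows; the function name and the sample values are made up for illustration and are not SITAPT results.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Made-up TCP percentages and forecasts (illustration only).
actual = [90.0, 92.0, 95.0]
forecast = [91.8, 92.0, 93.1]
score = mape(actual, forecast)
print(round(score, 2), score < 5.0)
```

Because the error is divided by the actual value, MAPE is undefined for series that contain zeros, which matters here since many application columns are zero in some traces.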

CONCLUSION

SITAPT provides valuable insights into network traffic composition and trends.

• In terms of applications, there is an exponential growth trend in HTTPS traffic, a trend that is visible even at a macro level (generic Internet packet trace).

• The time series analysis is able to provide predictions for applications and protocols.

• In terms of packet sizes, there is a bimodal distribution.

• Clustering reveals patterns in both applications and protocols.
