AI-BASED TCP PERFORMANCE MODELLING K. Mahmoud M.Sc. September 2012


  • AI-BASED TCP PERFORMANCE MODELLING

    K. Mahmoud

    M.Sc. September 2012

  • AI-BASED TCP PERFORMANCE MODELLING

    A thesis submitted to the University of Plymouth in partial fulfillment of the requirements for the degree of

    Master of Science

    Project Supervisor: Dr Bogdan Ghita

    Karim Mahmoud

    September 2012

    School of Computing and Mathematics

    Faculty of Science and Technology

    University of Plymouth, UK

  • AI-Based TCP Performance Modelling

    Karim Mahmoud

    Abstract

    Different mathematical models exist for modelling TCP algorithms and the interrelations between TCP and network parameters. In this research, an artificial neural network modelling approach was considered in order to model TCP performance, represented by the transmission time needed to transfer the data payload within TCP flows. Two models were developed, one for lossless and one for lossy TCP connections. A baseline was defined by a mathematical model in order to compare the accuracy obtained in estimating the transmission time, in terms of regression between actual and estimated values and in terms of the cumulative distribution of relative error.

    The neural models initially gave better results than the mathematical model for the same conditions and datasets. Manual analysis of poorly estimated samples revealed the presence of additional prolonged idle periods within flows, which were not accounted for in the mathematical model and were not sufficiently estimated by the neural models.

    The effect of idle time on modelling accuracy was thoroughly investigated, to study how it affects the reinitialisation of the congestion window and how different TCP implementations deal with idle time when resuming transmission. Other filtering criteria were applied to the traffic to exclude statistical outliers and non-standard TCP connections. This improved the results for both models. Nevertheless, the neural network modelling approach outperformed the mathematical modelling of TCP throughput at all stages of this research. Finally, it was suggested that the available mathematical model be revised to take idle time into consideration.

  • Declaration

    This is to certify that the candidate, Karim Mahmoud, carried out the work submitted herewith.

    Candidate's Signature:

    Karim Mahmoud . . . . . . . . . . . . . . . . . . . . . . . . . . . Date: 30/09/2012

    Supervisor's Signature:

    Dr Bogdan Ghita . . . . . . . . . . . . . . . . . . . . . . . . . . . Date: 30/09/2012

    Second Supervisor's Signature:

    Dr David Lancaster . . . . . . . . . . . . . . . . . . . . . . . . . . . Date: 30/09/2012

    Copyright & Legal Notice

    This copy of the dissertation has been supplied on the condition that anyone who consults it is understood to recognize that its copyright rests with its author and that no part of this dissertation and information derived from it may be published without the author's prior written consent.

    The names of actual companies and products mentioned throughout this dissertation are trademarks or registered trademarks of their respective owners.


  • Acknowledgements

    I wish to express my deep and sincere appreciation to Dr Bogdan Ghita for his guidance, assistance, patience, and usual constructive feedback. Working under his supervision has been inspiring and has developed a deeper confidence in my research and intellectual abilities. I feel privileged to have been one of Dr Ghita's students.

    I also wish to express my gratitude to all the teaching staff at the School of Computing and Mathematics at Plymouth University for a wonderful learning experience.

    To all those who supported my decision to pursue a master's degree: Dr Mahmoud Khalil at Ain Shams University, Rami Mohamed and Walid Refaat at Orange Business Services.

    To my loving parents and sister for their persistent encouragement and support.


  • Table of Contents

    1 Introduction
        1.1 Project Aim and Objectives
        1.2 Thesis Structure

    2 Literature Review
        2.1 Background
        2.2 TCP Protocol
            2.2.1 TCP Transition States
                2.2.1.1 Connection Establishment
                2.2.1.2 Data Transfer
                2.2.1.3 Connection Termination
            2.2.2 TCP Flow Control
                2.2.2.1 Sliding Window Protocol
            2.2.3 TCP Congestion Control
                2.2.3.1 Slow Start
                2.2.3.2 Congestion Avoidance
                2.2.3.3 Retransmission Timeout, Fast Retransmit, and Fast Recovery
                2.2.3.4 Fast Retransmit
                2.2.3.5 Fast Recovery
            2.2.4 Idle Time Considerations
            2.2.5 TCP Timers
        2.3 Formula-Based Modelling
            2.3.1 Cardwell Mathematical Model
        2.4 Previous Research and Machine Learning Approaches
            2.4.1 Performance Estimation
            2.4.2 Performance Prediction
            2.4.3 History-Based Models
            2.4.4 Artificial Neural Networks
        2.5 Summary

    3 Research Methodology
        3.1 Data Acquisition
        3.2 Data Pre-processing
            3.2.1 TCPTRACE
            3.2.2 Data Processing in MATLAB
        3.3 Neural Network Modelling in MATLAB
        3.4 Statistical Analysis in MATLAB
            3.4.1 Regression
            3.4.2 MSE
            3.4.3 Absolute Relative Error
        3.5 Base Line for Analysing Model Accuracy
        3.6 Summary

    4 Data Pre-processing and Traffic Analysis
        4.1 Types of Traffic
        4.2 Extracting TCP Parameters
        4.3 TCP Parameters Pre-processing
            4.3.1 Identifying Valid TCP Flows
            4.3.2 Selection of Forward Direction
            4.3.3 Classification of Lossless and Lossy Flows
            4.3.4 Computing the Mathematical Throughput Estimate
            4.3.5 Normalisation of TCP Parameters
        4.4 Statistical Distribution of TCP Parameters
            4.4.1 Throughput
            4.4.2 Data Transmitted
            4.4.3 Initial Window Size
            4.4.4 Maximum Segment Size
            4.4.5 Data Transmission Time
            4.4.6 Average RTT
            4.4.7 Maximum Idle Time
        4.5 Summary

    5 Neural Network Modelling
        5.1 Backpropagation Feed Forward Neural Networks
        5.2 Backpropagation Neural Network Parameters
            5.2.1 Initialization of Weights
            5.2.2 Initialization of Bias
            5.2.3 Learning Rate
            5.2.4 Momentum
            5.2.5 Hidden Layers and Nodes
            5.2.6 Number of Samples
            5.2.7 Stopping Criteria
        5.3 Neural Network Model Structure
            5.3.1 Lossless Model
            5.3.2 Lossy Model
        5.4 Summary

    6 Results and Analysis
        6.1 Results from the Combined Dataset (UNIBS-2009 and MAWI)
            6.1.1 Considering All Valid TCP Connections
                6.1.1.1 Results for the Lossless Dataset
                6.1.1.2 Results for the Lossy Dataset
            6.1.2 Idle Time Investigation
            6.1.3 Filtering TCP Connections with High Relative Idle Time
                6.1.3.1 Results for the Lossless Dataset
                6.1.3.2 Results for the Lossy Dataset
            6.1.4 Investigation of Non-Standard Flows
            6.1.5 Filtering Non-Standard Flows
                6.1.5.1 Results for the Lossless Dataset
                6.1.5.2 Results for the Lossy Dataset
            6.1.6 Throughput and Estimation Error
        6.2 Manual Analysis of Connection with Poorly Estimated Throughput
        6.3 Results from the Plymouth University Campus Dataset
        6.4 Summary

    7 Conclusions and Future Research Directions
        7.1 Conclusions
        7.2 Research Limitations
        7.3 Direction of Future Research

    8 References

    A Data Sources
        A.1 UNIBS
        A.2 MAWI

    B Results Using the Dataset from Plymouth University Campus
        B.1 Considering All Valid TCP Connections
            B.1.1 Results for the Lossless Dataset
            B.1.2 Results for the Lossy Dataset
        B.2 Filtering TCP Connections with High Relative Idle Time and Non-Standards TCP Flows
            B.2.1 Results for the Lossless Dataset
            B.2.2 Results for the Lossy Dataset

    C MATLAB Scripts
        C.1 Cardwell Mathematical Model Implementation
        C.2 Neural Network Modelling

  • List of Tables

    4.1 TCP parameters of interest as collected by tcptrace.
    4.2 Number of valid TCP flow for both lossless and lossy subsets.
    4.3 Mean values of TCP parameters evaluated for the three datasets used.

    5.1 Stopping criteria used for the neural network during learning process.
    5.2 Neural network structure and input parameters for the lossless model.
    5.3 Neural network structure and input parameters for the lossy model.

    6.1 MSE and regression results post filtering samples with high maximum idle time to average RTT (lossless combined dataset).
    6.2 MSE and regression results post filtering samples with high maximum idle time to average RTT (lossy combined dataset).
    6.3 Neural network results obtained post filtering non-standard flows from the lossless subset.
    6.4 Mathematical results obtained post filtering different non-standard flows from the lossless subset.
    6.5 Neural network results obtained post filtering different non-standard flows from the lossy subset.
    6.6 Mathematical results obtained post filtering different non-standard flows from the lossy subset.
    6.7 Accuracy results for the lossless dataset of Plymouth University.
    6.8 Accuracy results for the lossy dataset of Plymouth University.

    A.1 Composition of the UNIBS 2009 trace (UNIBS: Data sharing, 2011).
    A.2 Composition of the MAWI traces (UNIBS: Data sharing, 2011).

  • List of Figures

    2.1 TCP state transition diagram for both client and server.
    2.2 Timeline of TCP connection establishment and termination.
    2.3 Slow start and congestion avoidance sending patterns.
    2.4 Slow start and congestion avoidance, as implemented for TCP Tahoe and TCP Reno.

    3.1 Process diagram of research stages.
    3.2 Regression analysis showing regression fitting line and residual values.

    4.1 Percentages of both lossless and lossy TCP connections within the network traffic captured.
    4.2 Cumulative distribution of throughput.
    4.3 Cumulative distribution of data transmitted.
    4.4 Cumulative distribution of initial window bytes.
    4.5 Cumulative distribution of initial window packets.
    4.6 Cumulative distribution of MSS.
    4.7 Cumulative distribution of data transmission time.
    4.8 Cumulative distribution of RTT.
    4.9 Cumulative distribution of maximum idle time.
    4.10 Box-and-whisker diagrams of TCP time parameters (UNIBS Traffic)
    4.11 Box-and-whisker diagrams of TCP time parameters (MAWI Traffic)
    4.12 Box-and-whisker diagrams of TCP time parameters (Plymouth University Traffic)

    5.1 Simplified neural network structure
    5.2 Computations at a single neural perceptron.
    5.3 Backpropagation of error signal to update neural network weights.
    5.4 MSE performance measures for learning, validating, and testing subsets.
    5.5 Neural network model developed for lossless TCP traffic.
    5.6 Neural network model developed for lossy TCP traffic.

    6.1 Regression obtained for lossless connections for the combined dataset using both mathematical and neural network model.
    6.2 CDF of absolute relative error for lossless connections (combined dataset).
    6.3 Regression obtained for lossy connections for the combined dataset using both mathematical and neural network model.
    6.4 CDF of absolute relative error for lossless connections for the combined dataset.
    6.5 Time-sequence graph for a TCP connection with relatively high idle time.
    6.6 Regression obtained for lossless connections for the combined dataset using both mathematical and neural network model, after filtering connections with maximum idle time larger than twice the average RTT.
    6.7 CDF of absolute relative error for lossless connections for the combined dataset, after filtering connections with maximum idle time larger than twice the average RTT.
    6.8 Regression obtained for lossy connections for the combined dataset using both mathematical and neural network model, after filtering connections with maximum idle time larger than twice the average RTT.
    6.9 CDF of absolute relative error for lossy connections for the combined dataset, after filtering connections with maximum idle time larger than twice the average RTT.
    6.10 Regression obtained for lossless connections for the combined dataset using both mathematical and neural network model, post filtering various non-standard flows.
    6.11 CDF of absolute relative error for lossless connections for the combined dataset, prior and post filtering various non-standard flows.
    6.12 Regression obtained for lossy connections for the combined dataset using both mathematical and neural network model, post filtering various non-standards flows.
    6.13 CDF of absolute relative error for lossy connections for the combined dataset, prior and post filtering various non-standards flows.
    6.14 Scatter plot of actual throughput and corresponding relative error of estimated throughput, for lossless connections of the combined dataset, prior any filtering.
    6.15 Scatter plot of actual throughput and corresponding relative error of estimated throughput, for lossless connections of the combined dataset, after filtering connections with high idle time to RTT ratio.
    6.16 Trace 1
    6.17 Trace 2

    B.1 Regression obtained for lossless connections for the Plymouth dataset using both mathematical and neural network model, prior any filtering.
    B.2 CDF of absolute relative error for lossless connections for the Plymouth dataset, prior any filtering.
    B.3 Regression obtained for lossy connections for the Plymouth dataset using both mathematical and neural network model, prior any filtering.
    B.4 CDF of absolute relative error for lossless connections for the Plymouth dataset, prior any filtering.
    B.5 Regression obtained for lossless connections for the Plymouth dataset using both mathematical and neural network model, after filtering all non-standards TCP flows and connections with high relative idle time.
    B.6 CDF of absolute relative error for lossless connections for the Plymouth dataset, after filtering all non-standards TCP flows and connections with high relative idle time.
    B.7 Regression obtained for lossy connections for the Plymouth dataset using both mathematical and neural network model, after filtering all non-standards TCP flows and connections with high relative idle time.
    B.8 CDF of absolute relative error for lossless connections for the Plymouth dataset, after filtering all non-standards TCP flows and connections with high relative idle time.

  • Acronyms and Abbreviations

    ACK Acknowledgement Segment

    AI Artificial Intelligence

    CDF Cumulative Distribution Function

    CWND Congestion Window

    FIN Finish Segment

    IP Internet Protocol

    IW Initial Congestion Window

    MSE Mean Squared Error

    MSS Maximum Segment Size

    RFC Request for Comments

    RST Reset Segment

    RTO Retransmission Timeout

    RTT Round Trip Time

    RTTVAR Round Trip Time Variation

    RWND Receiving Window

    SMSS Sender Maximum Segment Size

    SVR Support Vector Regression

    SYN Synchronisation Segment

    TCP Transmission Control Protocol


  • 1 Introduction

    Internet traffic is dominated by the Transmission Control Protocol (TCP), which carries about 90 percent of exchanged traffic (Shah et al., 2007). Because of this central role, the performance of TCP directly affects the general performance of IP networks and the Internet. Hence the need for realistic and efficient performance modelling of TCP, and for relationships that link the protocol's performance to TCP parameters and to network conditions when transferring traffic. Many traditional mathematical models have been developed to model the behaviour of TCP; despite their complexity, the accuracy they achieve does not always hold for short-lived TCP connections (Ghita and Furnell, 2008). Artificial intelligence models such as neural networks have been used in several previous studies to obtain more accurate performance models of TCP connections, not only for the steady-state period of a TCP connection but also for short-lived connections, where slow start has a primary effect on the connection's performance. The motivation for this project was to explore how effectively artificial neural networks can be used for TCP throughput estimation, what level of accuracy such a model can reach, and whether further improvements can be made over previous approaches.

    1.1 Project Aim and Objectives

    The goal of the project is to develop a robust model, using an artificial neural network in MATLAB, that can accurately estimate TCP throughput for a wide range of TCP transfers with different network path characteristics. The research attempts to extend and contribute to previous research approaches; where no further improvement can be reached, it is essential to understand the challenges and obstacles that prevent such an ideal estimation model from being achieved. The objectives of this project are as follows:


    1. Obtain a full understanding of the TCP transport protocol, its operation, and its different network algorithms.

    2. Gain a thorough overview of the available mathematical models used for TCP performance (i.e. throughput) evaluation, and analyse their efficiency against the actual observed TCP performance.

    3. Analyse traffic traces from various live networks, initially performing statistical analysis on these traces in order to understand the nature of the traffic, the distribution of the TCP parameters collected, and their significance to TCP throughput.

    4. Based on the traffic analysis, select suitable TCP parameters to be considered as inputs for the neural network model.

    5. Develop an artificial neural network in MATLAB that models TCP performance with regard to the selected TCP parameters.

    6. Evaluate the efficiency of the developed neural network when modelling TCP connections with various characteristics, such as short-lived connections, where the slow start congestion control strategy consumes a large part of the connection time; consider filtering statistical and behavioural outliers, and observe the effect of these exclusions on the estimation accuracy.

    7. Study the effect of packet loss on TCP performance, and how the developed neural network model reacts to such network impairment.

    8. Identify any other relationships or patterns between network parameters and TCP throughput, and how these relationships can be represented in a model.

    9. Evaluate how suitable neural networks are for modelling TCP performance, and under which conditions they perform better or worse.

    1.2 Thesis Structure

    The remainder of the thesis is structured as follows:

    Chapter 2 gives an overview of the theoretical fields of study addressed in this research, as well as a background on related research and previously published findings. It starts with a review of the TCP standards and an explanation of the different stages within the lifespan of a typical TCP connection, taking into consideration different scenarios and network conditions. The algorithms associated with each stage of a TCP connection are also briefly described, as documented in different RFCs (Request for Comments). It then reviews different approaches to modelling TCP, such as mathematical models and history-based models. A short introduction to artificial neural networks is included, as they are the main approach taken to develop a history-based model in this research.

    Chapter 3 gives a breakdown of the methodology and approaches considered for carrying out this research, as well as the data acquisition approach and a description of the traffic traces collected.

    Chapter 4 describes the pre-processing stages that were applied to the collected datasets, and presents the fundamental statistical and trend analysis performed on these datasets in order to understand the nature of the TCP connections used for modelling and to recognise the characteristics of the different TCP parameters considered for modelling TCP performance.

    Chapter 5 describes the artificial neural networks developed, their structure, and the selected training and validation criteria as implemented in MATLAB, and explains the backpropagation neural network as the technique chosen for AI-based modelling.

    Chapter 6 presents the results obtained while developing the neural network models in MATLAB, explains the results and findings of each step taken, and shows how different TCP parameters had varying effects on the estimates obtained. It also describes the various filtering conditions that were used in order to reach satisfactory results, while analysing and explaining the significance of these improved results.

    Chapter 7 draws conclusions from the analysis performed using neural network modelling, and explains how this analysis suggests the consideration of new TCP parameters when modelling TCP performance. The chapter then sheds light on some limitations encountered in this research, and provides some future research directions.


  • 2 Literature Review

    This chapter starts with a brief review of the TCP protocol and the various TCP algorithms associated with the different stages and conditions encountered by TCP connections. The chapter then gives an overview of the different techniques used for modelling TCP throughput, such as formula-based and history-based models. Finally, previous research that has used AI-based methods, particularly artificial neural networks, is analysed by comparing the methodological approaches and presenting the results obtained by each.

    2.1 Background

    When evaluating the performance of a network path, we are mostly concerned with the performance of TCP connections, since they represent the majority of the overall traffic in a network. Among the different types of traffic, bulk TCP transfers that last for more than a few seconds can be considered of greater importance, and are more suitable for TCP throughput prediction than short-lived TCP connections, which are highly affected by the slow-start congestion control mechanism (He et al., 2007). However, it remains essential to have robust models that can also estimate the performance of short-lived transfers efficiently.

    The importance of quality and performance provisioning has been rising, hence the need to develop models that replicate network protocols such as TCP in order to estimate or predict the performance of data transfers (Ghita et al., 2005). Such models can improve our understanding of the behaviour of Internet traffic and of the interrelations between various TCP and network parameters. Additionally, these performance predictions have several applications, such as the dynamic selection of the best path for a particular data transfer between end-hosts where multiple paths are available (for example with distributed content and multi-homed networks), or where mirrored resources and server selection are an option in a grid network architecture (Mirza et al., 2010).


    2.2 TCP Protocol

    The following sections provide an overview of the well-known TCP algorithms that are used over the course of TCP connections and are of particular interest in the context of this research.

    2.2.1 TCP Transition States

    The states through which a TCP connection passes can be summarised in three main stages, as described in RFC 793 (Postel, 1981): connection establishment, data transfer, and connection termination. The transition from one state to another is accomplished by the exchange of a specific sequence of segments. These states, transitions, and segments are shown in Figure 2.1 and are elaborated in the following sections.

    [Diagram not reproduced. States shown: Closed, Listen, SYN Received, SYN Sent, Connection Established (Data Transfer State), FIN_WAIT_1, FIN_WAIT_2, TIME_WAIT, CLOSE_WAIT, LAST_ACK. The Active Close and Passive Close paths are marked, and each transition is labelled with the segment received and the segment sent.]

    Figure 2.1: TCP state transition diagram for both client and server.


    2.2.1.1 Connection Establishment

    As demonstrated in Figure 2.2, the TCP protocol uses a three-way handshake mechanism in order to establish a connection. In this mechanism, the client initiates the handshake by sending a SYN segment to the server, specifying the port number on which a connection is needed and its starting sequence number. The server responds to this request by sending a similar SYN segment with its own starting sequence number, and by acknowledging the sequence number sent by the client. Finally, the client responds to the server, acknowledging its sequence number. At this point a TCP connection is established, and both client and server transition to the data transfer state (Stevens, 1993).

    [Figure 2.2: timeline of a TCP connection establishment and termination between 146.15.88.126 (port 1513) and 164.133.140.237 (port 80), excluding data transfer segments.]

    Time (s)   Flags      Seq          Ack
    0.000      SYN        1337909654
    0.010      SYN, ACK   289962769    1337909655
    6.973      FIN, ACK   290028713    1337910152
    7.532      FIN, ACK   1337910152   290028714
    7.541      ACK        290028714    1337910153


    four segments to be fully terminated. Some applications may need to keep the TCP connection in a half-closed state, which justifies the need for two segments in each direction: one direction can be closed while the other is kept open, or the connection can simply be fully terminated. The side that initiates the termination of a TCP connection (usually the client) is said to perform an active close. The client sends a FIN segment to the server and enters the FIN_WAIT_1 state, waiting for an ACK and a FIN from the server side, which may be received in individual segments or within a single segment. Once it receives an ACK from the server, the client enters the FIN_WAIT_2 state, and once a FIN is received, it enters the TIME_WAIT state and sends an ACK back to the server.

    On the other hand, the side that responds to a termination request, having received a FIN segment, is said to perform a passive close. It responds by sending an ACK and entering the CLOSE_WAIT state, then sends a FIN segment, enters the LAST_ACK state, and waits for the last ACK to be received from the client. The TCP connection is considered closed once the last ACK has been received. In brief, a complete TCP connection is bounded by a SYN segment and a FIN segment in each direction. This is important to note in order to identify any incomplete or interrupted TCP connection.

    The timeline of a simple TCP connection establishment and termination is shown in Figure 2.2, excluding any data transfer segments.
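    The establishment and termination sequences described above can be sketched as a small transition table. This is only an illustrative sketch: the event names (such as `recv_SYN_ACK` or `2msl_timeout`) are hypothetical labels chosen here, and simultaneous open or close, resets and error paths are deliberately left out.

```python
# Simplified transition table for the states named in Figure 2.1.
# Keys are (state, event); values are (next_state, segment_sent).
# Event names are hypothetical labels used only for this illustration.
TRANSITIONS = {
    ("CLOSED", "active_open"): ("SYN_SENT", "SYN"),
    ("CLOSED", "passive_open"): ("LISTEN", None),
    ("LISTEN", "recv_SYN"): ("SYN_RECEIVED", "SYN,ACK"),
    ("SYN_SENT", "recv_SYN_ACK"): ("ESTABLISHED", "ACK"),
    ("SYN_RECEIVED", "recv_ACK"): ("ESTABLISHED", None),
    # Active close: the side that sends the first FIN.
    ("ESTABLISHED", "close"): ("FIN_WAIT_1", "FIN"),
    ("FIN_WAIT_1", "recv_ACK"): ("FIN_WAIT_2", None),
    ("FIN_WAIT_2", "recv_FIN"): ("TIME_WAIT", "ACK"),
    ("TIME_WAIT", "2msl_timeout"): ("CLOSED", None),
    # Passive close: the side that receives the first FIN.
    ("ESTABLISHED", "recv_FIN"): ("CLOSE_WAIT", "ACK"),
    ("CLOSE_WAIT", "close"): ("LAST_ACK", "FIN"),
    ("LAST_ACK", "recv_ACK"): ("CLOSED", None),
}


def run(state, events):
    """Apply a list of events and return the final state and segments sent."""
    sent = []
    for event in events:
        state, segment = TRANSITIONS[(state, event)]
        if segment is not None:
            sent.append(segment)
    return state, sent


# Client side: active open followed by active close.
CLIENT_EVENTS = ["active_open", "recv_SYN_ACK", "close",
                 "recv_ACK", "recv_FIN", "2msl_timeout"]
# Server side: passive open followed by passive close.
SERVER_EVENTS = ["passive_open", "recv_SYN", "recv_ACK",
                 "recv_FIN", "close", "recv_ACK"]
```

    Running `run("CLOSED", CLIENT_EVENTS)` walks the client through SYN_SENT, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2 and TIME_WAIT before returning to CLOSED, which mirrors the SYN and FIN boundaries used to recognise a complete connection.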

    2.2.2 TCP Flow Control

    This section briefly describes the TCP algorithms used to handle both types of traffic (i.e. bulk transfer flows and short-lived flows) within the lifespan of a TCP connection.

    2.2.2.1 Sliding Window Protocol

    TCP flow control is based on the sliding window mechanism. During data transfer, the order of segments sent and received is controlled using sequence numbers. Both sender and receiver keep track of these numbers. Each side of the connection also maintains and advertises its window size, which determines the maximum number of segments it can receive and buffer successfully before processing them. This in turn defines the number of segments a sender may transmit before receiving acknowledgements. This mechanism is maintained using a sliding window at each side, and the window is moved forward whenever a segment is received in the correct sequence.
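    As a minimal sketch of the mechanism described above, the following function simulates a sender that never has more than `window` unacknowledged segments outstanding. It assumes an idealised path (no loss, ACKs returned in order, one ACK per segment); the function name and the event log format are chosen here for illustration only.

```python
from collections import deque


def simulate_sliding_window(total_segments, window):
    """Return the ordered list of ("send", n) and ("ack", n) events."""
    if window < 1:
        raise ValueError("window must be at least one segment")
    next_seq = 0          # next segment number to transmit
    in_flight = deque()   # segments sent but not yet acknowledged
    log = []
    while next_seq < total_segments or in_flight:
        # Fill the window: send while fewer than `window` segments are unacknowledged.
        while next_seq < total_segments and len(in_flight) < window:
            in_flight.append(next_seq)
            log.append(("send", next_seq))
            next_seq += 1
        # The oldest outstanding segment is acknowledged, sliding the window by one.
        log.append(("ack", in_flight.popleft()))
    return log
```

    With `total_segments=5` and `window=3`, the sender emits three segments back to back and then releases one further segment for each ACK received.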

    2.2.3 TCP Congestion Control

    TCP has no prior knowledge of the limitations and conditions of the network path. Accordingly, TCP algorithms must continuously anticipate and adjust their behaviour with respect to the status of the network. This is achieved using two associated mechanisms: slow start and congestion avoidance. Both mechanisms aim to limit the number of unacknowledged packets from sender to receiver, to avoid swamping the receiver or the network with more packets than it can process or buffer. Slow start and congestion avoidance are implemented at the sender side. Fast retransmit and fast recovery are two algorithms meant to deal with segment losses within a TCP connection. According to RFC 5681 (Allman et al., 2009), these four algorithms are the principles of congestion control, and are described in detail in the following sections.

    2.2.3.1 Slow Start

    In order to avoid congestion along a TCP connection, two windows are used. The first, at the sending side, is referred to as the congestion window (cwnd) and limits the number of unacknowledged segments the sender can transmit. The cwnd is evaluated and maintained by the sender and is never advertised. A similar window is used by the receiving side and is referred to as the receiver window (rwnd); it is constantly advertised to the sender to update it about the maximum number of outstanding segments the receiver can support. When transmitting, TCP on the sender side is always bounded by the minimum of cwnd and rwnd (Allman et al., 2009).

    The slow start algorithm is used to gradually increase the cwnd. Slow start is engaged in two phases of the TCP connection: initially, once a TCP connection is established, and subsequently whenever a retransmission timeout occurs, usually as the result of a lost segment.

    As described in RFC 5681 (Allman et al., 2009), the initial value of the cwnd, referred to as the initial window (IW), is decided at the sender side according to the following conditions, where SMSS is the sender's maximum segment size:

    $$IW = \begin{cases} 4 \cdot SMSS \text{ bytes (maximum 4 segments)}, & \text{if } SMSS \le 1095 \text{ bytes,} \\ 3 \cdot SMSS \text{ bytes (maximum 3 segments)}, & \text{if } 1095 \text{ bytes} < SMSS \le 2190 \text{ bytes, or} \\ 2 \cdot SMSS \text{ bytes (maximum 2 segments)}, & \text{if } SMSS > 2190 \text{ bytes.} \end{cases} \tag{2.1}$$

    The cwnd is then incremented by the SMSS value for each ACK received. This behaviour leads to an effect of doubling the cwnd value every RTT, as shown in Figure 2.3.

    [Figure 2.3: sender and receiver timelines showing bursts of cwnd = 1, 2, 4 and 8 segments during slow start, followed by cwnd = 9 in congestion avoidance.]

    Figure 2.3: Slow start and congestion avoidance sending patterns.

    The slow start process continues to increment the cwnd exponentially, as shown in Figure 2.4, which is an efficient way to determine a reasonable window size to be used along a particular TCP connection (Stallings, 2001). This process terminates either when the cwnd reaches a threshold value called ssthresh, after which the TCP connection transitions to the congestion avoidance phase, or when a retransmission timeout occurs, indicating a probable segment loss. Both cases are explained in the next sections.
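    The initial window rule of Equation 2.1 and the per-RTT doubling can be illustrated with a short sketch. The function names are chosen here for illustration; the growth function assumes a lossless path where every segment is acknowledged individually, and it caps the window at ssthresh as a simplification.

```python
def initial_window(smss):
    """Initial congestion window in bytes, following the RFC 5681 thresholds."""
    if smss > 2190:
        return 2 * smss
    if smss > 1095:
        return 3 * smss
    return 4 * smss


def slow_start_history(cwnd, ssthresh, max_rtts):
    """cwnd (in segments) at the start of each RTT until ssthresh is reached.

    Assumes no loss and one ACK per segment, so cwnd ACKs arrive per RTT and
    each adds one segment: the window doubles every round trip.
    """
    history = [cwnd]
    for _ in range(max_rtts):
        if cwnd >= ssthresh:
            break
        cwnd = min(2 * cwnd, ssthresh)
        history.append(cwnd)
    return history
```

    For instance, a common Ethernet SMSS of 1460 bytes gives an initial window of three segments, and starting from one segment with ssthresh set to eight segments yields the 1, 2, 4, 8 pattern of Figure 2.3.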

    2.2.3.2 Congestion Avoidance

    Figure 2.4 illustrates the transition from slow start to congestion avoidance based on the current threshold value ssthresh. During congestion avoidance, the sender adopts an additive increase approach for adjusting the cwnd. Depending on the TCP implementation, it should increment the current cwnd by at most one SMSS per RTT.


    [Figure 2.4: plot of congestion window size (multiples of SMSS, 0 to 20) against time (multiples of RTT, 0 to 18) for TCP Reno and TCP Tahoe, with the slow start and congestion avoidance phases, the ssthresh levels and the timeout event marked.]

    Figure 2.4: Slow start and congestion avoidance, as implemented for TCP Tahoe and TCP Reno.

    2.2.3.3 Retransmission Timeout, Fast Retransmit, and Fast Recovery

    In early implementations of TCP, such as TCP Tahoe, the detection of segment losses was performed using a retransmission timer: once the timer elapses, a segment loss is assumed and a retransmission is triggered. The duration of this timer is referred to as the RTO, and is evaluated in terms of the Round Trip Time (RTT) measured within a TCP connection, as documented in RFC 793 (Postel, 1981). The fact that TCP Tahoe only sends cumulative acknowledgements increases the time needed for the RTO timer to expire and detect a segment loss.

    2.2.3.4 Fast Retransmit

    A more efficient approach for inferring segment loss was proposed by Jacobson (1990) and called fast retransmit. The fast retransmit algorithm suggests that whenever the TCP receiver detects an out-of-order segment, it should send or resend an ACK for the last segment received in the correct order. The receiver should continue sending these duplicate ACKs as long as the missing segment has not been received and the correct sequence of segments has not been restored. On the sending side, receiving a duplicate ACK implies either a congested network or a segment loss. The fast retransmit algorithm states that once three duplicate ACKs have been received by the sender, it should retransmit the last unacknowledged segment. The fact that the receiver has been sending duplicate ACKs implies that it has been receiving segments subsequent to the lost segment. Hence, the sender is only supposed to retransmit the segment assumed lost (Stallings, 2001).
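    The duplicate ACK rule can be sketched as a simple counter on the sender side. In this illustration each ACK value is taken to be the number of the next segment the receiver expects, so a repeated value identifies the missing segment; the function name and the representation of ACKs as plain integers are assumptions made for the example.

```python
def fast_retransmit(acks, threshold=3):
    """Return the segments retransmitted after `threshold` duplicate ACKs.

    `acks` is the sequence of cumulative ACK values seen by the sender, where
    each value is the next segment the receiver expects.
    """
    retransmitted = []
    last_ack = None
    duplicates = 0
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            # Retransmit exactly once, when the third duplicate arrives.
            if duplicates == threshold:
                retransmitted.append(ack)
        else:
            last_ack = ack
            duplicates = 0
    return retransmitted
```

    For the ACK sequence 1, 2, 3, 3, 3, 3, 7 the first ACK for segment 3 is followed by three duplicates, so segment 3 alone is retransmitted; the later ACK of 7 shows that the subsequent segments had already arrived.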

    2.2.3.5 Fast Recovery

    At any moment within the lifetime of a TCP connection, whether during the slow start or the congestion avoidance phase, the occurrence of a timeout reinitialises the slow start process and resets the current cwnd to one SMSS. In addition, the limiting threshold (ssthresh) has to be set to either half the amount of unacknowledged data outstanding in the network, or twice the SMSS value, whichever is larger (Allman et al., 2009). This anticipates further congestion of the kind previously experienced with the higher ssthresh.

    The adjustment of the cwnd itself depends on the TCP implementation. It may be reset to one SMSS, as in TCP Tahoe, so that slow start is reinitialised until the new ssthresh is reached, or it may be set directly to the new ssthresh, so that congestion avoidance is invoked directly, as implemented in TCP Reno when the loss is detected through duplicate ACKs. The approach taken by TCP Tahoe and TCP Reno is illustrated in Figure 2.4.
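    The two reactions can be contrasted in a short sketch. This is a simplification under stated assumptions: the function and variant names are chosen here, values are in bytes, and the temporary window inflation that TCP Reno applies during fast recovery, as well as its fall back to one SMSS on a retransmission timeout (RFC 5681), are omitted.

```python
def react_to_loss(flight_size, smss, variant):
    """Return (cwnd, ssthresh) in bytes after a loss event.

    flight_size is the amount of unacknowledged data outstanding in the network.
    """
    # New threshold: half the outstanding data, but never below two segments.
    ssthresh = max(flight_size // 2, 2 * smss)
    if variant == "tahoe":
        cwnd = smss        # restart slow start from one segment
    elif variant == "reno":
        cwnd = ssthresh    # continue directly in congestion avoidance
    else:
        raise ValueError("variant must be 'tahoe' or 'reno'")
    return cwnd, ssthresh
```

    With 16 segments of 1460 bytes in flight, both variants halve ssthresh to 8 segments, but only the Tahoe branch returns to a one-segment window.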

    2.2.4 Idle Time Considerations

    As demonstrated later in the traffic analysis, idle time within the lifespan of a TCP connection may be substantial in some cases, and accordingly may lead to significant error in estimating the throughput or transmission time. It is therefore essential to explore how TCP implementations deal with these idle times.

    2.2.5 TCP Timers

    In addition to purely logical idle time, the conditional transition from one TCP stage to another can be time consuming, and may negatively affect the evolution of a TCP connection and increase the overhead observed either in reaching a smooth data transfer stage or in efficiently terminating the transmission. Accordingly, these transitions have to be bounded by different TCP timers to ensure that TCP does not remain stuck in a certain stage. The understanding of these timers was particularly useful when performing manual analysis of TCP connections in later stages of the research.

    TCP implementations use two different clocks (tick counters): a slower clock with an interval of 500 ms, and a faster clock with an interval of 200 ms. In order to regulate the value of each TCP timeout, these timers are triggered after a multiple of ticks (of 500 ms or 200 ms) as needed (Stevens and Wright, 1995).

    According to the implementation specifications described by Stevens and Wright (1995), seven types of timers are used, and are listed as follows:

    Connection Establishment Timer: At connection establishment, the first SYN segment sent from the client times out after around 6 seconds (12 ticks). After that, the client sends a second and a third SYN segment, which time out after 24 seconds and 48 seconds respectively. In a typical implementation of TCP, after a total period of 75 seconds without a response from the server, the TCP connection is aborted (Stevens and Wright, 1995).

    Retransmission Timer: As previously mentioned, the retransmission timer is used to infer a segment loss. For each segment sent, if the RTO timer elapses before an ACK is received from the receiver, the segment is resent. The RTO value is calculated dynamically from previous values of the smoothed RTT (SRTT) and the RTT variation (RTTVAR), as described in RFC 6298 (Paxson et al., 2011):

    Initially, when no RTT has been measured and no RTO has been calculated, the RTO is set to one second.

    Once a first RTT measurement has been made, SRTT is set to the measured RTT, and RTTVAR is set to half the measured RTT. The current RTO is then calculated as follows:

    $$RTO = SRTT + K \cdot RTTVAR, \quad \text{where } K = 4 \tag{2.2}$$

    With each further RTT measurement, the values of RTTVAR and SRTT are adjusted, in that order, as follows:

    $$RTTVAR = (1 - \beta) \cdot RTTVAR + \beta \cdot |SRTT - RTT|, \quad \text{where } \beta = 1/4 \tag{2.3}$$

    $$SRTT = (1 - \alpha) \cdot SRTT + \alpha \cdot RTT, \quad \text{where } \alpha = 1/8 \tag{2.4}$$
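    Equations 2.2 to 2.4 can be sketched as a small estimator. The class name is chosen here for illustration, times are in seconds, and the clock granularity term and the lower bound on the RTO that RFC 6298 also specifies are left out to match the simplified form used above.

```python
class RtoEstimator:
    """Retransmission timeout estimator following Equations 2.2 to 2.4."""

    K = 4
    ALPHA = 1 / 8
    BETA = 1 / 4

    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # initial RTO before any RTT has been measured

    def update(self, rtt):
        """Feed one RTT measurement and return the new RTO."""
        if self.srtt is None:
            # First measurement.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first because it uses the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = self.srtt + self.K * self.rttvar
        return self.rto
```

    As a worked example, a first RTT of 100 ms gives SRTT = 0.1, RTTVAR = 0.05 and RTO = 0.3 s; a second RTT of 200 ms then gives RTTVAR = 0.0625, SRTT = 0.1125 and RTO = 0.3625 s.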

    Delayed Acknowledgement Timer: Whenever segments have been received but not yet acknowledged, and do not require an immediate ACK, the delayed acknowledgement timer is started. Once the timer expires, a cumulative ACK is sent to acknowledge all received segments. The aim of this mechanism is to reduce the overhead resulting from sending an individual ACK for every segment. The typical timer value is 200 ms, and it can be changed up to 500 ms.

    Persist Timer: The window size of the receiver has to be continuously advertised to the sender. Since ACK segments are not reliably transmitted, the connection may enter a deadlock state in which the receiver is waiting for further data from the sender, while the sender is waiting for the receiver's window to be advertised. In order to avoid this deadlock, TCP triggers the persist timer whenever a zero window size is advertised. If the timer elapses without a non-zero value being received, the sender responds by issuing a probe to the receiver.

    Keep Alive Timer: The keep-alive timer is an optional service run at the TCP application layer, which allows one side of the TCP connection to probe the other side to check whether it is still alive after a prolonged idle period in which no data or acknowledgements have been exchanged. The timer is turned off by default, and its default value is two hours, as documented in RFC 1122 (Braden, 1989).

    FIN_WAIT_2 Timer: A TCP connection termination is initiated by one side sending a FIN segment to the other side and receiving an ACK segment, then waiting for a similar FIN segment in the opposite direction, as previously demonstrated in Figure 2.1. As TCP transitions to the FIN_WAIT_2 stage, a timer of 10 minutes is first started, and then reinitialised to 75 seconds. On expiration of the second timer, the TCP connection is dropped if no FIN segment has been received.

    TIME_WAIT Timer: As mentioned in section 2.2.1.3, TIME_WAIT is the last state that a TCP client ends up in, once FIN and ACK segments have been exchanged in each direction. The client remains in this state for a period equal to twice the Maximum Segment Lifetime (MSL), which is the maximum amount of time a segment can remain valid in a network without being discarded, and which has a default value of two minutes. Accordingly, this timer is referred to as the 2MSL timer, and has a default value of four minutes (Stevens and Wright, 1995). At this point the TCP connection is considered cleanly and logically terminated; however, the TCP socket does not transition to a closed state until the 2MSL period has passed. The purpose of this timer is to ensure that no delayed segments will be wrongly considered part of a subsequent TCP connection (Postel, 1981). This state timer is not of particular interest from an IP (layer 3) perspective.

    2.3 Formula-Based Modelling

    Formula-based models depend on mathematical expressions to evaluate the expected TCP throughput from the TCP parameters. The following mathematical model was proposed by Mathis et al. (1997) and is referred to as the square-root formula.

    $$E[R] = \frac{M}{T\sqrt{\frac{2bp}{3}}} \tag{2.5}$$

    Where E[R] is the expected TCP throughput, R is the actual throughput, M is the maximum segment size, b is the number of TCP segments per new ACK, T is the RTT, and p is the loss rate. Another mathematical model was also proposed by Padhye et al. (1998):

    $$E[R] = \min\left(\frac{M}{T\sqrt{\frac{2bp}{3}} + T_0 \min\left(1, 3\sqrt{\frac{3bp}{8}}\right) p\,(1 + 32p^2)},\ \frac{W}{T}\right) \tag{2.6}$$

    Where $T_0$ is the TCP retransmission timeout period and W is the maximum window size. The value of the TCP throughput is evaluated from the TCP parameters while the TCP flow is in progress, and hence the value is considered an estimate rather than a prediction. A slight modification was introduced by He et al. (2007), who used TCP transfer probes prior to the transfer of the original flow. The probes used could be ping sessions or very small TCP transfers (64 KB). These probes are sent periodically in order to determine the TCP parameters on the path, such as RTT and loss rate. These parameters are then fed into the mathematical model in order to predict the TCP throughput value prior to any flow transfer. The accuracy obtained from this formula-based model was relatively low, with a median Root Mean Square Relative Error (RMSRE) of 2. The RMSRE was less than 0.4 for only 20% of the traces.
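    As an illustration of how these formulas are evaluated, Equations 2.5 and 2.6 can be written as two short functions. The function and parameter names are chosen here; the sketch assumes M and W are both expressed in bytes and T and T0 in seconds, so that the result is in bytes per second, and it requires a loss rate strictly greater than zero.

```python
from math import sqrt


def mathis_throughput(mss, rtt, p, b=1):
    """Square-root formula (Equation 2.5); valid only for p > 0."""
    return mss / (rtt * sqrt(2 * b * p / 3))


def padhye_throughput(mss, rtt, p, t0, wmax, b=1):
    """Padhye et al. model (Equation 2.6); wmax is the maximum window in bytes."""
    timeout_term = t0 * min(1.0, 3 * sqrt(3 * b * p / 8)) * p * (1 + 32 * p ** 2)
    loss_limited = mss / (rtt * sqrt(2 * b * p / 3) + timeout_term)
    window_limited = wmax / rtt
    return min(loss_limited, window_limited)
```

    Because the second formula adds a timeout term to the denominator and caps the result by W/T, it never exceeds the square-root estimate for the same parameters.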

    2.3.1 Cardwell Mathematical Model

    A mathematical model was proposed by Cardwell et al. (2000). This model was used as a reference and baseline in this research in order to evaluate the performance of the AI-based model developed in later stages of the research. This model in particular was selected for its complete modelling of all the stages observed in a TCP connection, as well as its applicability to both lossless and lossy TCP flows. The model defines and aggregates the time spent during four phases of the TCP connection: slow start, recovery from a lost segment, congestion avoidance, and time spent due to delayed acknowledgements, as expressed in Equation 2.7. Each phase and its associated mathematical representation, as described in Cardwell et al. (2000), is explained in the following section. The script implementation of the model in MATLAB is documented in Appendix C.1.

    $$E[T] = E[T_{SS}] + E[T_{loss}] + E[T_{ca}] + E[T_{delack}] \tag{2.7}$$

    Connection Establishment: Cardwell et al. (2000) proposed an estimate for the time cost of connection establishment. Nevertheless, this period was not within the scope of this research, as the TCP throughput or transmission time was mainly evaluated for the data transfer period of TCP connections. The transmission time evaluated by tcptrace was also exclusively the time spent from the first data segment to the last data segment observed, excluding connection establishment and termination.

    Initial Slow Start: As the occurrence of a segment loss would end the slow start phase, Cardwell et al. (2000) first evaluate the probability of this event in terms of the loss rate p and the number of data segments to be transferred d, as per Equation 2.8. They then calculate the number of segments expected to be sent during slow start in terms of this probability, as per Equation 2.9.


    $$l_{SS} = 1 - (1 - p)^d \tag{2.8}$$

    $$E[d_{SS}] = \begin{cases} \dfrac{\left(1 - (1 - p)^d\right)(1 - p)}{p} + 1 & \text{if loss rate } p > 0 \\ d & \text{if loss rate } p = 0 \end{cases} \tag{2.9}$$

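    Knowing the number of segments sent during slow start, the expected window size at the end of slow start is calculated as per Equation 2.10, where $w_1$ is the initial congestion window and $\gamma$ is the rate of exponential growth of the window during slow start (in Cardwell et al. (2000), $\gamma = 1 + 1/b$).

    $$E[W_{SS}] = \frac{E[d_{SS}](\gamma - 1)}{\gamma} + \frac{w_1}{\gamma} \tag{2.10}$$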

    The total time spent in slow start is then calculated as per Equation 2.11.

    E[TSS] =

    RTT [log(Wmaxw1 ) + 1 + 1Wmax (E[dSS]Wmaxw1

    1 )] whenE[WSS] > Wmax

    RTT log(E[dSS ](1)w1 + 1) whenE[WSS] Wmax(2.11)
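    The slow start stage of the model, Equations 2.8 to 2.11, can be sketched as follows. This is an illustrative Python sketch and not the MATLAB script of Appendix C.1; the function name is chosen here, and the growth rate is taken as 1 + 1/b following Cardwell et al. (2000), with b = 2 as an assumed default for delayed acknowledgements.

```python
from math import log


def slow_start_stage(d, p, rtt, w1, wmax, b=2):
    """Return (l_ss, d_ss, w_ss, t_ss) for a transfer of d segments.

    p is the loss rate, w1 the initial window and wmax the maximum window
    (both in segments), rtt the round trip time in seconds.
    """
    gamma = 1 + 1 / b                      # growth rate of cwnd per RTT
    l_ss = 1 - (1 - p) ** d                # Equation 2.8
    if p > 0:                              # Equation 2.9
        d_ss = (1 - (1 - p) ** d) * (1 - p) / p + 1
    else:
        d_ss = d
    w_ss = d_ss * (gamma - 1) / gamma + w1 / gamma   # Equation 2.10
    if w_ss > wmax:                        # Equation 2.11, window-limited case
        t_ss = rtt * (log(wmax / w1, gamma) + 1
                      + (d_ss - (gamma * wmax - w1) / (gamma - 1)) / wmax)
    else:                                  # Equation 2.11, unconstrained case
        t_ss = rtt * log(d_ss * (gamma - 1) / w1 + 1, gamma)
    return l_ss, d_ss, w_ss, t_ss
```

    As a sanity check, a lossless transfer of 15 segments with w1 = 1 and b = 1 (so the window doubles each round) takes four round trips, matching the bursts 1 + 2 + 4 + 8 = 15; with the window capped at four segments it takes five round trips (1, 2, 4, 4, 4).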

    Occurrence of First Loss: Cardwell et al. (2000) then evaluate the probability that a packet loss leads to a retransmission timeout (RTO), as per Equation 2.12, and the expected time cost of an RTO, as per Equation 2.14.

    $$Q(p, w) = \min\left(1,\ \frac{1 + (1 - p)^3\left(1 - (1 - p)^{w - 3}\right)}{\left(1 - (1 - p)^w\right) / \left(1 - (1 - p)^3\right)}\right) \tag{2.12}$$

    $$G(p) = 1 + p + 2p^2 + 4p^3 + 8p^4 + 16p^5 + 32p^6 \tag{2.13}$$

    $$E[Z^{TO}] = \frac{G(p)\,T_0}{1 - p} \tag{2.14}$$

    $E[T_{loss}]$, calculated in Equation 2.15, is then the expected time spent recovering from segment loss.


    $$E[T_{loss}] = l_{SS}\left(Q(p, E[W_{SS}]) \cdot E[Z^{TO}] + \left(1 - Q(p, E[W_{SS}])\right) \cdot RTT\right) \tag{2.15}$$
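    Equations 2.12 to 2.15 can be sketched in the same way. The function names are chosen here for illustration, and the sketch assumes a loss rate strictly between 0 and 1, since Equation 2.12 is undefined for p = 0.

```python
def timeout_probability(p, w):
    """Q(p, w): probability that a loss in a window of w leads to a timeout (Eq. 2.12)."""
    numerator = 1 + (1 - p) ** 3 * (1 - (1 - p) ** (w - 3))
    denominator = (1 - (1 - p) ** w) / (1 - (1 - p) ** 3)
    return min(1.0, numerator / denominator)


def backoff_factor(p):
    """G(p): accounts for exponential backoff over successive timeouts (Eq. 2.13)."""
    return 1 + p + 2 * p ** 2 + 4 * p ** 3 + 8 * p ** 4 + 16 * p ** 5 + 32 * p ** 6


def expected_timeout_cost(p, t0):
    """E[Z^TO]: expected duration of a timeout sequence (Eq. 2.14)."""
    return backoff_factor(p) * t0 / (1 - p)


def loss_recovery_time(l_ss, p, w_ss, t0, rtt):
    """E[T_loss]: expected time spent recovering from the first loss (Eq. 2.15)."""
    q = timeout_probability(p, w_ss)
    return l_ss * (q * expected_timeout_cost(p, t0) + (1 - q) * rtt)
```

    For a window of three segments the probability evaluates to one, reflecting that too few duplicate ACKs can be generated to trigger fast retransmit, whereas for larger windows and low loss rates it falls well below one.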

    Transferring the Remainder: The number of data segments left to be transmitted after slow start and loss recovery, $d_{ca}$, is calculated by Equation 2.16, and the expected size of the congestion window at a segment loss event, $W(p)$, is calculated by Equation 2.17.

    $$E[d_{ca}] = d - E[d_{SS}] \tag{2.16}$$

    $$W(p) = \frac{2 + b}{3b} + \sqrt{\frac{8(1 - p)}{3bp} + \left(\frac{2 + b}{3b}\right)^2} \tag{2.17}$$

    Cardwell et al. (2000) evaluate the steady state throughput (R) using Equation 2.18, and accordingly derive the time needed to transmit the remaining segments at that throughput, as per Equation 2.19. This is the time spent in the congestion avoidance phase.

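    $$R = \begin{cases} \dfrac{\frac{1 - p}{p} + \frac{W(p)}{2} + Q(p, W(p))}{RTT\left(\frac{b}{2} W(p) + 1\right) + \frac{Q(p, W(p))\,G(p)\,T_0}{1 - p}} & \text{if } W(p) < W_{max} \\[3ex] \dfrac{\frac{1 - p}{p} + \frac{W_{max}}{2} + Q(p, W_{max})}{RTT\left(\frac{b}{8} W_{max} + \frac{1 - p}{p\,W_{max}} + 2\right) + \frac{Q(p, W_{max})\,G(p)\,T_0}{1 - p}} & \text{otherwise} \end{cases} \tag{2.18}$$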

    $$E[T_{ca}] = \frac{E[d_{ca}]}{R} \tag{2.19}$$
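    The congestion avoidance stage, Equations 2.16 to 2.19, can be sketched as below. The helper names are chosen here for illustration, the throughput is expressed in segments per second, and a loss rate strictly between 0 and 1 is assumed; the two helpers for Q and G repeat Equations 2.12 and 2.13 so that the block is self-contained.

```python
from math import sqrt


def timeout_probability(p, w):
    """Q(p, w) from Equation 2.12."""
    numerator = 1 + (1 - p) ** 3 * (1 - (1 - p) ** (w - 3))
    denominator = (1 - (1 - p) ** w) / (1 - (1 - p) ** 3)
    return min(1.0, numerator / denominator)


def backoff_factor(p):
    """G(p) from Equation 2.13."""
    return 1 + p + 2 * p ** 2 + 4 * p ** 3 + 8 * p ** 4 + 16 * p ** 5 + 32 * p ** 6


def window_at_loss(p, b=2):
    """W(p): expected congestion window at a loss event (Equation 2.17)."""
    k = (2 + b) / (3 * b)
    return k + sqrt(8 * (1 - p) / (3 * b * p) + k ** 2)


def steady_state_throughput(p, rtt, t0, wmax, b=2):
    """R in segments per second (Equation 2.18)."""
    w = window_at_loss(p, b)
    if w < wmax:
        q = timeout_probability(p, w)
        numerator = (1 - p) / p + w / 2 + q
        denominator = rtt * (b / 2 * w + 1) + q * backoff_factor(p) * t0 / (1 - p)
    else:
        q = timeout_probability(p, wmax)
        numerator = (1 - p) / p + wmax / 2 + q
        denominator = (rtt * (b / 8 * wmax + (1 - p) / (p * wmax) + 2)
                       + q * backoff_factor(p) * t0 / (1 - p))
    return numerator / denominator


def congestion_avoidance_time(d, d_ss, throughput):
    """E[T_ca]: time to send the segments left after slow start (Eqs. 2.16 and 2.19)."""
    d_ca = d - d_ss
    return d_ca / throughput
```

    For a 1% loss rate with b = 2 the expected window at loss is roughly 12 segments, and raising the loss rate lowers the resulting throughput, as expected.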

    Delayed Acknowledgements: The delayed ACK timer is meant to delay the transmission of ACKs, combining several ACKs into a single ACK in order to minimise the overhead. This timer depends on the TCP implementation, and usually ranges from 150 to 200 ms.


    2.4 Previous Research and Machine Learning Approaches

    Machine learning is an area of artificial intelligence (AI) in which computers are able to modify and adapt their behaviour, and to take actions, make decisions or make predictions, such that these actions and decisions become progressively more accurate in reflecting the real and correct ones (Freeman and Skapura, 1991).

    This part of the literature review provides a survey of previous research approaches to developing models for TCP performance evaluation using machine learning techniques, highlighting their research methodologies and associated results and findings. Then, an overview of artificial intelligence modelling is presented, which describes in particular the concepts behind artificial neural networks. Detailed information about backpropagation artificial neural networks is included in Chapter 5, where the modelling approach taken and the techniques used within this research project are described.

    He et al. (2007) carried out research to develop a model for predicting the TCP throughput of bulk TCP transfers in particular. As a testbed, their research made use of the MIT RON (Resilient Overlay Networks) project, whose architecture is made up of 50-60 nodes distributed across universities, research labs and ISPs in the US, Europe and Asia.

    In their research, they first emphasised the difference between performance estimation and performance prediction for a network path.

    2.4.1 Performance Estimation

    The estimation of TCP performance is performed after the TCP flow has started, and can be evaluated all along the transmission flow. For a certain flow, the TCP parameters and path characteristics are fed into the TCP performance evaluation model in order to estimate the value of the TCP throughput. This approach is considered non-intrusive, as no additional traffic is generated on the network path, as opposed to the approach taken in performance prediction.

    2.4.2 Performance Prediction

    The objective of predicting the performance of a TCP transmission is to evaluate the expected TCP throughput value prior to the start of the TCP flow. This approach is usually performed using probes, such as the ping utility or small TCP transfers, that are generated and scheduled periodically. The measurements obtained from these probes are then used as inputs to TCP throughput evaluation models. This probing approach can be considered highly intrusive if it leads to the saturation of the network path, and hence the probes should be limited in both size and frequency as much as possible.

    He et al. (2007) classified the models used to evaluate the performance of TCP transfers into two categories: formula-based or mathematical models, such as the ones previously described in this chapter, and history-based models, each approach having its own advantages and drawbacks.

    2.4.3 History-Based Models

    History-based models mainly depend on the knowledge acquired from historical TCP transfers. The models use adaptive learning in order to form relationships between the observed path characteristics and the resulting TCP throughput of each transfer. Accordingly, history-based models are independent of the TCP implementation used at the server and receiver ends, which is considered a great advantage over mathematical models.

    He et al. (2007) developed a history-based prediction model based on linear predictors such as Moving Average, Exponential Weighted Moving Average, and non-seasonal Holt-Winters. Such linear predictors estimate future values of TCP throughput as a linear function of previous samples. Their history-based model gave better accuracy, with an RMSRE of less than 0.4 for 90% of the traces. He et al. (2007) suggested utilizing hybrid predictors, which would consider TCP transfer characteristics as well as throughput history, in order to obtain more accurate throughput estimates. It was also suggested to develop TCP throughput models that take the path's load, buffering and cross-traffic nature as inputs, such that the model would be independent of TCP connection characteristics.
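    Two of the linear predictors named above can be sketched in a few lines. This is a generic illustration and not the configuration used by He et al. (2007): the function names, the window length `n` and the smoothing weight `alpha` (applied here to the newest sample) are assumptions made for the example.

```python
def moving_average(history, n):
    """Predict the next throughput as the mean of the last n samples."""
    if not history or n < 1:
        raise ValueError("need at least one sample and n >= 1")
    window = history[-n:]
    return sum(window) / len(window)


def ewma(history, alpha):
    """Predict the next throughput with an exponentially weighted moving average.

    alpha is the weight of the newest sample; older samples decay geometrically.
    """
    if not history:
        raise ValueError("need at least one sample")
    estimate = history[0]
    for sample in history[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate
```

    Both predictors are linear functions of past throughput samples only, which is why they need no knowledge of the TCP implementation at either end of the path.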

    Further research was carried out by Mirza et al. (2010), who adopted a machine learning approach to predict TCP throughput. They used Support Vector Regression (SVR), which is a supervised learning method used for pattern classification depending on the dataset used for training the classification model. They used a laboratory testbed consisting of end hosts connected through a dumbbell topology, with a bottleneck point to create and control congestion throughout their experiments. Monitoring cards were placed at the congestion point to capture packets leaving and entering the bottleneck.

    The measurements used in their models were the available bandwidth on the congested link, the queuing at the bottleneck node, and the loss rate. They used both passive and active path measurements. For the passive measurements, the parameters (available bandwidth, queuing, and loss rate) were obtained from pre-captured TCP flows, and for the active measurements the same parameters were obtained from the active monitoring cards. The results of their experiments indicated that for bulk TCP transfers, the predicted TCP throughput was within 10% of the actual value 87% of the time. For possible future work, Mirza et al. (2010) suggested considering machine learning tools other than the SVR approach. They also emphasized the importance of finely tuning the training sets used for developing the model in the supervised learning process.

    A research approach for estimating TCP performance using neural networks was adopted by Ghita et al. (2005). In their research they used three sources of captured traffic: synthetic connections generated by network simulators, semi-supervised connections captured from automatic retrieval tools, and unsupervised traffic captured from real network traffic traces. They divided their training datasets into two categories, one without packet losses and another with packet losses. They used the Stuttgart Neural Network Simulator (SNNS) for training on the datasets and for developing the neural network model. The results obtained from the neural network model revealed a significant improvement over the mathematical model, with nearly a tenfold improvement in relative error. On the other hand, for the traffic with packet losses, the mathematical model showed better performance.

    2.4.4 Artificial Neural Networks

    An artificial neural network is a machine learning tool used for recognition or classification processes. The use of artificial neural networks can drastically simplify the complex mathematical models needed for modelling. Additionally, neural networks are recognised for improving estimation accuracy by inferring new relationships among inputs, and between inputs and their associated target outputs. Neural networks are also recognised for being able to extend their recognition and classification knowledge by producing estimated output values for inputs that were not encountered by the neural network during either training or validation, and hence for being applicable to extended and larger datasets.

    In this research, backpropagation feed-forward neural networks will primarily be considered for the modelling process. Backpropagation, or propagation of error, is a common method of teaching artificial neural networks how to perform a given task. It is a supervised learning method, and is an implementation of the Delta rule. It is most useful for feed-forward networks (networks that have no feedback) (Freeman and Skapura, 1991).
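    The Delta rule mentioned above can be illustrated on the simplest possible case, a single linear unit. This is only a sketch of the underlying update, not of the networks developed in this research: the function name, learning rate, epoch count and toy samples are all assumptions made for the example, and backpropagation generalises the same error-driven update to networks with hidden layers.

```python
def train_linear_unit(samples, learning_rate=0.05, epochs=500):
    """Fit output = weight * x + bias with the Delta rule.

    samples is a list of (x, target) pairs; weights start at zero.
    """
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            output = weight * x + bias
            error = target - output
            # Delta rule: move each parameter in proportion to the error
            # and to the input that fed that parameter.
            weight += learning_rate * error * x
            bias += learning_rate * error
    return weight, bias


# Toy, noise-free samples drawn from target = 2 * x + 1 (illustrative only).
SAMPLES = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
```

    Training on the toy samples drives the weight and bias towards 2 and 1 respectively, so the unit also produces sensible outputs for inputs it has not seen, such as x = 4.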

    2.5 Summary

    In this chapter, the theoretical fields of study and research related to the project have been covered. A general overview of the transition between TCP stages, and a brief description of each stage, has been provided. The different timers and associated periods of idle time have been explained. This was essential in order to obtain a good understanding of the different conditions encountered by TCP connections.

    An overview has been given of previous research based on machine learning approaches, including the conditions that were considered, such as the network topologies from which traffic was captured and the number of testbed samples used. The findings of each study have been presented, and the accuracy obtained by the models developed has been documented. This was necessary in order to have an initial baseline for the expected accuracy and performance of artificial intelligence methods in modelling TCP connections. Some assumptions have been made regarding the approaches to consider when developing neural network models, such as the type of traffic to be used, and how to categorise traffic according to the presence of any lost segments. These results and findings are expected to be compared with the results obtained on completion of this project.


  • 3 Research Methodology

    After completing the literature review, the practical part of the project was carried out in four main stages: collecting and analysing different traffic captures, extracting the relevant TCP parameters needed, filtering several data subsets as required, and finally using these subsets to train the neural network. The main stages of the project are shown in the process diagram in Figure 3.1 and are explained in the following sections:

    [Figure 3.1: process diagram. Blocks: TCP Traffic, Extract Parameters, TCP Input Parameters, TCP Performance (Transmission Time & Throughput), Mathematical Model, Neural Network Modelling (Lossless Model, Lossy Model), Post-Processing, Accuracy Analysis (Correlation, MSE), Manual Analysis and Traffic Analysis, linked by the Calculated Performance, Estimated Performance and Feedback flows.]

    Figure 3.1: Process diagram of research stages.

    3.1 Data Acquisition

    Using synthetic connections for testing was a possible option. However, the main aim of the research was to study and investigate the behaviour of everyday Internet traffic, and hence to rely solely on connections captured from either large enterprises or T1 lines. Several sources of captured traffic were considered in the analysis and in training the AI-based neural network model. The initial purpose was to cover as many types of connections with various conditions as possible in the training process, aiming to obtain better learning rates and faster convergence of the neural model. Another reason was to cover different traffic types, in order to develop a robust neural model that provides better estimation accuracy. It was observed that the throughput of TCP connections depends not only on the TCP parameters of each connection, but also on the network conditions of each trace; these are conditions that are not accounted for in conventional TCP mathematical models, such as the sending behaviour of the TCP servers, which results in varied and inconsistent idle time periods. The following sections describe the characteristics of the three sources of captured traffic used in this research.

    All analysis and modelling within the research was initially performed on an aggregated dataset including both the traffic captured from the Brescia University campus and the few traffic traces collected from the MAWI Group. The purpose of aggregating these traces into one dataset was to obtain a single dataset large enough for neural network modelling, and to ensure that the number of TCP connections used as training and validation samples when developing the neural network model would be sufficient to avoid over-fitting the models to the data available. The total number of connections aggregated from these two sources was 1,900,440 TCP connections. At later stages of the research, the dataset collected from the campus of Plymouth University was used in order to validate the results and analysis performed. The total number of connections captured on the campus of Plymouth University and used for results validation was 6,355,344 TCP connections.

    Campus Network of Brescia University (UNIBS): These traces were collected on the edge router of the campus network of the University of Brescia on three consecutive working days. They are mainly composed of TCP (99%) and UDP traffic, and correspond to around 79,000 flows in total (UNIBS: Data sharing, 2011). More information on these traffic traces is listed in Appendix A.1.

    MAWI Working Group Traffic Archive: These are daily traces captured on a trans-Pacific line (150 Mbps link) (MAWI Working Group Traffic Archive, n.d.). Several traffic traces were selected and aggregated into a single trace. Detailed information about these traces is listed in Appendix A.2.

    Campus Network of Plymouth University: Hundreds of traffic captures were collected at the campus of Plymouth University. These captures were all aggregated into a single dataset after excluding incomplete connections. The dataset included 5,665,167 TCP connections made by local clients to remote servers, and 690,186 TCP connections made by remote clients to local servers.

    3.2 Data Pre-processing

    The following tools and software were used during the research.

    3.2.1 TCPTRACE

    All collected traces were initially processed through tcptrace in order to produce datasets of TCP flow records associated with each trace. tcptrace is a network tool running under Linux, which can accept traffic captures from other tools such as tcpdump, and produce information about each TCP connection as seen in that traffic. Further information on TCP parameter extraction is given in Chapter 4.

    3.2.2 Data Processing in MATLAB

    Once datasets of TCP records were available, they were imported into MATLAB for further processing. Each dataset was then divided into two subsets according to whether segment loss was identified or not. Lossless and lossy subsets have been used separately in all stages of the research in order to provide clearer results and analysis of the capability of the models to estimate the performance of lossy TCP traffic. Detailed steps of the different stages of data pre-processing are included in Chapter 4.

    3.3 Neural Network Modelling in MATLAB

    The MATLAB Neural Network Toolbox was used to develop the neural network models due to its efficiency and simplicity in designing different models. The toolbox also provides automated visualisation tools for detailed performance measures of the models developed.

    3.4 Statistical Analysis in MATLAB

    All statistical analysis in this research was performed using the Statistics Toolbox in MATLAB. The output of the developed neural network model, representing the transmission time estimated for each TCP connection, was compared with the actual value of transmission time as collected from tcptrace. This comparison was done using correlation analysis in MATLAB.

    3.4.1 Regression

    According to Hair et al. (1995), regression analysis is a general statistical technique used to analyse and identify a relationship between a single dependent parameter and a set of other independent parameters. In this research we are mainly concerned with applying regression analysis between the actual throughput and the throughput estimated by each model. This regression is represented with a simple fitting line as shown in Figure 3.2. Residual values are represented as the deviation from the fitted regression line.

    [Scatter plot of network outputs against targets. Horizontal axis: Targets. Vertical axis: Outputs. Legend: Y = T, Fitting Line, Scattered Data. Annotation: Residual value.]

    Figure 3.2: Regression analysis showing regression fitting line and residual values.
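
    The fitting line and residuals described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the research itself used MATLAB), it assumes an ordinary least-squares fit of outputs against targets, and the function names are hypothetical.

```python
def fit_regression_line(targets, outputs):
    """Least-squares line: outputs ~ slope * targets + intercept."""
    n = len(targets)
    if n != len(outputs) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_t = sum(targets) / n
    mean_o = sum(outputs) / n
    sxx = sum((t - mean_t) ** 2 for t in targets)
    if sxx == 0:
        raise ValueError("targets must not all be identical")
    sxy = sum((t - mean_t) * (o - mean_o) for t, o in zip(targets, outputs))
    slope = sxy / sxx
    intercept = mean_o - slope * mean_t
    return slope, intercept


def residuals(targets, outputs, slope, intercept):
    """Deviation of each output from the fitted regression line."""
    return [o - (slope * t + intercept) for t, o in zip(targets, outputs)]
```

    A slope close to 1 and an intercept close to 0 would indicate that the fitted line approaches the ideal line Y = T.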

    3.4.2 MSE

    The performance of the neural network model was continuously evaluated during the learning process using the Mean Squared Error (MSE) between actual and estimated throughput values. According to Kleinbaum et al. (1997), the MSE is expressed as the sum of squared errors divided by the corresponding degrees of freedom (n-k-1), where k is the number of independent variables in the model, and n is the number of samples. The MSE is expressed in Equation 3.1.

    \[ \mathrm{MSE} = S^2 = \frac{1}{n-k-1} \sum_{i=1}^{n} e_i^2 \tag{3.1} \]
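
    As a worked illustration of Equation 3.1, the following minimal Python sketch computes the sum of squared errors divided by the degrees of freedom (n-k-1). The function name is hypothetical and the sketch is not the code used in the research.

```python
def mse_with_dof(actual, estimated, k):
    """Sum of squared errors divided by (n - k - 1), as in Equation 3.1.

    k is the number of independent variables in the model.
    """
    n = len(actual)
    if n != len(estimated):
        raise ValueError("actual and estimated must have the same length")
    dof = n - k - 1
    if dof <= 0:
        raise ValueError("not enough samples for the given k")
    sse = sum((a - e) ** 2 for a, e in zip(actual, estimated))
    return sse / dof
```

    For example, with four samples, one independent variable and a single error of 2 on one sample, the sum of squared errors is 4 and the degrees of freedom are 2, giving an MSE of 2.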

    3.4.3 Absolute Relative Error

    The statistical cumulative distribution function (CDF) of absolute relative error was used to study the accuracy obtained by each model and compare it with other models or other modelling criteria.
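
    A minimal Python sketch of this measure is given below. It assumes the usual definition of absolute relative error, |estimated - actual| / |actual|, which is not spelled out in this section, and the function names are hypothetical.

```python
def absolute_relative_errors(actual, estimated):
    """Assumed definition: |estimated - actual| / |actual| per sample."""
    return [abs(e - a) / abs(a) for a, e in zip(actual, estimated)]


def empirical_cdf(values, threshold):
    """Fraction of samples whose value is at most `threshold`."""
    return sum(1 for v in values if v <= threshold) / len(values)
```

    Evaluating `empirical_cdf` over a range of thresholds gives the curve used to compare models: a model whose curve rises faster estimates a larger share of connections within a given relative error.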

    3.5 Base Line for Analysing Model Accuracy

    The TCP throughput estimation accuracy of the mathematical model defined by Cardwell et al. (2000) was primarily used as a base line to evaluate the performance of the neural network models developed in MATLAB. The same parameters used for the mathematical model were considered for the neural network models, in order to evaluate accuracy measurements under the same modelling conditions.

    The estimation accuracy of each of the mathematical and neural network models prior to any filtering of the TCP connections was also considered as a second baseline, to evaluate the change in estimation accuracy before and after applying filtering conditions to each model individually.

    3.6 Summary

    This chapter provided an overview of the principal stages of the project and the flow of data and feedback between the processing, modelling and analysis steps. Brief descriptions were given of the tools used for data acquisition, data pre-processing, and neural network modelling. Finally, a brief explanation was given of the different statistical analysis methods used to evaluate the performance of the models developed, and of how the mathematical model was used as a baseline for evaluating modelling performance.

  • 4 Data Pre-processing and Traffic Analysis

    This chapter aims to provide an overview of the sources of network traffic used during the research, and to explain the different stages of pre-processing performed on traces prior to any analysis and prior to modelling the neural network. The chapter also includes basic statistical analysis performed to understand the typical distribution of TCP parameter values, in order to anticipate any filtering criteria that may be further investigated when modelling the neural network.

    4.1 Types of Traffic

    TCP applications can be categorised into two major types producing two different trends of traffic data flow: interactive data flow, which is characterised by smaller segment sizes and bidirectional flow of data, and bulk data flow, which is characterised by large segment sizes, usually in one direction from a server to a client (Stevens, 1993). TCP is expected to deal with both kinds of traffic efficiently using the different algorithms summarised in Chapter 2. The percentage of each type of data flow may be expressed either as the number of packets exchanged on the Internet or as the size of these data flows in bytes. A study by Caceres et al. (1991) implied that interactive data flows are responsible for 25-45% of all packets in Internet traffic. However, in terms of network bytes, bulk transfers represent 90-95% of the overall traffic.

    In this research, no prior classification of TCP flows based on types of traffic has been made, aiming to develop a robust model applicable to all sorts of connections. Nevertheless, in later stages of the project, interactive data flows have been excluded from the training datasets in order to evaluate the contribution of these flows to the inaccuracy of throughput estimation for both the mathematical and neural network models.

    4.2 Extracting TCP Parameters

    All collected traffic traces have been processed using tcptrace in order to generate a complete dataset of records, each record containing information about a TCP connection as recognised by the source and destination IP addresses and ports. Among the many parameters produced by tcptrace, the parameters listed in Table 4.1, as defined by Ramadas (2003), were of particular interest for this research.

    Table 4.1: TCP parameters of interest as collected by tcptrace.

    TCP Parameter        Description
    SYN/FIN pkts sent    The count of all the packets with SYN/FIN flags set.
    actual data sent     The total bytes transmitted during the data transfer stage, including any retransmissions.
    total packets        The total number of packets, including packets exchanged during connection establishment and termination.
    RTT avg              The average value of RTT.
    avg segm size        The average segment size over the lifetime of a TCP connection.
    initial window pkts  The number of segments within the initial window advertised.
    max owin             The maximum amount of unacknowledged data, in bytes, observed during the connection lifetime. As the TCP congestion window at the sender side cannot be determined directly, it is estimated using the outstanding unacknowledged data.
    max segm size        The maximum segment size.
    triple dupacks       The total number of triple duplicate acknowledgements received by the sender. This number is usually used to represent the number of assumed lost segments over a TCP connection.
    avg retr time        The average retransmission time between successive transmission and retransmission of a segment.
    data xmit time       Total data transmission time, excluding time spent during connection establishment and termination.

    4.3 TCP Parameters Pre-processing

    The following stages of pre-processing have been applied to the collected traffic traces in order to obtain a set of variables that represents each TCP flow and can subsequently be used for neural network modelling.

    4.3.1 Identifying Valid TCP Flows

    The available datasets have been filtered to exclude all TCP flows that are either incomplete or have TCP parameters which would be considered invalid for modelling. The following filtering criteria have been applied.

    1. Exclude incomplete TCP connections by investigating the number of SYN and FIN segments exchanged. A complete TCP connection would normally include at least a single SYN/FIN set in each direction.

    2. Exclude TCP flows with no data transmitted in either direction.

    3. Exclude TCP flows for which the average RTT measured is null.

    4. Exclude TCP flows for which the initial sending window is null.
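
    The four criteria above can be sketched as a single predicate. The following Python sketch is illustrative only: the record layout and field names are hypothetical and do not correspond to tcptrace output column names, "null" is assumed to mean either a missing value or zero, and criterion 2 is read as excluding flows in which neither direction carried data.

```python
def is_valid_flow(flow):
    """Return True if a flow record passes the four filtering criteria.

    `flow` is a dict with hypothetical keys; a/b denote the two endpoints.
    """
    # 1. Complete connection: at least one SYN and one FIN in each direction.
    complete = all(
        flow[name] >= 1
        for name in ("syn_a2b", "syn_b2a", "fin_a2b", "fin_b2a")
    )
    # 2. Some data was transmitted (assumed reading: in at least one direction).
    has_data = flow["data_bytes_a2b"] > 0 or flow["data_bytes_b2a"] > 0
    # 3. Average RTT is not null (None or zero treated as null).
    has_rtt = bool(flow["rtt_avg_ms"])
    # 4. Initial sending window is not null (None or zero treated as null).
    has_initial_window = bool(flow["initial_window_bytes"])
    return complete and has_data and has_rtt and has_initial_window
```

    Applying such a predicate to every record in a dataset would yield the valid subset used in the subsequent pre-processing stages.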

    The total number of valid TCP flows after filtering out invalid and incomplete connections is listed in Table 4.2.

    Table 4.2: Number of valid TCP flows for both lossless and lossy subsets.

    Dataset              Total TCP Flows  Valid Flows  Valid Lossless Flows  Valid Lossy Flows
    UNIBS                79630            45944        45024                 920
    MAWI                 1820810          83733        76297                 7436
    Plymouth University  6355344          2471273      2413974               57299

    4.3.2 Selection of Forward Direction

    TCP connections are bidirectional, and according to RFC 3449 (Balakrishnan et al., 2002), the forward direction of a connection is characterised by more voluminous data flow. For server-client connections, this direction is usually from the server to the client. On the other hand, the reverse direction is characterised by less data being transmitted and is usually used for acknowledging data sent in the forward direction. At this stage of data pre-processing, all TCP connections have been processed to select the forward direction based on the amount of data transmitted in each direction. Only the forward direction has been considered for modelling.
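
    The selection rule can be sketched in a few lines of Python. The per-direction record layout and the key `data_bytes` are hypothetical, and the handling of ties (keeping the first direction) is an assumption, since the text does not state one.

```python
def select_forward_direction(a2b, b2a):
    """Return the per-direction record that carried more data bytes.

    Ties are resolved in favour of `a2b` (an assumption of this sketch).
    """
    return a2b if a2b["data_bytes"] >= b2a["data_bytes"] else b2a
```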

    4.3.3 Classification of Lossless and Lossy Flows

    The occurrence of segment losses on a network path is indicated once the TCP sender receives triple duplicate acknowledgements (Kurose and Ross, 2009), as demonstrated in Chapter 2. Valid TCP flows have been classified into two different subsets according to the number of triple duplicate ACKs sent in the reverse direction: a subset with lossless connections, and another subset with lossy connections. The segment loss rate for lossy connections has been evaluated as per Equation 4.1. Although this method of calculation may not be fully accurate, it has produced realistic results when introducing the calculated loss rate (p) to the mathematical model described by Cardwell et al. (2000).

    \[ \text{Loss rate } (p) = \frac{\text{Triple duplicate ACKs}}{\text{Total number of actual data segments}} \tag{4.1} \]
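
    A minimal Python sketch of the classification and of Equation 4.1 follows. The function names are hypothetical; the sketch assumes that a flow is treated as lossy when at least one triple duplicate ACK was observed.

```python
def classify_flow(triple_dupacks):
    """Label a flow as lossy if any triple duplicate ACK was observed."""
    return "lossy" if triple_dupacks > 0 else "lossless"


def loss_rate(triple_dupacks, data_segments):
    """Equation 4.1: triple duplicate ACKs over actual data segments."""
    if data_segments <= 0:
        raise ValueError("flow must contain at least one data segment")
    return triple_dupacks / data_segments
```

    For instance, a flow with 5 triple duplicate ACKs over 20 data segments would be classified as lossy with p = 0.25.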

    Figure 4.1: Percentages of both lossless and lossy TCP connections within the network traffic captured.

    4.3.4 Computing the Mathematical Throughput Estimate

    Cardwell's mathematical model, demonstrated in Chapter 2, has been used to evaluate the estimated throughput of each TCP connection. The accuracy of the mathematical model was considered as a base line to evaluate the performance and accuracy of the neural network model developed.

    4.3.5 Normalisation of TCP Parameters

    The actual values of TCP parameters as produced by tcptrace were found to vary over very large ranges. The values were also found to have different scales for each parameter. Hence it was essential to either normalise or standardise these values. Several standardisation techniques were tried, scaling the inputs and targets to values ranging from 0 to 1 according to the CDF of each TCP parameter. However, these techniques did not lead to much improvement. Normalising the TCP parameters using a natural logarithmic function was found to provide better accuracy figures and faster learning convergence when training the neural network.
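
    The logarithmic normalisation can be sketched as follows. This Python sketch is illustrative only: the function names are hypothetical, and the requirement that all values be strictly positive is an assumption of the sketch, since the treatment of zero-valued parameters is not described here.

```python
import math


def log_normalise(values):
    """Apply the natural logarithm to strictly positive parameter values."""
    if any(v <= 0 for v in values):
        raise ValueError("natural log is undefined for non-positive values")
    return [math.log(v) for v in values]


def log_denormalise(values):
    """Inverse transform, e.g. to map model outputs back to original units."""
    return [math.exp(v) for v in values]
```

    The transform compresses parameters spanning several orders of magnitude (for example, data sizes from hundreds of bytes to hundreds of megabytes) into a much narrower range.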

    4.4 Statistical Distribution of TCP Parameters

    4.4.1 Throughput

    The TCP throughput of traffic collected at both Brescia University and Plymouth University shows a similar and even distribution. However, the TCP throughput values of MAWI traffic were relatively lower, with a higher proportion of connections with throughput around 10 Bps, as shown in Figure 4.2.

    [Empirical CDF plot. Horizontal axis: Throughput (Bps, log scale). Vertical axis: % of connections. Series: MAWI Working Group Traffic Archive; Campus network of the University of Brescia; Campus network of Plymouth University.]

    Figure 4.2: Cumulative distribution of throughput.

    4.4.2 Data Transmitted

    The amount of actual data transmitted over connections at Plymouth University was very low by comparison, with a mean value of 329 bytes, compared to 245,257 bytes and 100,905 bytes at Brescia University and the MAWI group respectively. The CDF of data transmitted is shown in Figure 4.3.

    [Empirical CDF plot. Horizontal axis: Actual data transmitted (Bytes, log scale). Vertical axis: % of connections. Series: MAWI Working Group Traffic Archive; Campus network of the University of Brescia; Campus network of Plymouth University.]

    Figure 4.3: Cumulative distribution of data transmitted.

    4.4.3 Initial Window Size

    The initial window (IW) sizes in the three sources of data were generally observed to comply with the slow start algorithm. The mean values of IW are listed in Table 4.3. The majority of TCP flows were observed to use either two or three segments for the IW. However, some connections used an IW larger than four segments (sometimes reaching 12 segments), as shown in Figure 4.5. This contradicts the guidelines set in RFC 5681, which define a maximum limit of four segments for the IW if the MSS value is no greater than 1095 bytes (Allman et al., 2009). This implies that some TCP implementations do not strictly follow the RFC guidelines.
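
    The RFC 5681 upper bound on the IW depends on the sender maximum segment size (SMSS): at most four segments for an SMSS of up to 1095 bytes, three segments for an SMSS above 1095 and up to 2190 bytes, and two segments above 2190 bytes. A minimal Python sketch of this check is given below; the function names are hypothetical and the sketch is not part of the analysis performed in this research.

```python
def rfc5681_max_iw_segments(smss_bytes):
    """Upper bound on the initial window, in segments, per RFC 5681."""
    if smss_bytes > 2190:
        return 2
    if smss_bytes > 1095:
        return 3
    return 4


def exceeds_rfc5681_iw(iw_segments, smss_bytes):
    """True if an observed initial window is larger than the RFC 5681 bound."""
    return iw_segments > rfc5681_max_iw_segments(smss_bytes)
```

    Under this rule, a connection with an MSS of 1460 bytes and an observed IW of 12 segments would exceed the bound of three segments.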

    [Empirical CDF plot. Horizontal axis: Initial window bytes (log scale). Vertical axis: % of connections. Series: MAWI Working Group Traffic Archive; Campus network of the University of Brescia; Campus network of Plymouth University.]

    Figure 4.4: Cumulative distribution of initial window bytes.

    [Empirical CDF plot. Horizontal axis: Number of initial window packets (0 to 20). Vertical axis: % of connections. Series: MAWI Working Group Traffic Archive; Campus network of the University of Brescia; Campus network of Plymouth University.]

    Figure 4.5: Cumulative distribution of initial window packets.

    4.4.4 Maximum Segment Size

    Figure 4.6 shows that the majority of TCP flows in both the UNIBS and MAWI datasets had an MSS value of either 1430 bytes or 1460 bytes. The dataset from Plymouth University was particularly constrained, with an MSS of 1368 bytes. These observations were taken into consideration when filtering outliers at later stages of the research.

    [Empirical CDF plot. Horizontal axis: Maximum Segment Size (Bytes, 0 to 1500). Vertical axis: % of connections. Series: MAWI Working Group Traffic Archive; Campus network of the University of Brescia; Campus network of Plymouth University. Data tip: X = 1368, Y = 0.1813.]

    Figure 4.6: Cumulative distribution of MSS.

    Table 4.3: Mean values of TCP parameters evaluated for the three datasets used.

    TCP Parameter                  UNIBS      MAWI       Plymouth University
    MSS (bytes)                    968.76     1162.02    1283.86
    Maximum Idle Time (sec)        30.06      6.02       23.54
    Average RTT (msec)             92.01      229.29     38.17
    Throughput (Bps)               6528.35    17751.86   33596.33
    Initial Window Size (bytes)    1496.65    1922.08    2930.56
    Initial Window Size (packets)  1.69       1.81       2.66
    Actual Data Bytes (bytes)      245256.75  100905.68  329.48
    Transmission Time (sec)        33.97      6.24       13.73

    4.4.5 Data Transmission Time

    The data transmission time of TCP connections in the three datasets was found to be evenly distributed, without any outstanding observations, as shown in Figure 4.7.

    [Empirical CDF plot. Horizontal axis: Transmission time (seconds, log scale). Vertical axis: % of connections. Series: MAWI Working Group Traffic Archive; Campus network of the University of Brescia; Campus network of Plymouth University.]

    Figure 4.7: Cumulative distribution of data transmission time.

    4.4.6 Average RTT

    The highest average RTT values were observed in the MAWI dataset, with a mean value of 229 msec; the mean values of average RTT at Brescia University and Plymouth University were 92 msec and 38 msec respectively. Figure 4.8 shows that RTT values at Plymouth University were distributed towards lower values.

    [Empirical CDF plot. Horizontal axis: Average RTT (seconds, log scale). Vertical axis: % of connections. Series: MAWI Working Group Traffic Archive; Campus network of the University of Brescia; Campus network of Plymouth University.]

    Figure 4.8: Cumulative distribution of RTT.

    4.4.7 Maximum Idle Time

    As described in Section 2.2.5, there are seven conditions in which a TCP connection may encounter significant idle time, and the timer values as implemented may be substantially large. The only idle time figure reported by tcptrace is the maximum idle period in a TCP connection, which is only counted if it occurred during data transfer, i.e. neither during connection establishment nor after data transfer is complete while waiting for connection termination. Nevertheless, the maximum idle time still provides a good estimate or representation of the total idle time within TCP flows. Figure 4.9 shows the CDF of maximum idle time for the three sources of data. The highest mean value observed was 30.06 seconds at Brescia University, compared to 23.54 seconds at Plymouth University, and only 6.02 seconds for the MAWI dataset.

    [Empirical CDF plot. Horizontal axis: Maximum Idle Time (seconds, log scale). Vertical axis: % of connections. Series: MAWI Working Group Traffic Archive; Campus network of the University of Brescia; Campus network of Plymouth University.]

    Figure 4.9: Cumulative distribution of maximum idle time.

    Further statistical distribution analysis was performed on the TCP timing parameters (i.e. transmission time, average RTT, and maximum idle time) using box-and-whisker diagrams as shown in Figures 4.10, 4.11 and 4.12. The purpose of this analysis was to:

    Identify the 2nd percentile and the 98th percentile of each TCP parameter, and exclude statistical outliers from the dataset used for analysis and neural network training in later stages of the research.