
Deep Learning Website Fingerprinting Features

Vera Rimmer

Thesis submitted for the degree of Master of Science in Artificial Intelligence, option Engineering and Computer Science

Thesis supervisor: Prof. Claudia Diaz

Assessors: Prof. Frank Piessens

Güneş Acar

Mentors: Marc Juarez

Ero Balsa

Academic year 2016 – 2017


© Copyright KU Leuven

Without written permission of the thesis supervisor and the author it is forbidden to reproduce or adapt in any form or by any means any part of this publication. Requests for obtaining the right to reproduce or utilize parts of this publication should be addressed to the Departement Computerwetenschappen, Celestijnenlaan 200A bus 2402, B-3001 Heverlee, +32-16-327700 or by email [email protected].

A written permission of the thesis supervisor is also required to use the methods, products, schematics and programs described in this work for industrial or commercial use, and for submitting this publication in scientific contests.


Acknowledgements

I would like to express my utmost gratitude to my daily supervisor, Marc Juarez, for initiating this research and providing continuous guidance and support throughout the year. I am also thankful to my second supervisor, Ero Balsa, for his valuable advice and feedback on my work. I could not wish for better supervisors. I thank my promotor, Prof. Claudia Diaz, for giving me the fortunate opportunity to work on this thesis.

I am especially grateful to my scientific advisor at Peter the Great St. Petersburg Polytechnic University, Prof. Vladimir Platonov, for sparking my interest in artificial intelligence.

I am greatly indebted to Bert DeKnuydt from ESAT-VISICS for providing access to their infrastructure for running my experiments.

I would like to acknowledge Marc Juarez and his team for sharing their datasets and the traffic parser, and George Danezis for providing the code for the wavelet transform. I am also thankful to my fellow student and friend David Torrejon for introducing me to the Keras framework.

My sincere gratitude goes to my family and friends for their great support, and especially to my partner for continuous encouragement and infinite patience during this year.


Contents

Acknowledgements
Abstract
List of Figures and Tables
List of Abbreviations
1 Introduction
2 Related Work
  2.1 Website Fingerprinting
  2.2 Deep Learning
3 Background - Machine Learning
  3.1 Introduction
  3.2 Predictive Learning
  3.3 Artificial Neural Networks
  3.4 Deep Learning
4 System Model
  4.1 Anonymous Web Browsing
  4.2 Adversary Model
5 Evaluation
  5.1 Experimental Setup
  5.2 Experimental Results
  5.3 Illustration of Parameter Tuning
6 Discussion and Future Work
7 Conclusion
Bibliography


Abstract

Anonymity networks like Tor enable Internet users to browse the web anonymously. This helps citizens circumvent censorship by repressive governments, journalists communicate with anonymous sources, and regular users avoid online tracking. However, adversaries can try to identify anonymous users by deploying several attacks. One such attack is website fingerprinting. Website fingerprinting exploits the ability of an adversary to generate anonymized (encrypted) traffic to a number of websites and then compare it, based on traffic metadata such as packet timing, size and direction, to the traffic metadata of the users the adversary wishes to de-anonymize. When a user's traffic metadata matches the metadata the adversary had previously generated, the website the user visits may be revealed to the adversary, thus breaking the user's anonymity.

In prior works, authors have identified several features that allow an adversary to fingerprint the websites visited by a user. Examples of such features include packet length counts or the timing and volume of traffic bursts. These features were, however, manually identified, e.g., using heuristics, leaving open the question of whether there are more identifying features or methods to fingerprint the websites visited by anonymous users.

In this thesis we depart from prior work and design a website fingerprinting attack with automated feature extraction, that is, we do not manually select the identifying features to fingerprint the websites but rely on machine learning methods to do so. Specifically, we use deep learning techniques to learn the best fingerprinting features and demonstrate the viability of our attack by deploying it in a closed-world scenario of 100 webpages. Our results show that, with 71% website identification accuracy, adversaries can use machine learning methods to de-anonymize Tor traffic instead of having to rely on the manual selection of fingerprinting features. This is a first and promising step on a new avenue of website fingerprinting attacks.


List of Figures and Tables

List of Figures

3.1 Model of a neuron [28]
3.2 tanh, sigmoid and ReLU activation functions
3.3 Feedforward neural network with one hidden layer [28]
3.4 Autoencoder architecture [3]
3.5 Stacked autoencoder [3]
3.6 Stacked autoencoder as a classifier [3]

4.1 Tor traffic anonymization [17]
4.2 Website fingerprinting targeted attack scenario [17]
4.3 Histogram of traffic trace with time step 0.1 sec
4.4 Wavelet coefficients of a traffic instance, incoming (a) and outgoing (b) direction

5.1 Cross-evaluation of the SAE for the WF attack
5.2 Accuracy and loss (MSE) during training of autoencoder
5.3 Accuracy and loss (cross-entropy) during training of SAE (red for training set and blue for validation set), lr = 0.001
5.4 Accuracy and loss (cross-entropy) during training of SAE (red for training set and blue for validation set), lr = 0.0001
5.5 Learning process of the final SAE for the wavelet format of data (red for training set and blue for validation set), lr = 0.00001

List of Tables

3.1 Example of data in the attribute-value format
4.1 Traffic trace meta-data
5.1 Testing the classifier on data crawled two days after training
5.2 Performance metrics of the pre-trained autoencoders


List of Abbreviations

WF    Website fingerprinting
k-NN  k-Nearest Neighbor
TBB   Tor Browser Bundle
NN    Neural network
ANN   Artificial neural network
DNN   Deep neural network
DL    Deep learning
SAE   Stacked autoencoder
MSE   Mean squared error
SGD   Stochastic gradient descent
DWT   Discrete wavelet transform


Chapter 1

Introduction

Tor is a communications network that allows its users to surf the web without anyone knowing which websites they are visiting. To do this, Tor encrypts and routes users' connections through several relays distributed over the world. Currently, Tor is the go-to tool for journalists working in dangerous places, activists organizing popular resistance in repressive regimes and even the military. However, this is a tool in the making, one that needs to be further studied and constantly improved. Powerful adversaries such as governments and telecommunications operators may be capable of breaking the security that Tor offers.

One of the attacks that may compromise the privacy properties that Tor aims to provide to its users is website fingerprinting. Website fingerprinting exploits the fact that, even if Tor encrypts the content of communications, metadata such as the number of packets sent and received per connection, as well as the size and timing of those packets, can still be used to guess which websites a user is visiting. These features form a "website fingerprint" that conveys patterns unique to that website, hence the name of the attack.

Each of the website fingerprinting attacks proposed in the research literature uses a different set of features, ranging from simple ones, such as raw packet length counts, to more complex ones, such as traffic bursts. However, these features were proposed based on intuition and heuristic arguments on why they are supposed to identify a web page. Manual feature engineering is a laborious, time-consuming process requiring expert knowledge of the underlying HTTP, TCP and IP protocols.

The primary goal of this thesis is to investigate whether the website fingerprinting attack can be executed based on automatically extracted and selected features of encrypted and anonymized network traffic. For this we apply a novel technique to the anonymized traffic: a deep learning algorithm. We aim to see whether it is possible to de-anonymize Tor traffic, i.e., identify the page that generated such traffic, with a deep learning model.

We will apply an unsupervised deep learning technique to anonymized Tor traffic for automated feature extraction and selection, followed by supervised fine-tuning of the deep neural network for website classification. In addition, we will evaluate the deep neural network model that we trained by running the de-anonymization attack on a set of unlabeled traffic instances. Our results show that this approach can achieve a success rate comparable to previous works, which demonstrates that deep learning is a novel technique applicable to website fingerprinting that does not require the explicit definition of features.

Our study is structured as follows: in Chapter 2 we review prior research on website fingerprinting. Next, we discuss existing work on deep learning for the classification of network traffic. Chapter 3 gives a brief overview of the machine learning domain and explains the theory underlying the so-called deep learning techniques. We present our system model in Chapter 4, where we define the adversary model considered throughout this thesis and provide the necessary background on anonymous communications to make this text as self-contained as possible. We describe the methodology and experiments performed to evaluate our deep-learning-based attack and present our results in Chapter 5. We discuss the results, limitations and future challenges in Chapter 6. Chapter 7 finishes with the conclusion of this study.


Chapter 2

Related Work

In this chapter, we review prior work on website fingerprinting and related studies on network traffic classification.

2.1 Website Fingerprinting

The website fingerprinting attack against Tor was first deployed by Herrmann et al. [12]. They applied a classical text mining classifier, the multinomial Naïve Bayes, to a fixed set of 775 webpages, which corresponds to a closed-world setting: the assumption that the attacker can train on all the webpages the target user can visit. The classifier operated on the frequency distribution of IP packet sizes, while omitting information regarding packet order and timing. They normalized the packet size frequency vectors in order to use them for classification. This technique allowed the authors to achieve high recognition rates on other investigated privacy-enhancing tools, but only 2.95% accuracy on Tor. Herrmann et al. attribute this low rate to a suboptimal configuration of the attack.

Panchenko et al. proposed new features for an attack that achieved greater accuracy than Herrmann et al.'s under the same conditions and using the same dataset [23]. They applied a support vector machine (SVM) classifier with features based on volume, time, and direction of the traffic. The authors analyzed each single traffic trace to refine the feature set. This procedure allowed them to increase the accuracy of the attack from 2.95% to 55%. This study was also the first to evaluate an open-world scenario, i.e., the identification of certain websites within a large set of unknown websites. They performed the first successful attack in the open world, which is a more complex and realistic setting than a closed world of a fixed number of websites. In this study we have focused on the closed-world setting for the sake of comparison with previous evaluations.

The SVM classifier was used again by Cai et al., who proposed a new attack based on a new representation of the classification instances [7]: each traffic trace was represented as a string of 1s and -1s, each element representing a cell in one or the other direction. Their SVM used the Damerau-Levenshtein edit distance and the SVM kernel trick to pre-compute the distances between the traces. This classifier achieved 88% accuracy in a closed world of 100 webpages.

This attack was further improved by Wang and Goldberg [30], who also proposed a new methodology for data collection and analysis specially suited for Tor. On the same dataset of 100 webpages, their attack attained 90% accuracy. They suggested that their results could be improved by more sophisticated feature extraction methods. Indeed, a year later Wang et al. proposed an attack based on a k-Nearest Neighbor (k-NN) classifier applied on a large feature set with weight adjustment [29]. With this new approach they improved the accuracy of the attack on the same set of 100 webpages and reduced the time needed for training the classifier from hundreds of hours to several seconds.

Hayes et al. used yet another novel feature extraction and selection method: random forests to extract robust fingerprints of webpages [11]. They conducted a systematic analysis of every single feature extracted from the random forest and measured its relevance for the classification problem. This thorough analysis not only allowed them to increase the success rate of the website fingerprinting attack to 91% and reduce the time needed for training, but also led them to conclude that simple features tend to be more distinctive than complex features.

Finally, the state-of-the-art attack was proposed and evaluated by Panchenko et al. [22]. The features in this work were extracted based on packet size, direction, and ordering; the extraction was performed manually at three different layers of the traffic: the application layer, the transport layer and the Tor cell layer.

They followed Wang et al.'s idea of using a k-NN classifier on features selected based on several distance metrics and their importance measure [29]. They derived the important features based on the chosen metrics and then combined them with previously proposed features.

Their approach outperformed all previous methods in terms of classification accuracy and computational efficiency; however, it showed that the website fingerprinting attack still required thorough manual work for feature extraction and selection to ensure high success rates.

2.2 Deep Learning

To the best of our knowledge, there is only one paper that has studied the effectiveness of deep learning for encrypted traffic classification [31]. Instead of identifying webpages, the adversary's objective there is to attribute a protocol from the network stack to an encrypted traffic trace. In addition to the existing automated classification approaches based on statistical features and traditional machine learning, the authors suggest a novel method for traffic recognition based on deep learning.

Given that traffic identification is performed based on specific traffic features, the extraction and selection of these features is one of the most significant difficulties, which makes this problem similar to the website fingerprinting problem. We relate to this work because we also deal with web traffic identification in our study.


The authors rely on deep learning to perform automatic feature extraction and selection. They represent traffic traces simply as sequences of bytes and classify these instances using a deep neural network. This technique allows them to achieve a recognition rate of more than 90%, without any manual work on feature extraction and selection. This is a promising result and encourages the use of deep learning techniques for automated extraction of fingerprinting features.


Chapter 3

Background - Machine Learning

In this chapter we provide the theoretical foundation of deep learning. We start by giving a general introduction to machine learning and continue with a deeper overview of artificial neural networks. Lastly, we introduce deep learning and give important theoretical insights that explain our motivation to apply deep neural networks to the website fingerprinting problem.

3.1 Introduction

Machine learning is a "process that takes certain kind of knowledge as input, and produces another kind of knowledge as output" [6]. A machine learning system that manages to infer new knowledge from some previously given knowledge demonstrates the ability to learn. In Blockeel's words, the system can learn if "it has the capacity to improve its own performance at solving certain problems after receiving additional information about the problem" [6]. The additional information often consists of past observations about a given phenomenon that can be used as evidence and experience.

In practice, machine learning algorithms perform automated data analysis to find hidden patterns and correlations in data. The choice of algorithms for this task largely depends on the application, the types of data and the format of the requested output knowledge.

3.2 Predictive Learning

A common application of machine learning is predictive learning. The goal of predictive learning is to make predictions about future events based on previous observations. A predictive model is a function mapping instances from an input space to an output space. The process of learning this function is called training; the instances from the input space used for training form the training set. If the machine learning system learns a function that makes an accurate prediction for any new instance (one that was not present in the training set), the system has learned a generally applicable prediction model.


Most often, the input data has an attribute-value format, i.e., the training set T contains elements u from the instance space U, and U is the Cartesian product of a number of attribute domains A such that U = A_1 × A_2 × … × A_D. An attribute A_i is a function mapping the instance u onto one of its characteristics, called features. The values of the attributes for one instance form the feature vector of this instance. Data in attribute-value learning is given in tabular format: each row represents an instance, each column represents an attribute of the instances, and the entries of the table are the corresponding features, or attribute values, of each instance. An example is given in Tab. 3.1, for 3 instances and 5 attributes.

OUTLOOK    TEMPERATURE  HUMIDITY  WINDY  PLAY
sunny      23           85        false  yes
rain       20           90        true   no
overcast   19           90        false  no

Table 3.1: Example of data in the attribute-value format.

In the attribute-value setting, learning a predictive model means finding a function that, for a given instance, returns the value of its specific unknown target attribute based on the values of the other, known attributes. For a target attribute A_D, a predictive model is a mapping:

f : A_1 × A_2 × … × A_{D−1} → A_D

where A_1 × A_2 × … × A_{D−1} and A_D represent the input and output space of the model, respectively.

For instance, for the data in Tab. 3.1, the target attribute is the attribute PLAY. Its value has to be predicted based on the known features: the values of the other 4 attributes.
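To make the attribute-value format concrete, Tab. 3.1 can be encoded directly in Python; the representation below is an illustrative sketch, not code from the thesis:

    # Tab. 3.1 in attribute-value form: one row per instance, one column per
    # attribute; PLAY is the target attribute A_D.
    instances = [
        # OUTLOOK,    TEMPERATURE, HUMIDITY, WINDY, PLAY
        ("sunny",     23,          85,       False, "yes"),
        ("rain",      20,          90,       True,  "no"),
        ("overcast",  19,          90,       False, "no"),
    ]
    X = [row[:-1] for row in instances]  # feature vectors (known attributes)
    y = [row[-1] for row in instances]   # target values to be predicted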

Classification is a predictive learning process where the target attribute is a categorical variable that takes its values from a set of categories or labels. If this set consists of more than two possible values, the process is called multiclass classification.

Regression is a predictive learning process where the predicted attribute is numerical.

Depending on the information available as input to the machine learning system,we can distinguish different types of machine learning tasks.

Supervised learning takes as input a set of observations of the form (x, y), namely, the feature vector x and the (known) value of the target attribute y as a function of the feature vector, y = f(x). Supervised learning tries to find the function f mapping the input values x to the output target attribute y.

Conversely, unsupervised learning does not take the y values of the target attributes as input. Without knowing the correct values for the target attributes of the training instances, the system does not receive any feedback based on the predicted results. The goal of unsupervised learning is therefore to search for some hidden structure or patterns in the given unlabelled data x. Unsupervised learning belongs to the exploratory type of data analysis, which is often used as a preparatory step before performing supervised learning on the found patterns.


3.2.1 Generalization and Overfitting

The ability of a machine learning system to learn is defined by its generalizing capabilities. Generalization refers to a model's ability to describe unseen data. The model must fit the training set in such a way that it does not just "memorize" the instances but truly learns the underlying relationships in the data, thus generalizing also to other data that follows the same distribution as the training data.

The concept of generalization is strongly related to overfitting. A model that does not generalize well to new data usually overfits the training data: it learns the noise in the training set rather than the underlying correlations in the data. As a result, it demonstrates poor performance on newly presented data.

A good learning model is one that avoids overfitting to the training set during training and hence generalizes well to new data.

3.3 Artificial Neural Networks

An artificial neural network (ANN) is a modeling tool inspired by the human neural system. An ANN imitates the structure and the learning process of a biological neural network; however, it represents a strong mathematical abstraction of reality. It is used to estimate (or approximate) an unknown function that depends on a number of inputs. A machine learning system uses an ANN to learn a predictive model.

An ANN can be characterized by a number of parameters.

Architecture First of all, an ANN is specified by its architecture. The building component of the network is the neuron, depicted in Fig. 3.1 and modeled as a static non-linear element. Neurons in the network are connected with each other, each connection i having an interconnection weight w_i, a numerical parameter. Neurons are ordered in layers. The topological relationships between the network's neurons define the neural net architecture. Fig. 3.3 depicts an example of a fully-connected neural network with three layers of neurons.

Figure 3.1: Model of a neuron [28]


Activation function An activation function (or activation) is a function that is set on every neuron of the network. Each neuron computes the sum a of its input values x weighted by the interconnection weights and a bias term or threshold b. Then the neuron applies an activation f to the sum a and outputs the result y. The neuron in Fig. 3.1 applies an activation function f to its activation a and yields the output y. To imitate the trigger of a biological neuron when the collected incoming information exceeds a certain threshold, the activation function is usually chosen with the saturation type of non-linearity [28]. For instance, tanh and sigmoid can serve as non-linear activation functions on neurons, with output values in the ranges [−1, 1] and [0, 1] respectively, and with their saturation moment around zero. Another activation function suggested in the neuroscience literature as more biologically plausible is the rectifier [10], e.g., the function f(x) = max(0, x), also referred to as the rectified linear unit (ReLU). ReLU is the most widely used activation function for deep neural networks [20]. The three described activation functions are shown in Fig. 3.2.

Figure 3.2: tanh, sigmoid and ReLU activation functions
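As a concrete aid, the three activation functions of Fig. 3.2 can be written directly in NumPy; this is an illustrative sketch, not code from the thesis:

    import numpy as np

    def tanh(a):
        return np.tanh(a)                  # saturates, output in [-1, 1]

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))    # saturates, output in [0, 1]

    def relu(a):
        return np.maximum(0.0, a)          # rectifier: f(x) = max(0, x)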

Learning rule In addition to the architecture and the activation functions, a neural network is characterized by its learning rule (or learning algorithm), which defines the network's learning behavior. The neural network learns by processing the input data on every neuron (as described before) and changing its interconnection weights according to a specified learning algorithm. The result of learning is the final values of the weights. The choice of this learning algorithm is dictated by the application of the predictive model, the format of the input and desired output data, and the overall structure of the network.

Research on ANNs has proved that by parameterizing non-linear functions andnon-linear model structures, "we obtain universal tools for non-linear modeling" [28].


However, there is no universal algorithm for determining either the network architecture or the values of its interconnection weights for a given non-linear function to estimate. Hence, the process of learning a function with a neural network is generally practice-driven, i.e., the choice of the learning rule for a certain neural network is determined by its architecture, application, input and output data, etc.

3.3.1 Feedforward Neural Networks

A standard neural network is the multilayer perceptron, also called a feedforward neural network. A feedforward neural network consists of an input layer, one or more hidden layers and an output layer, each layer with an arbitrary number of neurons. In this type of architecture the connections between neurons never form a loop; they always go in one direction, feeding the layer's output as input to the following layer, hence the name "feedforward". An example of a feedforward neural net with one hidden layer is given in Fig. 3.3. Formally, such a network can be described as [cite ANN]:

y = W · σ(V · x + β)

where x is an input feature vector, y is the output vector, W and V are the interconnection weights of the output and hidden layers, respectively, σ is an activation function and β is a bias vector.

Figure 3.3: Feedforward neural network with one hidden layer [28]
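The formula above maps directly onto code. The following NumPy sketch computes y = W · σ(V · x + β); the dimensions are assumed for illustration:

    import numpy as np

    def forward(x, V, beta, W, sigma=np.tanh):
        h = sigma(V.dot(x) + beta)  # hidden-layer activations sigma(V.x + beta)
        return W.dot(h)             # output layer: y = W.h

    # Example with assumed dimensions: 4 inputs, 3 hidden neurons, 2 outputs.
    x = np.random.rand(4)
    V, beta, W = np.random.rand(3, 4), np.random.rand(3), np.random.rand(2, 3)
    y = forward(x, V, beta, W)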

Training Such a network, with continuous non-linear activation functions set on its neurons, is capable of approximating any continuous non-linear function. In order to learn a function, the feedforward network has to be trained. The training algorithm for feedforward NNs is backpropagation [26]. Backpropagation is a supervised learning process based on propagating the prediction error at the output back to the input layer and changing the interconnection weights. The prediction error is computed according to the chosen cost function, i.e., the objective function which has to be optimized by the neural network. A traditional choice for feedforward NNs is the mean squared error (MSE) computed on the training set of N instances, namely, the sum of squared differences between the predicted output y and the desired output, i.e.,

MSE = (1/2) ∑_{i=1}^N (y_i^desired − y_i)²

A different cost function, the cross-entropy (defined in Section 3.4.3), is normally used for classification problems, where the output values are the probabilities of the instance being classified as each class. These probabilities sum up to 1. As can be seen from the formula of E in Section 3.4.3, the cross-entropy decreases when one value of the output vector stands out from all the others and increases when all the output values are close to each other. For classification, the class of the instance corresponds to the output value that stands out from the other values, which calls for minimizing the cross-entropy.

The goal of the backpropagation process is then to achieve the desired outputs by adjusting the interconnection weights so as to minimize the chosen cost function. This optimization process is usually performed iteratively, starting with arbitrarily initialized weights.

The backpropagation learning process can be generally described as follows:

1. Forward-propagate the input training instances through the neural network and generate the output.

2. Backpropagate the output through the neural network using the target output and compute the differences between the desired and generated output values for all neurons of the network.

3. Compute the gradient (derivative) of the weights by multiplying the differences by the input values.

4. Adjust the weights by subtracting the weight gradients scaled by a factor called the learning rate.

This classic algorithm is called stochastic gradient descent (SGD). The process can be stopped whenever the error measure is sufficiently minimized, its output being the corresponding set of interconnection weights.
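A minimal NumPy sketch of one SGD step for the one-hidden-layer network above, minimizing the MSE cost defined above; the tanh activation, shapes and learning rate are illustrative assumptions:

    import numpy as np

    def sgd_step(x, y_desired, V, beta, W, lr=0.01):
        # 1. forward-propagate the instance and generate the output
        a = V.dot(x) + beta
        h = np.tanh(a)
        y = W.dot(h)
        # 2. difference between generated and desired output
        delta_out = y - y_desired
        # 3. gradients: backpropagated differences multiplied by the inputs
        grad_W = np.outer(delta_out, h)
        delta_hid = W.T.dot(delta_out) * (1.0 - h ** 2)  # tanh derivative
        grad_V = np.outer(delta_hid, x)
        grad_beta = delta_hid
        # 4. adjust the weights by the gradients scaled by the learning rate
        W -= lr * grad_W
        V -= lr * grad_V
        beta -= lr * grad_beta
        return V, beta, W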

The most influential parameter of backpropagation is the learning rate η used in step 4. The learning rate defines the speed and the quality of learning. The greater its value, the faster the interconnection weights change, which in turn means that the neural network learns faster. However, the lower the value of the learning rate, the more accurate the learning process becomes.

Performing one epoch of training means processing all of the input instances once and adjusting the weights. Backpropagation can be done in batch mode, whereby many propagations occur before the epoch finishes and the weights are finally adjusted. The error is then accumulated over the instances in batches of a predefined size, so that the average error of the batch is minimized. The size of the batch is then another hyperparameter of learning; it is mostly a power of two (and depends on the computer memory available for training the NN). The error for every batch is computed in parallel, hence increasing the batch size speeds up the learning process.

Neural networks can be trained with different versions of optimization algorithms, which normally represent various modifications of SGD. One of the most effective algorithms is the RMSProp learning rule. Introduced by Hinton [13], RMSProp applies some modifications that improve SGD performance. RMSProp is aimed at solving a problem of SGD, namely, the fact that "the magnitude gradient can be very different for different weights and can change during learning", which "makes it hard to choose a global learning rate". RMSProp modifies the algorithm by keeping a running average of the recent gradient magnitudes for every weight and dividing the learning rate for a weight by this average. This operation is intended to normalize the gradient magnitudes across different interconnection weights.
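The RMSProp update can be sketched for a single weight array as follows; the decay rate and epsilon are conventional values, not taken from the thesis:

    import numpy as np

    def rmsprop_step(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
        # running average of recent squared gradient magnitudes per weight
        cache = decay * cache + (1.0 - decay) * grad ** 2
        # divide the step for each weight by the root of its average magnitude
        w = w - lr * grad / (np.sqrt(cache) + eps)
        return w, cache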

Validation Validation is a part of the training process. It is used during training of the neural network for the selection of learning parameters and to avoid overfitting. After every weight adjustment, the network also computes the cost function on the validation set, a part of the data used only for validation and not overlapping with the training set. The NN's performance on the validation set will typically be lower than its performance on the training set, but the two should be as close as possible to each other in order to avoid overfitting and to provide good generalization of the resulting predictive model to new data.

3.4 Deep Learning

A deep neural network (DNN) is an ANN that has multiple hidden layers. DNNs have a hierarchical architecture consisting of multiple layers of complex non-linear information transformations. The complexity grows with the number of hidden layers.

Deep learning (DL) is the discipline that emerged to study the use of DNNs for various applications. The structure and learning behavior of DNNs distinguishes DL techniques from the shallow architectures of regular neural networks used in machine learning. Specifically, DL technologies are based on learning representations of data by means of effective algorithms for feature learning and feature extraction. One of the significant advantages of DL is the ability to learn feature representations from unlabeled data, in contrast to the severely limited ability of supervised learning to do so.

3.4.1 Autoencoder

An autoencoder is an example of an unsupervised learning algorithm used for building DNNs. The autoencoder was introduced by Hinton in 2006 as a tool to reduce the dimensionality of data [14].

The structure of an autoencoder is depicted in Fig. 3.4. It is a feedforward neural network with an input layer, an output layer and one or more hidden layers in between. The input and output layers are of the same size, which serves the main idea of the autoencoder, i.e., the reconstruction of its own input as output. More specifically, an autoencoder learns a function f which maps its input x to an output that approximates x. The learned function f is not the identity function but an approximation, which first compresses the input data to a compressed representation h and then reconstructs x from h at the output. The level of compression is controlled by limiting the number of hidden units. The left part of the autoencoder in Fig. 3.4 is called the encoder and its right part is called the decoder.

Figure 3.4: Autoencoder architecture[3]

An autoencoder aims at learning the most resembling, discriminative compression of the data, not simply for dimensionality reduction, but in order to learn the underlying generative model of the input data. The very purpose of using a DL algorithm based on autoencoders is to discover correlations in the input data. It is known that capturing these complex representations requires massive amounts of input data, which is a very strong requirement of deep learning algorithms [21].

An autoencoder hence aims to minimize the following function to ensure the best compression of the data:

‖input − decoder(code)‖² + sparsity(code)

The learning algorithm of an autoencoder uses backpropagation as follows. For every input, a feed-forward pass is done to compute the activations on the hidden layers and obtain an output at the last layer. The deviation of the output from the input is then measured as a squared error and backpropagated through the network with simultaneous weight updates. The data compression learned after successful training of the autoencoder contains the learned features.
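A minimal single-hidden-layer autoencoder can be sketched in Keras (the framework used in this thesis); the layer sizes and training settings below are illustrative assumptions, not the thesis's configuration:

    from keras.models import Sequential
    from keras.layers import Dense

    n_in = 86       # e.g., the histogram length Nh from Chapter 4
    n_hidden = 32   # size of the compressed representation h (assumed)

    autoencoder = Sequential()
    autoencoder.add(Dense(n_hidden, activation='sigmoid', input_dim=n_in))  # encoder
    autoencoder.add(Dense(n_in, activation='sigmoid'))                      # decoder
    autoencoder.compile(optimizer='rmsprop', loss='mse')
    # X is a matrix of rescaled traffic instances; the target equals the input:
    # autoencoder.fit(X, X, epochs=50, batch_size=128)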


3.4.2 Stacked Autoencoder

Setting multiple hidden layers in a neural network is done in order to learn more complex, implicit features of the input data. However, when training neural networks with many hidden layers, the final result is likely to be poor [15]. This happens due to the backpropagation process, which can force the network to get stuck in a non-optimal local minimum. To solve this, a greedy layer-wise unsupervised learning algorithm was proposed by Hinton et al. [15], whereby the DNN is trained layer by layer instead of training the whole network at once. During this layer-wise training, more and more useful information about the data structure is extracted with every step. Experiments conducted by Bengio et al. [5] showed that greedy layer-wise unsupervised learning achieves better generalization and better overall results by initializing the weights in a region near a good local minimum. Erhan et al. [8] have also demonstrated the positive influence of unsupervised learning, i.e., it makes it possible to obtain better generalization from the training dataset. This process of greedy layer-wise training is a pre-training or initialization step, i.e., it defines the initial set of weights of the DNN.

Stacking several autoencoders on top of each other and training them one by one is an example of building a greedily trained DNN. Such a DNN is called a stacked autoencoder (SAE). The architecture of a SAE is depicted in Fig. 3.5.

Figure 3.5: Stacked autoencoder [3]

The network consists of multiple layers of autoencoders with one hidden layer each, where the output of each autoencoder is fed as input to the successive layer. The decoding parts of the autoencoders are discarded, as only the compressed representations (the intermediate learned features) of the input data are needed for the further learning process. The first layer receives raw data as input, while the last layer outputs the resulting representation of the data. This representation contains the final features learned by the DNN.
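This greedy layer-wise procedure can be sketched in Keras as follows: each autoencoder is trained on the output of the previous encoder and then stripped of its decoder. The layer sizes, epochs and batch size are assumptions for illustration:

    from keras.models import Sequential
    from keras.layers import Dense

    def pretrain_layer(X, n_hidden):
        # train a one-hidden-layer autoencoder on the current representation X
        ae = Sequential()
        ae.add(Dense(n_hidden, activation='sigmoid', input_dim=X.shape[1]))
        ae.add(Dense(X.shape[1], activation='sigmoid'))
        ae.compile(optimizer='rmsprop', loss='mse')
        ae.fit(X, X, epochs=30, batch_size=128, verbose=0)
        # keep only the encoding half and its learned weights
        encoder = Sequential()
        encoder.add(Dense(n_hidden, activation='sigmoid', input_dim=X.shape[1]))
        encoder.layers[0].set_weights(ae.layers[0].get_weights())
        return encoder, encoder.predict(X)

    # Hypothetical stack of three autoencoders on input data X:
    # encoders, H = [], X
    # for n in (500, 250, 100):
    #     enc, H = pretrain_layer(H, n)
    #     encoders.append(enc)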

As shown by Bengio et al. [4], stacking autoencoders generally performs as well as or better than stacking restricted Boltzmann machines (an alternative component for constructing DNNs) for unsupervised pre-training.

3.4.3 Classification with Stacked Autoencoder

A SAE, composed of greedily pre-trained autoencoders, can be used for a classification problem. For this, the DNN has to be completed with a final classification layer, with the number of neurons corresponding to the desired number of classes and a softmax activation function on them. The new structure is depicted in Fig. 3.6, for classification into three classes. The softmax classifier outputs for every class a fractional value between 0 and 1, which corresponds to the probability that the input instance belongs to that class. The output probabilities sum up to 1. Since the softmax layer has a numerical output, it effectively performs regression analysis. To turn the regression into a classification problem and classify the input instance into one of the labels, it is necessary to select the neuron with the highest probability.

Figure 3.6: Stacked autoencoder as a classifier [3]

The training process of the SAE is called supervised fine-tuning: performing standard backpropagation learning of the DNN, exploiting the knowledge of the desired outputs (or labels, in the case of a classification problem).

To learn to classify the input, the SAE has to optimize the cross-entropy function during training. This measure does not reflect the difference between the predicted and the desired output (as the MSE does), but characterizes the predicted output. For the training set of N instances and predicted outputs y, the cross-entropy E is calculated as:

E = −(1/N) ∑_{i=1}^N ln(y_i)

where y_i is the predicted probability for the correct class of instance i.

Choosing the cross-entropy as the function to minimize encourages the neural network to ensure only one high probability among all the output neurons, which increases the certainty of the classifier's prediction. After every training epoch, the classification errors are backpropagated through the network to trigger the weight updates.
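Putting the pieces together, the fine-tuning stage can be sketched in Keras as follows; the hidden-layer sizes, optimizer and training settings are assumptions, with the closed world of 100 webpages as the number of classes:

    from keras.models import Sequential
    from keras.layers import Dense

    n_in, n_classes = 86, 100   # feature length and a closed world of 100 pages

    sae = Sequential()
    sae.add(Dense(500, activation='sigmoid', input_dim=n_in))
    sae.add(Dense(250, activation='sigmoid'))
    sae.add(Dense(100, activation='sigmoid'))
    sae.add(Dense(n_classes, activation='softmax'))  # final classification layer
    # initialize the hidden layers from the greedily pre-trained encoders, e.g.:
    # for layer, enc in zip(sae.layers, encoders):
    #     layer.set_weights(enc.layers[0].get_weights())
    sae.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                metrics=['accuracy'])
    # supervised fine-tuning with one-hot labels Y and a validation set:
    # sae.fit(X_train, Y_train, validation_data=(X_val, Y_val),
    #         epochs=100, batch_size=128)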


Chapter 4

System Model

In this chapter, we define the website fingerprinting attack and describe our networkand adversary models.

4.1 Anonymous Web Browsing

The Onion Router (Tor) is an anonymous communication network [19]. We chose Tor for our study because it is the most widely used Internet privacy tool. Tor creates a decentralized anonymization network which disguises the identity of the user and encrypts the communication traffic.

Tor uses a layered encryption technique for anonymous communication: onion routing. A computer network that supports onion routing has all messages encrypted in several layers. The encrypted data includes not only the content of the message, but also its source and destination addresses, which provides anonymization of the network traffic for the user. The encrypted data is transmitted through a Tor circuit, a series of randomly selected connected Tor nodes named onion routers or relays. On each onion router one layer of the encrypted message gets decrypted, revealing the next destination. With this kind of routing, each onion router only knows the previous location and the next destination, so that only the first Tor node knows where the message originated and only the last node knows the final destination of the message. Thus, at every point of data transmission within the Tor circuit, the sender and the receiver of the message are never known at the same time.

Fig. 4.1 shows the principles of onion routing. The user accesses webpages on the Internet from the Tor Browser Bundle (TBB), the browser that ships with the Tor software. The TBB chooses the series of onion routers and encrypts the web traffic generated by the user in the corresponding number of layers. Then the TBB sends the traffic into the Tor circuit. The first onion router in the circuit that receives the traffic is called the entry guard. To the entry guard the source address of the message is visible and the user is known. The entry guard decrypts the outermost layer of the message, reveals the address of the next onion router and sends the traffic further to that node of the circuit. On that onion router the next encryption layer is decrypted and the traffic is transmitted to the next node. In Fig. 4.1 the third node is the final onion router. When the traffic arrives at the final node in the circuit, the innermost encryption layer is decrypted, revealing the final destination of the traffic. The node then sends the traffic to the webpage requested by the user, who remains anonymous to this webpage.

Figure 4.1: Tor traffic anonymization [17]

We assume that the user in our model browses the Internet one webpage at a time, that is, the user never attempts to visit two webpages simultaneously. Neither does the user generate any additional traffic via web applications other than the Tor browser. Moreover, the user may only visit a closed world of monitored webpages, i.e., the set of pages the user visits is limited (of size n) and known.

4.2 Adversary Model

We consider an adversary whose goal is to perform a website fingerprinting (WF) attack against Tor: to identify the webpages that a given user is anonymously browsing.

We consider the adversary to be local, i.e., able to observe the traffic between the Tor user and the Tor entry guard, as shown in Fig. 4.2, and passive, i.e., the adversary is an eavesdropper who does not alter or seek to sabotage the user's Tor connection (they do not drop, add or modify the connection's packets).

Figure 4.2: Website fingerprinting targeted attack scenario [17]

We assume that the adversary is able to determine the beginning and end of a user session on a given webpage, that is, the adversary can determine when the user starts and stops browsing a given webpage. In addition to this, we assume that the adversary is always able to replicate the network and browser conditions under which the user browses the Internet, such as the operating system and the TBB version.

We assume that the adversary is unable to decrypt the content and addressing information of the packets in the user's traffic, so that all the information collected by the adversary is limited to the length, direction and timing of the packets, i.e., the meta-data of the Tor connection.

Note that this adversary model is consistent with the prior work discussed in Chapter 2, enabling us to compare our evaluation with previous findings.

4.2.1 Attack strategy

We consider that the adversary relies on a deep learning (DL) predictive model to identify the webpages visited by the user. Indeed, the WF problem can be modeled as a multiclass classification problem (see Chapter 3). If every record of traffic corresponding to one visit to a webpage is a traffic instance, then, given a collection of these traffic traces obtained from monitoring a user, the adversary attempts to determine to which classes, or labels (i.e., webpages), these instances belong. Such a multiclass classification problem can be solved with a neural network. In order to solve it with a neural network that is capable of retrieving the most salient features of a traffic trace for its identification, a deep neural network (DNN) can be used. To this end, we consider an adversary that uses a stacked autoencoder (SAE), one of the types of DNNs introduced in Chapter 3.

The adversary relies on the SAE to perform the extraction and selection of the salient features of an input instance (a traffic trace) needed to correctly identify the corresponding webpage. To classify the instance, the SAE learns the set of features that are the most useful for identification; we will call these features a fingerprint.

The adversary’s WF strategy is as follows:

Preparatory phase

1. Collecting traffic

2. Pre-processing collected data

3. Building a predictive model (SAE)

4. Pre-training the SAE

5. Training the SAE with validation

6. Testing the SAE

De-anonymization attack

7. Capturing traffic

8. Pre-processing captured data


9. Classification via the SAE

10. Evaluation of classification results

Collecting traffic

First, the adversary collects as many web traffic traces as possible by visiting the monitored websites from a closed world of size n. The adversary collects these in the same manner in which they expect to capture them from the target user, i.e., they use Tor to generate anonymized web traffic. Because in this first preparatory phase it is the adversary themselves who generates the traces, they know the destinations (webpages) these traces belong to.

The websites have to be crawled in several full iterations in order to avoid dependency on dynamic content not specific to website identification (e.g., advertisements). For this purpose, the adversary collects data in Nb iterations or batches, hours apart from each other. Within one batch, the adversary visits every URL in the list Nv times and records the traffic traces of these visits. Thus, after Nb batches with Nv visits to every URL in every batch, the adversary obtains Nb × Nv traffic traces for each URL. So in the end, one crawl of n webpages contains D = Nb × Nv × n traffic instances.

Pre-processing collected data

The adversary retrieves all the available meta-data from the raw traffic traces of the crawl collected in the previous step. Every collected (and captured) packet is represented in the following format:

<timestamp> <direction> <length>

where each packet has a Unix timestamp, a direction (incoming or outgoing) and a length in bytes. Every traffic trace is then represented by a sequence of P such packets, with Pin incoming and Pout outgoing packets.

An example of a collected trace is given in Tab. 4.1, with P = Pin + Pout = 14 + 11 = 25 packets. Normally, the time for loading a webpage and the number of sent packets vary between different URLs. Moreover, the connection time and the number of packets vary between different visits to the same URL. Hence, with every visit the adversary records a different number of packets; consequently, the length P of the traffic traces varies strongly between visits and webpages. Thus, every crawl is a collection of D instances of variable length.
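A hypothetical parser for traces stored in this format, one "<timestamp> <direction> <length>" record per line; the file layout and the ±1 direction encoding are assumptions, not the thesis's parser:

    def load_trace(path):
        """Read one traffic trace as a list of (timestamp, direction, length)."""
        trace = []
        with open(path) as f:
            for line in f:
                ts, direction, length = line.split()
                trace.append((float(ts), int(direction), int(length)))
        return trace

    # trace = load_trace("crawl/webpage_042/visit_003.txt")  # hypothetical path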

The adversary will have to apply the SAE classifier to every instance in the crawl. In order to use the SAE, the adversary has to encode the collected traffic using a suitable representation that can be correctly interpreted and processed by a DNN. As explained in Chapter 3, any neural network, including the SAE, requires input vectors of a fixed length. The adversary cannot feed the traces into the classifier in their initial format because they are all of different lengths. In order to overcome this issue, the adversary has to normalize the size of the traffic instances first.


Table 4.1: Traffic trace meta-data

Transforming to histograms The adversary treats the traffic instance as two time series: the sequence of incoming packets of different sizes with timestamps and the similar sequence of outgoing packets. For each of these two time series, the adversary calculates the number of transmitted bytes per unit of time over a fixed time period, which results in a volume distribution (or simply a bytes distribution). Namely, the adversary chooses a fixed time period t and a time unit or step st. Then for every time step within this time period the adversary aggregates the number of bytes received in this interval (or sent, depending on the traffic direction). If in a certain time interval no bytes were received (or sent), this interval corresponds to zero bytes. This procedure is done separately for each traffic direction, so each traffic trace is transformed into two distributions of the same size T = t/st. The combination of these two distributions forms a new representation of the traffic trace, which consequently will be an input vector of length Nh = T × 2 to the neural network. As a result of this transformation, all traffic instances' input vectors will have the normalized length Nh, regardless of the initial number of packets in each one of them. The input vector obtained by this transformation is the initial feature vector of the instance.

In order to retain maximum information during this transformation, the adversary has to choose t equal to the time period of the longest trace and also choose the minimum possible step st. That is because the choice of st dictates the amount of information lost after normalization: the bytes distribution within an interval st is lost. The smaller the time step, the more accurate the resulting bytes distribution will be. On the other hand, the smaller the time step, the longer the distribution and the bigger the input vector for the neural network, which increases the neural network's size and computational complexity. This is a trade-off the adversary has to deal with when pre-processing the data.

The adversary does not know in advance the maximum time duration of the traffic traces that they will capture during the WF attack. However, it is still possible to perform the size normalization by either choosing the maximum time duration of the collected traces with a margin of several seconds, or even by cropping future captured traces that are too long.

The resulting distributions can be visualized as histograms. An example of the histograms obtained from the traffic trace in Tab. 4.1 is given in Fig. 4.3. In this example the adversary set t = 4.3 seconds and st = 0.1 seconds, so the histogram contains T = 43 intervals with byte counts for each traffic direction, which results in Nh = 86 values representing the traffic instance.

Figure 4.3: Histogram of traffic trace with time step 0.1 sec

Transforming to wavelet coefficients In addition to the previous steps, the adversary also explores another type of data representation by applying discrete wavelet transform (DWT) analysis [25] as an extra step in data pre-processing. Wavelet analysis is widely used for data processing in many applications, and its positive influence on learning model performance has been repeatedly demonstrated, in particular for deep neural networks [18, 27].

A wavelet transformation of the signal distribution returns a time-frequency representation obtained by multiresolution analysis [16]. Namely, a discrete wavelet function represents a signal by its expansion coefficients, which are known for detecting the most salient frequencies in time series [24]. These coefficients can be interpreted as time-frequency blocks into which the signal is decomposed by the discrete wavelet function on several signal levels, known as scales. Thus, when applied to the histogram (the byte distributions of the traffic trace) obtained in the previous step, the DWT returns wavelet coefficients that are supposed to reflect all available time-frequency features of the webpage traffic and emphasize the most important ones. In this case the initial feature vector for the adversary's SAE consists of the wavelet coefficients of the traffic instance, with the first and second half of the feature vector corresponding to the incoming and outgoing traffic direction respectively.

In Fig. 4.4 the same traffic trace from Tab. 4.1 is visualized for both directions. The intensity of the color denotes the wavelet coefficient magnitude.

Figure 4.4: Wavelet coefficients of a traffic instance, incoming (a) and outgoing (b) direction

The resulting two types of traffic trace representation – the histogram and the wavelet coefficients – consist of different numbers of elements. While the histogram contains Nh non-negative integer values (byte counts), the list of wavelet coefficients (calculated by applying the DWT to the same histogram) contains Nw fractional values, where Nw is normally much bigger than Nh.
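
The thesis relies on code by George Danezis for the DWT (see Section 5.1.2); purely as an illustration, a minimal sketch using the PyWavelets library could look as follows. The 'haar' wavelet and the full-depth decomposition are assumptions, not the settings used in this work:

    import numpy as np
    import pywt  # PyWavelets

    def histograms_to_wavelet_features(hist_in, hist_out, wavelet='haar'):
        # Expand each direction's byte histogram into its DWT coefficients.
        features = []
        for signal in (hist_in, hist_out):
            coeffs = pywt.wavedec(signal, wavelet)   # approximation + details
            features.append(np.concatenate(coeffs))
        # Incoming coefficients form the first half, outgoing the second.
        return np.concatenate(features)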

Rescaling a feature vector Regardless of which of the two discussed data representations was chosen, the adversary must apply one final pre-processing step to the initial feature vector before feeding it into the SAE. As explained in Chapter 3, an autoencoder reconstructs its input as its output. In order for this to be possible, the values of the input vector have to belong to the same range as the network output. The output range of an autoencoder depends on the activation function chosen for the hidden layer, be that sigmoid, tanh or ReLU, among others. The value range of the input vector should be rescaled accordingly; e.g., the values in Tab. 4.1 should be rescaled from the range [621, 1516] to [0, 1] for the sigmoid activation function.
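
A minimal sketch of this rescaling step (the helper name is illustrative):

    import numpy as np

    def rescale(x, low=0.0, high=1.0):
        # Min-max rescaling of a feature vector to the range [low, high].
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        if span == 0.0:
            return np.full_like(x, low)  # constant vector: map to lower bound
        return low + (x - x.min()) * (high - low) / span

    # e.g. rescale(v, 0.0, 1.0) for sigmoid, rescale(v, -1.0, 1.0) for tanh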

Preparing datasets As described in Chapter 3, the neural network is trained byusing the training and validation datasets and tested by using the test dataset. Theadversary has to decompose the crawl into the training, validation and test sets.


The training and validation datasets will be used together for performing the unsupervised pre-training. In the supervised training of the classifier, the training dataset is used for adjusting the weights of the SAE, while the validation dataset is needed for continuous intermediate evaluation of the training process. The test set will be used for testing the SAE at the end of the preparatory phase and for evaluating the built predictive model.

Building a predictive model (SAE)

In the preparatory phase, after collecting and pre-processing the data, the adversary has to build a predictive model. For this they have to choose the SAE architecture and its learning parameters. The general architecture of the SAE includes:

1. The number of neurons in the input layer N.

2. The number of hidden layers H.

3. The number of neurons in the last hidden layer of features F.

4. The number of neurons in every other hidden layer Hi, where i is the index of the hidden layer.

5. The number of neurons in the output layer n.

6. The resulting number of layers L = 1 + H + 1 (one input, H hidden and one output layer).

7. The activation function on neurons (tanh, sigmoid, ReLU or other).

As explained in the data pre-processing section, the number of neurons in the input layer is precisely the size of the initial feature vector obtained after the pre-processing step, which is Nh for the histogram format or Nw for the wavelet coefficients format. Thus, N is equal to either Nh or Nw, depending on the chosen type of pre-processing.

The number of neurons in the output layer corresponds to the number of classes (websites) n.

The number of neurons in the last hidden layer is the desired number of learned fingerprinting features. The vector of values obtained at the last hidden layer is the fingerprint of the instance (the fingerprint of the website's traffic trace). The adversary can choose how many fingerprinting features the model should learn by setting the desired number of neurons F in the last hidden layer.

The number of neurons in each hidden layer should gradually cascade from the size of the input instance down to the desired number of fingerprinting features in the last hidden layer, and then to the output layer with n classes. Thus, the bigger the difference between N and F or n, the more hidden layers the DNN will have. This means that the sizes of the input and output layers define the number of other layers H and the numbers of neurons Hi in these layers, and thus define the complexity of computations inside the DNN.


The adversary would prefer N to be as big as possible, in order to capture as much of the available information about every instance as possible, with high precision. At the same time, the adversary would prefer F to be as small as possible (but not smaller than n), because narrowing down the number of features learned in the last hidden layer means retrieving more abstract, high-level, complex features. However, as pointed out previously, choosing such an architecture results in a very deep, complex structure, which ends up being computationally very intensive during the training process. The adversary has to face this trade-off when choosing the model architecture.

Pre-training the SAE

After deciding on the SAE architecture, the adversary moves on to the unsupervised pre-training of the SAE. Pre-training of the SAE is a greedy layer-wise unsupervised feature extraction and selection process, as described in Chapter 3. During this stage the adversary trains H autoencoders (one for each hidden layer). The adversary has to set the following learning parameters for every autoencoder:

1. Learning rule (SGD, RMSProp or other).

2. Learning rate γ1 (as a parameter of the learning rule).

3. Number of training epochs e1.

4. Size of the training batch b1.

In order to narrow down the search space, the adversary in our model sets the samelearning parameters for every autoencoder.

The cost function (also called the loss) used during training is the mean squared error (MSE), introduced in Chapter 3. With this error function the network evaluates the training process: after every training iteration (epoch) the network evaluates the quality of the learned data compression by comparing the input with its reconstruction (calculating the MSE between them).

The adversary should find learning parameters that ensure a good quality of compression, that is, parameters that minimize the reconstruction MSE. For this the adversary performs a series of experiments to tune the parameters, using both the training and the validation datasets, without their labels. Every trained layer of the deep neural network reduces the dimensionality of the input vector and feeds it into the following layer, until the last hidden layer outputs a vector of unsupervisedly derived features – a compressed representation of the initial traffic trace that is supposed to capture the main features of the input.

Once the minimum MSE is achieved (as close to zero as possible) for every autoencoder, the adversary saves the weights of the trained autoencoders and moves on to the training stage. Note that at this point of the preparatory phase the labels of the instances (that is, the webpages) are irrelevant: the adversary has trained H autoencoders to retrieve the most representative compression of the traffic trace, regardless of its destination.
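
A minimal sketch of pre-training one such autoencoder, assuming Keras with the RMSProp optimizer and sigmoid activations; the layer sizes and parameters are placeholders the adversary has to tune, not the thesis code:

    from keras.models import Sequential
    from keras.layers import Dense

    def pretrain_autoencoder(X, n_in, n_hidden):
        # Train one autoencoder: encode to n_hidden units, decode back to n_in.
        ae = Sequential()
        ae.add(Dense(n_hidden, activation='sigmoid', input_dim=n_in))  # encoder
        ae.add(Dense(n_in, activation='sigmoid'))                      # decoder
        ae.compile(optimizer='rmsprop', loss='mse')
        ae.fit(X, X, epochs=150, batch_size=256, verbose=0)
        # Keep only the encoder; its output becomes the input of the next
        # autoencoder in the greedy layer-wise procedure.
        encoder = Sequential()
        encoder.add(Dense(n_hidden, activation='sigmoid', input_dim=n_in,
                          weights=ae.layers[0].get_weights()))
        encoder.compile(optimizer='rmsprop', loss='mse')
        return encoder, encoder.predict(X)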


Training the SAE with validation

After the adversary has pre-trained the H autoencoders, they construct the SAE by stacking the autoencoders (without their decoding parts) together and concluding it with a softmax activation layer of size n, as described in Chapter 3. After building the SAE, the adversary initializes it with the pre-trained weights and performs the supervised fine-tuning of the classifier.

The adversary uses the training dataset to perform the supervised training according to the algorithm described in Chapter 3. Knowing the real labels of the training set, the SAE learns to retrieve those features from the traffic traces that allow it to fingerprint the websites, that is, to classify the input instances.
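
A minimal sketch of this stacking and fine-tuning step in Keras, reusing the encoders from the pre-training sketch above; names and parameters are illustrative:

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.utils import to_categorical

    def build_and_finetune(encoders, X_train, y_train, X_val, y_val, n=100):
        # Stack the pre-trained encoder layers and append the softmax layer.
        sae = Sequential()
        for enc in encoders:
            sae.add(enc.layers[0])        # reuse the pre-trained weights
        sae.add(Dense(n, activation='softmax'))
        sae.compile(optimizer='rmsprop', loss='categorical_crossentropy',
                    metrics=['accuracy'])
        # Supervised fine-tuning, continuously evaluated on the validation set.
        sae.fit(X_train, to_categorical(y_train, n),
                validation_data=(X_val, to_categorical(y_val, n)),
                epochs=1500, batch_size=128)
        return sae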

To achieve a good quality of learning (high generalization abilities of the SAE and no overfitting), the adversary has to tune the following learning parameters for the training of the whole SAE:

1. Learning rule (SGD, RMSProp or other).

2. Learning rate γ2 (as a parameter of the learning rule).

3. Number of training epochs e2.

4. Size of the training batch b2.

The adversary has to conduct a series of experiments with varying learning parameters and evaluate the classifier performance on the training and validation datasets.

The SAE learns on the training data (the network adjusts its weights based on the instances in the training dataset), but it is constantly evaluated on the validation data in addition to the training data. The adversary in our model uses three performance metrics for validation and evaluation of the classifier performance in the preparatory phase: accuracy A, weighted accuracy Aw and cross-entropy (or entropy) E.

Accuracy (A). A is the percentage of correctly classified instances. It is a rough indicator of the future attack success on real data. Accuracy is computed after every epoch of training for both the training and validation datasets.

Weighted accuracy (Aw). Aw is a more precise indicator of classification correctness, since it also takes into account the certainty of the predictions made. This accuracy metric is computed as the probability of correct classification averaged over all test instances:

\[
A_w = \frac{\sum_{i=1}^{c} P_i}{n},
\]

where n is the number of classified instances, c is the number of correctly classified instances and Pi is the probability of instance i belonging to the predicted class.

In most cases this metric is smaller than the previously defined accuracy. It becomes equal to the first accuracy metric if and only if all the probabilities of correct classification are equal to 1.


Cross-entropy (E). This is the loss function for the fine-tuning step. It is an average measure of the certainty of the output probability distributions:

\[
E = -\frac{\sum_{i=1}^{n} \ln(P_i)}{n},
\]

where n is the number of classified instances and Pi is the probability of instance i belonging to the predicted class.

The classifier minimizes the cross-entropy between the predicted class probabilities and the desired distribution, in which only the correct class has probability 1 and all others are zero. The lower the cross-entropy value, the more certain the classifier is about its predictions. Based on this metric, the adversary may decide whether the built predictive model gives predictions trustworthy enough to use in the WF attack.

The accuracy and entropy performance metrics should only be examined in conjunction. The adversary considers the classifier performance to be sufficient only when both metrics have satisfactory values, that is, the maximum possible prediction accuracy together with the minimum cross-entropy of the output probability vector.
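
A minimal sketch computing the three metrics exactly as defined above, given the matrix of predicted class probabilities; the names are illustrative:

    import numpy as np

    def performance_metrics(probs, y_true):
        # `probs` is the (n_instances, n_classes) matrix of predicted class
        # probabilities; `y_true` is the vector of true labels.
        y_pred = probs.argmax(axis=1)
        p_pred = probs.max(axis=1)         # Pi: probability of predicted class
        correct = (y_pred == y_true)
        n = len(y_true)
        A = correct.mean()                 # accuracy
        Aw = p_pred[correct].sum() / n     # weighted accuracy
        E = -np.log(p_pred).mean()         # cross-entropy of the predictions
        return A, Aw, E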

The adversary computes these performance metrics after every epoch of training for both the training dataset and the validation dataset. The results on the training dataset indicate how well the SAE learns the data. The results on the validation dataset are a good indicator of the quality of learning, that is, how well the classifier performs on unseen data. The adversary has to analyze the classifier performance on both datasets and tune the learning parameters in a way that ensures good generalization on the validation dataset and avoids overfitting on the training dataset.

Provided the parameters of the learning process are well tuned, the fine-tuning step of training is supposed to significantly improve the generative or discriminative abilities of the network, which define the quality of classification of new instances. Once the adversary finishes the fine-tuning, the building of the predictive model is complete.

Testing the SAE

Previously the adversary chose model weights that show good performance on the training and validation datasets, but it is still important to evaluate the model on data that was not considered during training with validation. The adversary applies the model to the test set in order to classify the test instances. As pointed out before, the adversary knows the right labels of the test set, which allows them to evaluate the classification accuracy.

The adversary uses the trained predictive model to classify the test set. After performing the classification, the SAE outputs the results and the three performance metrics introduced in the previous section. If the adversary is satisfied with the values of A, Aw and E, they consider the model ready for the real WF attack.


Capturing traffic

The adversary starts the WF attack by capturing the anonymized web traffic fromthe target user.

The time gap between training the classifier and performing the WF attack is very important for the adversary. Juarez et al.'s experiments [17] show that website content changes greatly over time and that this significantly affects the accuracy of the WF attack. The bigger the time interval between the training and the test data, the poorer the performance demonstrated by the classifier. The adversary would therefore want to capture the real anonymized traffic of the target user directly after training the predictive model. We denote the time interval between collecting the training data and capturing the real data as dt.

Pre-processing captured data

The adversary applies the automated data pre-processing and converts each traffic trace into an initial feature vector. The captured traffic is pre-processed in the same manner as the traffic collected earlier to build the classifier. Depending on the data format chosen during the preparatory step, the captured data is either converted to histograms (using the same time period t and time step st, so as to end up with a vector of the same length as during training) or to wavelet coefficients.

Classification via the SAE

The adversary runs the trained classifier on the pre-processed traffic traces. The SAE takes every instance as input, computes its compressed representation with F derived features and, based on these features, classifies the instance into one of the n possible classes. It then outputs the classification results for every instance: the predicted label, the whole output vector of predictions and the entropy measure of this vector.

Evaluation of classification results

During the real WF attack, the only performance metric available to the adversary that does not require knowledge of the real labels is the cross-entropy. The adversary analyzes the cross-entropy of the resulting class predictions. If its value is satisfactory (small enough to ensure a high certainty of predictions), the adversary considers the estimated labels (webpages) trustworthy, and thus the traffic is successfully de-anonymized.
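
A minimal sketch of this attack phase, assuming a trained Keras model; the helper name is hypothetical:

    import numpy as np

    def deanonymize(model, X_captured):
        # Classify captured traces; the only label-free quality measure is
        # the entropy of the predictions, as defined above.
        probs = model.predict(X_captured)
        labels = probs.argmax(axis=1)            # predicted webpages
        E = -np.log(probs.max(axis=1)).mean()    # certainty of predictions
        return labels, E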


Chapter 5

Evaluation

In this chapter we evaluate the suggested system model and present the final results.

5.1 Experimental Setup

5.1.1 Dataset

To simulate the traffic collected by the adversary, both for training the classifier and for performing traffic de-anonymization, we use traffic traces from Juarez et al.'s experiments [17]. These traffic traces correspond to crawls of the top 100 websites in the Alexa list of most popular websites [1] (n = 100). In particular, the experiments in this section are based on two specific crawls collected in 2014 from a virtual machine with a GNU/Linux operating system located within the KU Leuven network. The websites were visited by automating the Tor Browser Bundle (TBB) (version 3.5.2), and the network traffic was recorded with dumpcap.

The first dataset we use was crawled in Nb = 10 batches with Nv = 4 visits each, so it contains D = 4000 instances. The second one was crawled two days later in Nb = 5 batches with Nv = 4 visits, so D = 2000. Thus, the first dataset contains 40 instances for every website and the second one contains 20 instances for every website.

The two crawls we use for our experiments were collected under the same crawling conditions; the only difference between the two datasets is the time of crawling: the time interval dt is equal to 2 days.

5.1.2 Software

We use the script developed by Juarez et al. [17] to parse the recorded raw network traffic and convert it to a tabular CSV format (e.g., see Tab. 4.1).

We have developed a series of Python scripts for performing the further data pre-processing described in Chapter 4. The pre-processing scripts convert the CSV file with a traffic trace (its incoming and outgoing packets) to the two formats: the histogram and the wavelet coefficients. The part of our code that performs the DWT on the traffic trace was written by George Danezis.


We have developed a script that builds, trains and evaluates the feedforward DNN (specifically, the SAE described in Section 4.2), using the Python-based deep learning tool Keras [9]. Keras works on top of the Theano library, specially designed for building neural networks [2].

Theano is capable of carrying out intensive parallel computation on Graphical Processing Units (GPUs). We run the deep learning scripts on a GPU in the ESAT-VISICS KU Leuven infrastructure, which gives a significant increase in execution speed (more than 50 times faster than a standard CPU).

5.1.3 Experimental Procedure

We partition the first dataset (with 4000 instances) into three non-overlapping sets: a training set, a validation set and a test set. These represent an approximate proportion of 8 : 1 : 1 of the whole dataset: we use 8 crawl batches for training, 1 batch for validation and 1 batch for testing. Rotating the batches over these three roles gives 10 ways to decompose the dataset into the three sets, with each way corresponding to a separate scenario (see the sketch below). We have to experiment with every way of partitioning the given dataset, because using different data for training results in a different predictive model, while evaluating this predictive model on different data results in different performance metrics. We then average the performance metrics over all 10 scenarios.
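
A minimal sketch of this batch rotation, assuming the ten crawl batches are stored in a list; the names are illustrative:

    def rotations(batches):
        # Yield the 10 (training, validation, test) partitions by cyclically
        # rotating which batches serve as the validation and test sets.
        k = len(batches)  # 10 crawl batches
        for i in range(k):
            val, test = batches[i], batches[(i + 1) % k]
            train = [b for j, b in enumerate(batches)
                     if j not in (i, (i + 1) % k)]
            yield train, val, test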

We also use the second dataset (with 2000 instances) to test the best predictive model built on the first dataset, in order to evaluate its performance during a WF attack deployed two days after training.

Dataset pre-processing

We evaluate both types of data representation of the traffic traces: the histogram (byte counts) and the list of wavelet coefficients.

We choose the parameters t and st for the transformation. The time period t corresponds to the duration of the longest trace in the data. The step st is chosen empirically to provide both a satisfactory precision of the distribution and a feasible complexity of the neural network, as explained in Chapter 4. We transform the traffic traces to histograms with t = 117.9 seconds and st = 0.1 seconds, and then we transform the histograms to wavelet coefficients. The length of every histogram is Nh = 2358, and the length of the lists of wavelet coefficients returned by the DWT procedure is Nw = 4090. We end up with four datasets: two datasets in the histogram format (with 4000 and 2000 instances respectively) and the same two datasets expressed in wavelet coefficients.
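
The histogram length follows directly from the chosen parameters:

\[
T = \frac{t}{s_t} = \frac{117.9}{0.1} = 1179, \qquad N_h = T \times 2 = 2358.
\]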

SAE structure

Every pre-processed instance, be it a histogram or a list of wavelet coefficients, represents the initial feature vector of the traffic trace. The size of this vector defines the size N of the input layer of the SAE classifier used by the adversary.


We experiment with varying numbers of hidden layers and neurons in each layer for both data representations. The last layer of the SAE classifier in all experiments has n = 100 neurons, one for each webpage.

We experiment with all three discussed activation functions. For tanh and sigmoid we rescale the input vectors to fit the [−1, 1] and [0, 1] value ranges respectively, while the ReLU activation function is defined for every real number.

Pre-training the SAE and training with validation

For every built SAE structure, we perform the pre-training and the training-with-validation procedures using our software, the same way the adversary would do during the preparatory phase. For this we tune the learning parameters for both stages: the unsupervised feature extraction and selection using the autoencoders, and the supervised fine-tuning of the SAE classifier.

While tuning the parameters, we vary the values γ1, e1, b1 of the pre-training phase and γ2, e2, b2 of the training phase. We vary the learning rates γ1 and γ2 from 0.1 to 0.000001. The numbers of training epochs e1 and e2 are varied from 100 to 1000 for the autoencoders in the pre-training stage and from 500 to 3000 for the SAE in the training stage. The training batch sizes take power-of-two values from 16 to 256. A sketch of such a grid search is given below.
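
A minimal sketch of the grid search over the fine-tuning parameters; the evaluation function is a placeholder for pre-training and fine-tuning one SAE:

    from itertools import product

    def train_and_validate(lr, epochs, batch):
        # Placeholder: pre-train and fine-tune one SAE with these parameters
        # and return its validation accuracy.
        return 0.0

    learning_rates = [0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001]
    epoch_counts = [500, 1000, 1500, 2000, 2500, 3000]
    batch_sizes = [16, 32, 64, 128, 256]

    best = max(product(learning_rates, epoch_counts, batch_sizes),
               key=lambda params: train_and_validate(*params))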

Testing the SAE

For every trained SAE, we test the resulting classifier on the test set and analyze the classification results using the performance metrics described in Chapter 4 (accuracy, weighted accuracy and cross-entropy).

De-anonymization attack

In our experiments, the results achieved on the test set during testing of the SAE indicate the performance of the adversary's model on captured traffic (without known labels). In other words, if the adversary had trained the model the same way as we did in our experiments, then captured a batch of data from the target user and applied the classifier to this data, the result would have been the same. So the success of testing the SAE trained in our experiments indicates the success of a de-anonymization attack by an adversary training the same predictive model and applying it to data captured directly after training.

In addition, we test the SAE on the second crawl, as if this were the data the adversary captured two days after training their model.

5.2 Experimental Results

Here we present the SAE that demonstrated both a high website fingerprinting performance and a feasible amount of computation.


The following results are for the SAE designed for traces from the first dataset converted to wavelet coefficients. Initially we conducted experiments with instances in the histogram format, but introducing the DWT and converting the data to the wavelet coefficient format yielded better performance.

The presented SAE has 5 hidden layers: the input layer has 4090 neurons (the number of input wavelet coefficients), the following hidden layers have 3600, 3000, 2600 and 1900 neurons, the final hidden layer of learned features has 1000 neurons, and the output layer has 100 neurons corresponding to the closed world of webpages. The last layer uses the softmax activation function for classification, while all prior layers use a sigmoid activation function. Our experiments showed that setting the sigmoid function on the neurons generally resulted in better learning.
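
Assuming plain fully connected Keras layers, this architecture corresponds to the following sketch (not the exact training script):

    from keras.models import Sequential
    from keras.layers import Dense

    # The presented architecture: 4090 -> 3600 -> 3000 -> 2600 -> 1900 -> 1000 -> 100
    sae = Sequential()
    sae.add(Dense(3600, activation='sigmoid', input_dim=4090))
    sae.add(Dense(3000, activation='sigmoid'))
    sae.add(Dense(2600, activation='sigmoid'))
    sae.add(Dense(1900, activation='sigmoid'))
    sae.add(Dense(1000, activation='sigmoid'))   # learned fingerprint features
    sae.add(Dense(100, activation='softmax'))    # one neuron per webpage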

During our experiments, RMSProp consistently demonstrated better results than standard SGD for both stages of training, as expected. So here we only present the results for RMSProp.

The best quality of data compression during pre-training was achieved with the following learning parameters: γ1 = 0.001, e1 = 150, b1 = 256.

The SAE was constructed from the autoencoders that showed the best results. The best performance of the fine-tuning with validation was achieved with the following learning parameters: γ2 = 0.00001, e2 = 1500, b2 = 128.

First we evaluate this SAE 10 times on the different scenarios explained in Section 5.1.3. In Fig. 5.1 we show how we choose the batches for training, validation and testing in every experiment: the light colored sections correspond to the batches in the training set, the darker colored batches are used for validation and the black colored batches are used for testing.

We provide the performance metric values (accuracy A, weighted accuracy Aw and cross-entropy E) for every combination of datasets and compute their average values over all 10 models. We achieve 60.86% accuracy on average and 70.59% in the best case. Taking the certainty of the predictions into account, the model achieves 49.68% weighted accuracy on average. Compared to the 1% accuracy of random classification, the model consistently demonstrates high performance. A relatively low value of entropy expresses the certainty of the classifier. For comparison, we could not achieve more than 53% accuracy on data in the histogram format. Thus, we showed that the DWT can be an efficient pre-processing step for DL.

Figure 5.1: Cross-evaluation of the SAE for the WF attack

We test the most precise model among the 10 (from experiment #7, with 70.59% accuracy) on the second dataset, which was crawled two days later. The results are presented in Tab. 5.1. We observe a 30% drop in accuracy, and the entropy doubled.

A         Aw        E
40.05%    31.84%    3.677365

Table 5.1: Testing the classifier on data crawled two days after training

5.3 Illustration of Parameter Tuning

In this section we provide an example of tuning the model parameters, which allows the adversary to achieve higher performance. This procedure demonstrates the actions the adversary must take in order to construct the classifier for the WF attack.

We describe the process of tuning the parameters for one scenario of partitioning the dataset into the training, validation and test datasets: the training dataset consists of the first crawl's batches 1 to 8, the validation set is the 9th batch and the test set is the 10th batch of the same crawl (see experiment 1 in Fig. 5.1). To train the first autoencoder we set the default RMSProp learning rate to 0.001, as recommended by Hinton [13]. We choose a batch size of 64 and try to learn the data compression in 200 epochs (these are just initial estimates which can be adjusted further during the tuning process).

In order to evaluate the learning process, we plot the learning curves of the accuracy and loss of the reconstruction in Fig. 5.2. The charts show that the autoencoder quickly reaches a high reconstruction accuracy. It reaches the lowest error value around the 150th epoch and then stabilizes. This means that our learning settings are appropriate for learning a good representation. We see that continuing the learning process after the 150th epoch does not yield any further significant gains in MSE. Hence, in order to optimize the process, we set the number of training epochs to 150 for all other autoencoders. We train the remaining four autoencoders analogously. The resulting accuracy and MSE values are listed in Tab. 5.2. After performing the described procedure, we have 5 trained autoencoders which show a good quality of data compression.

Figure 5.2: Accuracy and loss (MSE) during training of an autoencoder

#   Accuracy   MSE
1   94.049     0.01892
2   92.546     0.01457
3   97.526     0.01834
4   97.087     0.01841
5   81.272     0.02290

Table 5.2: Performance metrics of the pre-trained autoencoders

We build the SAE, initialize it with the pre-trained weights, and train the SAE to retrieve from the traffic traces the features that allow it to fingerprint the websites, that is, to perform the classification of the input instances.

We use the classification labels of the same training instances and perform thesupervised fine-tuning algorithm, as described in Chapters 3 and 4.

This stage is more computationally intensive because of the multiple hidden layers. We initially set the default learning rate to 0.001 and choose a bigger batch size (128) in order to speed up the learning process, trying to fine-tune the classifier in 1000 epochs.

Figure 5.3 shows the resulting learning curves for the training set (in red) and the validation set (in blue), which demonstrate how the model overfits the training data. The SAE shows almost perfect recognition of the training set as early as the 100th epoch of training. But for the validation set, the loss function that has to be minimized starts increasing already after the 50th epoch, and the accuracy stops improving at around 40%. We also see that the validation learning curves tend to oscillate at the beginning of learning. This is due to the learning rate being too large: according to the backpropagation algorithm described in Chapter 3, the learning rate defines how sharply the weights are changed after every epoch. To avoid this oscillation of the accuracy and loss function on the validation set, we reduce the learning rate to 0.0001. The results are presented in Fig. 5.4.

Figure 5.3: Accuracy and loss (cross-entropy) during training of the SAE (red for training set, blue for validation set), lr = 0.001

Figure 5.4: Accuracy and loss (cross-entropy) during training of the SAE (red for training set, blue for validation set), lr = 0.0001

We see that the model still overfits to the training set, but it does so later than before, only after the 250th epoch. This means that reducing the learning rate improved the learning process. We set a new learning rate of 0.00001 and repeat the experiment with more training epochs (1500). The resulting learning curves are depicted in Fig. 5.5. In these plots we can see the accuracy and loss for the training and validation sets improving simultaneously, with the validation performance not falling far behind the training performance. We do not observe any overfitting. We then store the weights of the DNN at the moment of the minimal loss value and consider this to be our final model.

Figure 5.5: Learning process of the final SAE for the wavelet format of data (red for training set, blue for validation set), lr = 0.00001

The DL model's performance on the validation set is a good indicator of how well the model generalizes to unseen data. We can expect the model to demonstrate the same performance on the test set during the evaluation. In the same way, the adversary may use the validation during training to estimate how well the trained SAE will perform during the WF attack on captured crawls.


Chapter 6

Discussion and Future Work

In this chapter we discuss the results of our study and the limitations and challenges of this work. We also suggest future extensions to our study.

The fundamental difference between our approach to WF and previously suggested ones lies in the feature extraction procedure. Our experiments show that the WF attack can be successfully deployed against traffic anonymized by Tor without manual feature engineering. Our DL model is capable of automatically retrieving WF features and identifying websites based on these features. We demonstrated a successful de-anonymization attack with 61% accuracy on average and 71% as the highest accuracy.

Our approach does not improve on the website identification results of existing works. However, it exposes potential improvement possibilities for the adversary. Previously, every improvement achieved by the adversary was due to defining more complex features. Our method shows that applying NNs to the WF problem can reach comparable results without manual feature extraction. Taking the measures described below as future research may result in outperforming the previous methods.

We do not rule out that constructing a deeper DNN (with more hidden layers of bigger size) might further improve the presented results. However, our experiments showed that simply increasing the number of hidden layers does not necessarily lead to better performance: it is also important to tune the number of neurons and the DNN learning parameters in order to benefit from "deepening" the structure.

As pointed out earlier in Chapter 3, it is only possible to fully benefit from DL techniques when a large number of training instances is available. Not only does any learning-based method require as many training instances as possible to generalize to new data, but extracting complex features from input data also requires processing a maximum amount of data. The datasets available for our experiments only included 40 instances for each of the 100 classes. Hence, another direction for improving the classification is collecting more data for training. The number of instances per class must at least exceed the number of output neurons (the number of possible classes), which is a common rule for NNs.

Training the DL model is a long process, and it only becomes longer with more training instances and a deeper structure. Our experiments showed that the de-anonymization attack performance drops with time (we observed a 30% drop in accuracy two days after training), so the adversary has to train the model right before executing the attack. Training our model took 25 minutes on average, which is considerably slower than the state-of-the-art methods, which take several seconds to train the classifier.

Moreover, the stacked autoencoder is by far not the only DNN capable of performing feature extraction and solving the classification problem. An adversary who wishes to perform automated feature extraction could find inspiration in research on other types of deep neural nets: deep belief NNs, deep convolutional NNs, deep recurrent NNs, and so on. Among these, we chose to implement a stacked autoencoder because of the existing application of this DNN to web traffic classification, discussed in Chapter 2.

Finally, another possible improvement of the WF attack accuracy could be combining our model with the manually defined fingerprinting features presented in prior works.

From the WF perspective, it has to be noted that our system model follows many assumptions made in previous research on Tor de-anonymization. For instance, we do not consider the possibility of dynamic changes on the webpage, be it a different language version, third-party content such as advertisements, or simply content changes over time. This assumption strongly simplifies the task for the attacker. As pointed out by Juarez et al. [17], fixing the language version of the webpage means assuming that the adversary knows the location of the Tor exit node. Moreover, the traffic generated by requesting different webpages of the same website also differs, which is another fact usually ignored in WF research, including our system model.

But the most unrealistic assumption made in this work is that of a closed world of webpages. In reality, the target user may visit any webpage on the Internet (the so-called open world), so the adversary can never collect instances of all possible websites in order to train the predictive model. Previous research in WF often includes an evaluation of the proposed approaches in the open-world setting. Another future extension of our study would be to evaluate the DL model in the open world.


Chapter 7

Conclusion

In this study, our objective was to apply a novel deep learning technique to extract and select website fingerprinting features and to perform de-anonymization attacks on anonymized Tor traffic. The focus of our research was on the closed-world setting with a local passive adversary that performs a targeted attack on a Tor user.

We started by developing an adversary model that builds, trains and evaluates a deep neural network classifier – a stacked autoencoder. The classifier works in three stages. First, it performs unsupervised feature extraction and selection with its building components – autoencoders – by learning the underlying patterns in traffic traces and deriving the most distinctive characteristics of the input data. The next step is the supervised fine-tuning of the whole deep neural network, during which the model learns to classify traffic into the set of webpages based on the derived traffic features. After training, the adversary evaluates their model on the test set based on the accuracy of website identification and other performance metrics. The final step is performing the de-anonymization of Tor traffic using this model: capturing the target user's traffic and classifying it using the trained stacked autoencoder classifier.

We proposed two types of automated traffic pre-processing that output data representations suitable for a neural network: the list of traffic byte counts and the list of wavelet coefficients. We used existing datasets collected by visiting a closed world of 100 webpages and converted them to both data formats. In order to evaluate our system model, we conducted a series of experiments in which we imitated the actions of an adversary preparing for the website fingerprinting attack and deploying the attack using the developed classifier. In order to enhance the performance of the classifier, we performed system parameter tuning over fixed ranges of values.

We presented the best setup, which allowed us to achieve 71% classification accuracy – a success rate comparable with previous works in website fingerprinting. We observed that performing the discrete wavelet transform on the input data yields more accurate website identification. In addition, our experiments showed that increasing the time interval between training the classifier and deploying the attack leads to worse model performance.

Even though the presented method does not outperform existing website fingerprinting models, our study opens a new direction of research; until now, performing the attack has always required manual feature extraction carried out by expert analysis. The choice of features was often based on a number of heuristics and expert intuition. We showed that this time-consuming and rigorous process can be automated by exploiting a deep learning technique. The fact that our model is capable of performing a de-anonymization attack without manually defined traffic features suggests that by taking additional measures, such as parameter tuning, collecting more data, or combining the model with existing features from prior work, the adversary could improve the attack.


Bibliography

[1] Alexa top global sites. URL: http://www.alexa.com/topsites.

[2] LISA laboratory, University of Montreal. Theano. URL: http://deeplearning.net/software/theano/.

[3] A. Ng, J. Ngiam, C. Y. Foo, Y. Mai, and C. Suen. UFLDL tutorial: Stacked autoencoders. URL: http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders, 2010.

[4] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

[5] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, et al. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19:153, 2007.

[6] H. Blockeel. Machine Learning and Inductive Inference. Acco, 2010. ISBN: 978 90 334 8297 7.

[7] X. Cai, X. C. Zhang, B. Joshi, and R. Johnson. Touching from a distance: Website fingerprinting attacks and defenses. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, pages 605–616. ACM, 2012.

[8] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

[9] F. Chollet. Keras: Deep learning library for Theano and TensorFlow. URL: https://keras.io/.

[10] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. AISTATS, 15(106):275, 2011.

[11] D. Hayes. Website fingerprinting at scale. Technical report, University College London (UCL), 2015.

[12] D. Herrmann, R. Wendolsky, and H. Federrath. Website fingerprinting: Attacking popular privacy enhancing technologies with the multinomial naïve-Bayes classifier. In Proceedings of the 2009 ACM Workshop on Cloud Computing Security, pages 31–42. ACM, 2009.

[13] G. Hinton. Neural networks. URL: https://www.coursera.org/learn/neural-networks.

[14] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[15] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

[16] B. Jawerth and W. Sweldens. An overview of wavelet based multiresolution analyses. SIAM Review, 36(3):377–412, 1994.

[17] M. Juarez, S. Afroz, G. Acar, C. Diaz, and R. Greenstadt. A critical evaluation of website fingerprinting attacks. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 263–274. ACM, 2014.

[18] S. Kaitwanidvilai, C. Pothisarn, C. Jettanasen, P. Chiradeja, and A. Ngaopitakkul. Discrete wavelet transform and back-propagation neural networks algorithm for fault classification in underground cable. In Proceedings of the International MultiConference of Engineers and Computer Scientists, volume 2. Citeseer, 2011.

[19] Naval Research Laboratory. Tor. URL: https://www.torproject.org/about/torusers.html.en.

[20] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[21] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic. Deep learning applications and challenges in big data analytics. Journal of Big Data, 2(1):1, 2015.

[22] A. Panchenko, F. Lanze, A. Zinnen, M. Henze, J. Pennekamp, K. Wehrle, and T. Engel. Website fingerprinting at internet scale. In Proceedings of the 23rd Internet Society (ISOC) Network and Distributed System Security Symposium (NDSS 2016), 2016.

[23] A. Panchenko, L. Niessen, A. Zinnen, and T. Engel. Website fingerprinting in onion routing based anonymization networks. In Proceedings of the 10th Annual ACM Workshop on Privacy in the Electronic Society, pages 103–114. ACM, 2011.

[24] D. B. Percival and A. T. Walden. Wavelet Methods for Time Series Analysis, volume 4. Cambridge University Press, 2006.

[25] R. Perez, J. Mattingly, and J. Perez. Wavelet transform techniques and signal analysis. Technical report, Oak Ridge National Lab., TN (United States), 1993.

[26] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

[27] S. Sihag and P. K. Dutta. Faster method for deep belief network based object classification using DWT. arXiv preprint arXiv:1511.06276, 2015.

[28] J. Suykens. Artificial Neural Networks. Katholieke Universiteit Leuven, Department of Electrical Engineering, ESAT-STADIUS, 2013.

[29] T. Wang, X. Cai, R. Nithyanand, R. Johnson, and I. Goldberg. Effective attacks and provable defenses for website fingerprinting. In 23rd USENIX Security Symposium (USENIX Security 14), pages 143–157, 2014.

[30] T. Wang and I. Goldberg. Improved website fingerprinting on Tor. In Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society, pages 201–212. ACM, 2013.

[31] Z. Wang. The applications of deep learning on traffic identification, 2010.


KU Leuven 2016 – 2017

Master thesis filing card

Student: Vera Rimmer

Title: Deep Learning Website Fingerprinting Features

UDC: 621.3

Abstract:
Anonymity networks like Tor enable Internet users to browse the web anonymously. This helps citizens circumvent censorship from repressive governments, journalists communicate with anonymous sources, or regular users avoid tracking online. However, adversaries can try to identify anonymous users by deploying several attacks. One such attack is website fingerprinting. Website fingerprinting exploits the ability of an adversary to generate anonymized (encrypted) traffic to a number of websites and then compare it, based on traffic metadata such as packet timing, size and direction, to the traffic metadata of the users the adversary wishes to de-anonymize. When a user's traffic metadata matches the metadata the adversary had previously generated, the website the user visits may be revealed to the adversary, thus breaking the user's anonymity. In prior works, authors have identified several features that allow an adversary to fingerprint the websites visited by a user. Examples of such features include packet length counts or the timing and volume of traffic bursts. These features were however manually identified, e.g., using heuristics, leaving open the question of whether there are more identifying features or methods to fingerprint the websites visited by anonymous users. In this thesis we depart from prior work and design a website fingerprinting attack with automated feature extraction; that is, we do not manually select the identifying features to fingerprint the websites but rely on machine learning methods to do so. We rely on deep learning techniques to learn the best fingerprinting features and demonstrate the viability of our attack by deploying it on a closed-world scenario of 100 webpages. Our results show that, with 71% website identification accuracy, adversaries can use machine learning methods to de-anonymize Tor traffic, instead of having to rely on manual selection of fingerprinting features.

Thesis submitted for the degree of Master of Science in Artificial Intelligence, option Engineering and Computer Science
Thesis supervisor: Prof. Claudia Diaz
Assessors: Prof. Frank Piessens
Güneş Acar
Mentors: Marc Juarez
Ero Balsa