Toyota InfoTechnology Center U.S.A, Inc. Mixture Models of End-host Network Traffic. John Mark Agosta, Jaideep Chandrashekar, Mark Crovella, Nina Taft and Daniel Ting. Toyota-ITC, Technicolor, Boston U., Technicolor, Facebook.

Page 1: Mixture Models of  End-host  Network  Traffic

Mixture Models of End-host Network Traffic

John Mark Agosta, Jaideep Chandrashekar, Mark Crovella, Nina Taft and Daniel Ting

Toyota-ITC, Technicolor, Boston U., Technicolor, Facebook

Page 2: Mixture Models of  End-host  Network  Traffic

Outline

• We collected traffic at the end host, a vantage point that is rarely monitored.
• Conventional distributions don’t fit heavy-tailed data.
• The dense part of the distribution doesn’t look Pareto, and fitting only the Pareto tail doesn’t describe the data.
• Fit by mixture models (not the typical Gaussian mixtures) of a Pareto tail, with exponentials as a proxy for the dense part.
• Model selection: the best number of components is constrained by a complexity penalty, and the result is a model of the entire distribution.
• Uses:
    • Better tail parameter estimates than conventional measures.
    • Soft clustering: assign traffic to exponential vs. Pareto components, by protocol.
    • More stable threshold setting.

Page 3: Mixture Models of  End-host  Network  Traffic

Data collection effort

End-host flows:
• Collected at the laptop network port
• Collection moved around with the device
• Assembled from packet trace headers
• On an enterprise XP build
• Periodic server uploads
• Logged with user & CPU activity, to eliminate off periods

Data sets:
• 270 personal machine data sets
• 90% laptops
• 5 week duration
• 400G raw data, total
• Flow initiation counts are binned in intervals from 4 to 512 seconds
• Removed zero-count intervals
• Median sample 9800 points
• Max sample size 264k
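The binning step can be sketched as follows. This is a minimal illustration, assuming flow start times are available as seconds; the function name `bin_flow_counts` and the sample timestamps are hypothetical, and measuring bins from the first observed flow is a simplifying assumption.

```python
import numpy as np

def bin_flow_counts(start_times, bin_seconds):
    """Count flow initiations per fixed-width time bin, dropping empty bins."""
    t = np.asarray(start_times, dtype=float)
    if t.size == 0:
        return np.array([], dtype=int)
    # Bin index of each flow, measured from the first observed flow (assumption).
    idx = np.floor((t - t.min()) / bin_seconds).astype(int)
    counts = np.bincount(idx)
    # Zero-count intervals are removed, as described on the slide.
    return counts[counts > 0]

# Hypothetical flow start times in seconds, binned at two window sizes.
times = [0.5, 1.0, 3.9, 4.1, 20.0, 21.5, 22.0]
counts_4s = bin_flow_counts(times, 4)
counts_16s = bin_flow_counts(times, 16)
```

Each bin size yields a separate sample of counts for the same machine, which is why later results are reported per binning time window.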

Page 4: Mixture Models of  End-host  Network  Traffic

Heavy tailed data is extremely wide compared to conventional distributions.

• Fitting any exponential family distribution (e.g. Gaussian, Poisson…) fails. Any exponential tail is too steep.
• Fitting a mixture of exponential families requires an impractical number of components.
• But just fitting the power law tail ignores most of the probability mass.

[Figure: best fit normal.]
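To see why an exponential tail is "too steep", one can compare survival probabilities P(X > x) of an exponential and a Pareto distribution. The sketch below uses illustrative parameters only (rate 0.1, tail index 1.6, lower bound 1); none of them are fitted to the data set described here.

```python
import math

def exp_survival(x, rate):
    """P(X > x) for an exponential distribution with the given rate, x >= 0."""
    return math.exp(-rate * x)

def pareto_survival(x, alpha, x_min=1.0):
    """P(X > x) for a Pareto distribution with tail index alpha, x >= x_min."""
    return (x_min / x) ** alpha

# Illustrative parameters (assumptions, not estimates from the study).
for x in (10, 100, 1000):
    print(x, exp_survival(x, rate=0.1), pareto_survival(x, alpha=1.6))
```

With these values the exponential survival probability is the larger of the two at x = 10 but is already smaller at x = 100, and it is many orders of magnitude smaller at x = 1000.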

Page 6: Mixture Models of  End-host  Network  Traffic

The distribution looks like an exponential above and a power law below

[Figure: two panels, "Power law fit" and "Exponential fit", each annotated with one region of good fit and one region of bad fit.]

Page 7: Mixture Models of  End-host  Network  Traffic

Exponential – Pareto mixture models.

A mixture model is a hierarchical model where the mixing weights determine the probability of each of the component models, which in turn generate the sample points.

Since all components share the same support, any sample point could in principle have been generated by any component, with a probability that depends on that component's mixing weight.

We consider three models:
• Pareto (P): One power-law component.
• Exponential – Pareto (EP): One of each.
• 2 Exponentials, one Pareto (EEP). Any more exponential components cannot be resolved.
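One way to write the three models is sketched below. The shared lower bound $x_{\min}$, the shifted form of the exponential, and the symbols $m$, $\lambda$ and $\alpha$ are assumptions made for illustration; they are not notation taken from the slides.

```latex
\begin{align*}
f_{E}(x \mid \lambda) &= \lambda\, e^{-\lambda (x - x_{\min})}, \qquad
f_{P}(x \mid \alpha) = \frac{\alpha\, x_{\min}^{\alpha}}{x^{\alpha + 1}}, \qquad x \ge x_{\min},\\
\text{P:}\quad p(x) &= f_{P}(x \mid \alpha),\\
\text{EP:}\quad p(x) &= m_{E}\, f_{E}(x \mid \lambda) + m_{P}\, f_{P}(x \mid \alpha),
  \qquad m_{E} + m_{P} = 1,\\
\text{EEP:}\quad p(x) &= m_{E_1} f_{E}(x \mid \lambda_1) + m_{E_2} f_{E}(x \mid \lambda_2)
  + m_{P}\, f_{P}(x \mid \alpha), \qquad m_{E_1} + m_{E_2} + m_{P} = 1.
\end{align*}
```

Under this parameterization the P, EP and EEP models have 1, 3 and 5 free parameters respectively, counting one fewer mixing weight than components because the weights sum to one.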

Page 8: Mixture Models of  End-host  Network  Traffic

By modeling the entire data set, mixture models give more accurate tail α-parameter estimates than methods that consider only the tail data.

When tested on synthetic Pareto-tailed data, the EP mixture model estimator performs significantly better than the well-known AEST method. (AEST estimates are shown on the left, and EP-based estimates on the right in each pane.)
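A minimal sketch of estimating the tail parameter from an EP mixture fitted to all of the data, tested on synthetic data. Expectation-maximization (EM) is used here as a stand-in; the slides do not say which fitting procedure the authors used. The shifted exponential, the fixed lower bound `x_min = 1`, the starting values, and the synthetic parameters are all assumptions.

```python
import numpy as np

def fit_ep_mixture(x, x_min=1.0, n_iter=500):
    """EM fit of (1 - m_p) * ShiftedExp(lam) + m_p * Pareto(alpha) on x >= x_min."""
    x = np.asarray(x, dtype=float)
    log_ratio = np.log(x / x_min)
    excess = x - x_min
    # Crude starting values (assumption): equal weights, alpha = 1, rate from the mean.
    m_p, alpha, lam = 0.5, 1.0, 1.0 / excess.mean()
    for _ in range(n_iter):
        # E-step: probability that each point came from the Pareto component.
        f_p = alpha * x_min**alpha / x**(alpha + 1)
        f_e = lam * np.exp(-lam * excess)
        r = m_p * f_p / (m_p * f_p + (1 - m_p) * f_e)
        # M-step: weighted maximum-likelihood updates for each parameter.
        m_p = r.mean()
        alpha = r.sum() / (r * log_ratio).sum()
        lam = (1 - r).sum() / ((1 - r) * excess).sum()
    return m_p, alpha, lam

# Synthetic data drawn from a known EP mixture (hypothetical parameters).
rng = np.random.default_rng(0)
n, true_m_p, true_alpha, true_lam = 20000, 0.25, 1.6, 0.05
from_pareto = rng.random(n) < true_m_p
x = np.where(from_pareto,
             1.0 + rng.pareto(true_alpha, n),          # classical Pareto, x_min = 1
             1.0 + rng.exponential(1 / true_lam, n))   # exponential shifted to x_min
m_p_hat, alpha_hat, lam_hat = fit_ep_mixture(x)
```

Because every point contributes to the fit through its responsibility `r`, the estimate of `alpha` does not depend on first choosing a cutoff that separates "tail" from "body".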

Page 9: Mixture Models of  End-host  Network  Traffic

Model Selection versus Goodness-of-Fit

Goodness-of-fit tests, while useful for initial characterization, don’t have an explicit acceptance criterion and, as data set size increases, will eventually reject all models.

Model selection is a relative, pairwise criterion that derives from a comparison of likelihoods.

We use the Bayes Information Criterion (BIC) to approximate the Bayes Factor terms. It penalizes the maximum likelihood by the model degrees of freedom, d, so that models with different numbers of parameters can be compared.

The Bayes Factor is the ratio of the marginal likelihood of one model (EP) to that of another (P). For instance, a log Bayes Factor of 5 indicates that the probability of the data given one model versus the other is over 100:1.

With the BIC approximation, the log Bayes Factor becomes

$$\log \mathrm{BF}_{EP,P} \approx \left(\log \hat{L}_{EP} - \log \hat{L}_{P}\right) - \frac{d_{EP} - d_{P}}{2} \log n,$$

where $\hat{L}$ is each model's maximized likelihood and $n$ is the sample size.
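A minimal numeric sketch of this comparison, using the standard BIC form: the difference in maximized log-likelihoods, minus half the difference in degrees of freedom times log n. The function name, the log-likelihood values, and the parameter counts (1 for P, 3 for EP) are hypothetical and chosen only to show the arithmetic.

```python
import math

def bic_log_bayes_factor(loglik_a, d_a, loglik_b, d_b, n):
    """Approximate log Bayes Factor of model A over model B via BIC.

    log BF ~= (loglik_a - loglik_b) - 0.5 * (d_a - d_b) * log(n)
    """
    return (loglik_a - loglik_b) - 0.5 * (d_a - d_b) * math.log(n)

# Hypothetical maximized log-likelihoods for one machine with n = 9800 points.
# Assumed parameter counts: EP has 3 (alpha, rate, one free weight), P has 1 (alpha).
log_bf = bic_log_bayes_factor(loglik_a=-31250.0, d_a=3,
                              loglik_b=-31300.0, d_b=1, n=9800)

# A natural-log Bayes Factor of 5 corresponds to odds of exp(5), roughly 148:1.
odds = math.exp(5)
```

A positive `log_bf` favors model A (here EP) even after the extra parameters are penalized; a negative value favors the simpler model B.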

Page 10: Mixture Models of  End-host  Network  Traffic

Pairwise BIC comparisons of the models reveal large log BF values for EP vs P and smaller values for EEP vs EP

[Figure: boxplot of BIC comparison for Pareto (P) vs. EP mixture model.]

[Figure: boxplot of BIC comparison for EP vs. EEP mixture model.]

Page 11: Mixture Models of  End-host  Network  Traffic

Model Selection Results

Model selection results based on Bayes Factors, over all users. Each bar represents the same user set with a different binning time window.

For the P, EP, and EEP models:
• P: Only a handful of users are given the Pareto-only model.
• EP: Overall, the EP model is selected for 50-85% of the users, depending upon the bin size.
• EEP: Between 15% and 40% of user machines are best modeled by EEP, again depending upon the bin size.

[Figure: bar chart of model selection results, one bar per bin size; legend: P, EP, EEP.]

Page 12: Mixture Models of  End-host  Network  Traffic

Histograms of Heavy-Tail Parameters’ Variation, EP Model.

• The difference across users is significant.

Page 13: Mixture Models of  End-host  Network  Traffic

Partitioning traffic into Exponential and Pareto ranges

Mixture fractions as a function of the number of connections indicate the (soft) membership of the data in each component.

In this example, bins with fewer than 82 counts are almost entirely exponential, and those with more than 82 are almost entirely Pareto.

This way, different sources of the traffic can be characterized as heavy-tailed or not.

[Figure: Mixture Fractions, User 256; labels: mPareto, mexp, P(traffic).]
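Soft membership of this kind can be sketched as the posterior probability that a bin count came from the Pareto component. The parameters below are hypothetical and are not those of any user in the study, so the crossover they produce differs from the one in the example above; the shifted exponential and `x_min = 1` are also assumptions.

```python
import numpy as np

def pareto_membership(x, m_p, alpha, lam, x_min=1.0):
    """Posterior probability that a bin count x came from the Pareto component."""
    x = np.asarray(x, dtype=float)
    w_p = m_p * alpha * x_min**alpha / x**(alpha + 1)     # weighted Pareto density
    w_e = (1 - m_p) * lam * np.exp(-lam * (x - x_min))    # weighted exponential density
    return w_p / (w_p + w_e)

# Hypothetical EP parameters (assumptions for illustration).
counts = np.arange(1, 501)
p_pareto = pareto_membership(counts, m_p=0.05, alpha=1.6, lam=0.5)

# Smallest count at which the Pareto component is the more likely source.
crossover = int(counts[p_pareto > 0.5].min())
```

Counts below `crossover` are attributed mostly to the exponential component and counts above it mostly to the Pareto component; summing these memberships per protocol gives a traffic fraction for each component.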

Page 14: Mixture Models of  End-host  Network  Traffic

Traffic Fractions, in Exponential and Pareto Components, by Protocol

Although Exponential traffic dominates in all cases, the long tail (i.e. Pareto) traffic appears largely from bursts of ICMP, DNS and web traffic flows.

Page 15: Mixture Models of  End-host  Network  Traffic

In summary

1. We have modeled traffic as flow initiations from end hosts in an enterprise, using mixture models and employing model selection.

2. We have discovered strong evidence that the traffic is almost always heavy-tailed, with the Pareto component contributing about 1/4 of the probability mass, and with a power law scaling parameter of mean 1.6 that varies widely, between 1.0 and 2.0.

3. Apparently DNS, ICMP and some web traffic make up the tail component.

See the full paper at http://arxiv.org/abs/1212.2744

Page 16: Mixture Models of  End-host  Network  Traffic

BACKUP


Page 17: Mixture Models of  End-host  Network  Traffic

Pareto & Exponential components of selected users

Page 18: Mixture Models of  End-host  Network  Traffic

Page 19: Mixture Models of  End-host  Network  Traffic

Anomaly thresholds derived from models are more stable than empirical thresholds.
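A model-derived threshold can be sketched as the count whose tail probability under the fitted mixture equals a chosen alarm rate. The example below solves for it by bisection on the EP survival function; the parameterization (shifted exponential, `x_min = 1`), the parameter values, and the function names are assumptions for illustration.

```python
import math

def ep_survival(x, m_p, alpha, lam, x_min=1.0):
    """P(X > x) under an exponential-Pareto mixture, for x >= x_min."""
    return (1 - m_p) * math.exp(-lam * (x - x_min)) + m_p * (x_min / x) ** alpha

def model_threshold(tail_prob, m_p, alpha, lam, x_min=1.0):
    """Count x at which the mixture's survival probability equals tail_prob (0 < tail_prob < 1)."""
    lo, hi = x_min, 2.0 * x_min
    # Grow the bracket until the survival probability falls below the target.
    while ep_survival(hi, m_p, alpha, lam, x_min) > tail_prob:
        hi *= 2.0
    # Bisection: the survival function is strictly decreasing in x.
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if ep_survival(mid, m_p, alpha, lam, x_min) > tail_prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical fitted parameters and a 0.1% alarm rate.
thr = model_threshold(0.001, m_p=0.05, alpha=1.6, lam=0.5)
```

An empirical threshold at the same rate rests on only the few largest observations, whereas this one is a smooth function of parameters fitted to the whole sample.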

Page 20: Mixture Models of  End-host  Network  Traffic

Component parameters are independent

This implies that the exponential and Pareto components are generated by separate sources.