Towards Reliable Traffic Classification Using Visual Motifs€¦ · Using Visual Motifs to Classify Encrypted Tra c, Wright et al. 2006 Intelligent Classi cation and Visualization

BackgroundVisual Motifs

Traffic ClassificationEvaluation

Towards Reliable Traffic Classification UsingVisual Motifs

Wilson Lian1 John McHugh1,2 Fabian Monrose1

1University of North Carolina at Chapel Hill

2RedJack, LLC

FloCon 2010

Wilson Lian, John McHugh, Fabian Monrose Towards Reliable Traffic Classification UsingVisual Motifs



Overview

Background

Visual Motifs

Traffic Classification

Evaluation




Motivation

Internet

Network Administrator




Motivation

GET /index.ht...

d3b07384d113e...

d41d8cd98f00b...

7d8ad5cb9c940...

MAIL FROM: foo@...

d41d8cd98f00b...




Goals

Port 22

36fd6d8c3f5af4...Port 22

fc2394c1a922...Port 22

f4d6d8c3f5a36...Port 22

222394c1a9fc...Port 22

5ad6d8c3ff436...Port 22

a92394c122fc...

Port 25

MAIL FROM: [email protected] 25

f98698466c3ef...Port 25

ef8698466c3f9...Port 25

DATA\r\nSubject: fo...Port 25

f98698466c3ef...

Port 80

b314caafaa3e...Port 80


POST /login.ph...Port 80


POST /AuthSv...Port 80

GET /index.ht...

Port 1214

113edec49eaa...Port 1214

006f7b3db8f4f...Port 1214

aa3edec49e11...Port 1214

4f006f7b3db8f0...Port 1214

9e3edec4aa11...Port 1214

b8006f7b3d4ff0...




Assumptions

Reliable transport via TCP

Stream Cipher

No access to payloadLength preservation

Negligible packet loss & retransmission




Related Work

Scatter (and other) Plots for Visualizing User Profiling Dataand Network Traffic, Goldring 2004.

Using Visual Motifs to Classify Encrypted Traffic, Wright etal. 2006

Intelligent Classification and Visualization of Network ScansMuelder et al. 2008.

FloVis: A Network Security Visualization Framework, Taylor2009.




Timeline Heatmaps

Image credit: Wright et al. 2006




Timeline Heatmaps





Timeline Heatmaps





Timeline Heatmaps





Timeline Heatmaps





Timeline Heatmaps





Timeline Heatmaps





Unigram Heatmaps





Bigram Heatmaps




Heatmap Construction

Time

SYN 48 bytes

SYN-ACK 48 bytes

ACK 40 bytesHTTP Request 891 bytes

1500 bytes

40 bytes

40 bytes

1500 bytes270 bytes40 bytes

Client Server

48-4840

891-40

-270-1500

40-1500

40

(48, -48)(-48, 40)(40, 891)(891, -40)(-40, -270)

(-270, -1500)(-1500, 40)(40, -1500)(-1500, 40)





(48, -48)(-48, 40)(40, 891)(891, -40)(-40, -270)

(-270, -1500)(-1500, 40)(40, -1500)(-1500, 40)

(40, 891)

(48, -48)

(-48, 40)(40, 891)

(891, -40)(-40, -270)(-270, -1500)

(-1500, 40)

(40, -1500)

(-1500, 40)





(48, -48)(-48, 40)(40, 891)(891, -40)(-40, -270)

(-270, -1500)(-1500, 40)(40, -1500)(-1500, 40)

(40, 891)

(48, -48)

(-48, 40)(40, 891)

(891, -40)(-40, -270)(-270, -1500)

(-1500, 40)

(40, -1500)

(-1500, 40)3/9 = 33.3%

2/9 = 22.2%

1/9 = 11.1%

3/9 = 33.3%








Bigram Heatmaps




Modeling Protocol Behavior

(40, 891)

(48, -48)

(-48, 40)(40, 891)

(891, -40)(-40, -270)(-270, -1500)

(-1500, 40)

(40, -1500)

(-1500, 40)3/9 = 33.3%

2/9 = 22.2%

1/9 = 11.1%

3/9 = 33%

1

3 4

2




Modeling Protocol Behavior

0 1 2 3 4

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Bin

Prob

abili

ty

.333

.111

.222

.333




Comparing Protocol Models

1 2 3 4

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Bin

Prob

abili

ty

.333

.100 .111

.222

.333

.700

.150

.050





Atotal =n∑

k=1

Ak

Btotal =n∑

k=1

Bk

ScoreA↔B =n∑

i=1

∣∣∣∣ Ai

Atotal− Bi

Btotal

∣∣∣∣=

1

Atotal · Btotal

n∑i=1

|Ai · Btotal − Bi · Atotal |





50 1 2 3 4

Prob

abili

ty

.333

.100 .111

.222

.333

.700

.150

.050

Score = .233+.589+.072+.283 = 1.177

|.333-.100| = .233

|.111-.700| = .589

|.333-.050| = .283

|.222-.150| = .072

Diff

eren

ce





1 2 3 4

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Bin

Prob

abili

ty

.333

.400

.111

.222

.333

.150 .150

.300





50 1 2 3 4

-0.2

-0.1

-0

0.1

0.2

0.3

0.4

0.5

0.6

0.7Pr

obab

ility

.333

.400

.111

.222

.333

.150 .150

.300

Score = .067+.039+.072+.033 = .211

|.333-.400| = .067

|.111-.150| = .039 |.333-.300| = .033

|.222-.150| = .072

Diff

eren

ce




Classifying Samples: Easy as 1-2-3

1 Create training models for desired protocols

2 Build distribution for sample network trace

3 Find training model with lowest difference score

ScoreA↔B =n∑

i=1

∣∣∣∣ Ai

Atotal− Bi

Btotal

∣∣∣∣=

1

Atotal · Btotal

n∑i=1

|Ai · Btotal − Bi · Atotal |




Evaluation

How much traffic must be collected for:

TrainingTesting

Precision?

true positives

true positives + false positives

Recall?

true positives

true positives + false negatives




Data

CRAWDAD Dataset

Weekdays: January 19, 2004 – February 6, 2004

Ports with sufficient traffic

≥ 1M packets0.3% of ports → 95.21% of packets

Keep top 10 ports by number of sessions observed

No ground truth

Total Packets 1.3 Billion

Traffic Volume 707 GB

Observed Ports 64,214

Sessions 5.2 Million

Port 80 Sessions 1.7 Million




Methodology

Trial :=1 Randomly sample some percentage of available data for each

port and train classifier2 Randomly sample some number of the remaining data points

for each port and create testing samples3 Classify testing samples

50 Trials

80 110 445

TrainingData

TestingData




Training Size Selection

160 2 4 6 8 10 12 14

0.88

0.89

0.9

0.91

0.92

0.93

0.94

0.95

0.96

0.97

0.98

0.99

1

Training Sample Size (%)

Aver

age

Reca

ll

.952

.884

Average Recall for Varying Training Size

6.8% improvement




Training Size Selection

160 2 4 6 8 10 12 14

0.88

0.89

0.9

0.91

0.92

0.93

0.94

0.95

0.96

0.97

0.98

0.99

1

Training Sample Size (%)

Aver

age

Prec

isio

n

.887

.953

Average Precision for Varying Training Size

6.6% improvement




Testing Size Selection

0 10,000 20,000 30,000 40,000 50,000

0.88

0.89

0.9

0.91

0.92

0.93

0.94

0.95

0.96

0.97

0.98

0.99

1

Testing Sample Size (Data Points)

Aver

age

Reca

ll

.898

.926

.954

Average Recall for Varying Testing Size

5.6% improvement




Testing Size Selection

0 10,000 20,000 30,000 40,000 50,000

0.88

0.89

0.9

0.91

0.92

0.93

0.94

0.95

0.96

0.97

0.98

0.99

1

Testing Sample Size (Data Points)

Aver

age

Prec

isio

n

.961

.914

4.7% improvement

Average Precision for Varying Testing Size




Results

50 Trials

15% TrainingSet Size

50,000 DataPoints TestingSet Size

96.5% Precision

96.0% Recall




Classification Confidence Threshold

Goal: Eliminate close calls

Require 1st place candidate to lead 2nd place by certainamount to make decision

Standard deviation of scores




Methodology v2.0

Randomly sample some percentage of available data for eachport and train classifier

Randomly sample some number of the remaining data pointsfor each port and create testing samples

Attempt to classify testing samples

If all testing samples reach threshold, done.If any testing sample fails, rebuild testing samples and tryagain.




Classification Confidence

50 Trials

5% Training SetSize

35,000 DataPoints TestingSet Size

1.0 LeadThreshold

96.9% Precision

96.6% Recall




Classification Confidence

0 0.25 0.5 0.75 1 1.25

0

500,000

1,000,000

1,500,000

2,000,000

2,500,000

3,000,000

3,500,000

4,000,000

4,500,000

5,000,000

5,500,000

Lead Threshold

Test

ing

Sess

ions

Sam

pled

1.0 M 1.0 M 1.2 M

2.4 M

Testing Session Sampled For 50 Trials35,000 Data Point Testing Samples

5.4 M




Ground Truth Testing

MIT Lincoln Labs DARPA Data50 trials, 5% training sample size, 35,000 data point testingsample size, 1.25 lead threshold

Precision: 98.3%

Recall: 98.0%




Results

50 Trials

5% Training SetSize

35,000 DataPoints TestingSize

1.25 LeadThreshold

Ground Truth




Evasion

One might attempt to thwart our technique by padding all packetsto MTU.Reduces problem to 4-quadrant problem.Can still make decisions based on relative prevalence of eachquadrant.




Current/Future Work

Packet loss/re-transmission may cause unpredictable results

On-line classification

Training and testing from separate datasets

UDP

Subcategorization




Conclusion

Modeling protocol behavior using only packet size, direction,and order

Resistant to encryption and padding

Average precision and recall > 97%

Quick and reliable traffic inspection

Useful for pre-screening traffic for deeper analysis




Questions?

Thanks for listening.Q & A

[email protected]


[email protected]

Documents

Towards Reliable Traffic Classification Using Visual Motifs€¦ · Using Visual Motifs to Classify Encrypted Tra c, Wright et al. 2006 Intelligent Classi cation and Visualization