LD-Sketch: A Distributed Sketching Design for Accurate and Scalable Anomaly Detection in Network Data Streams

Qun Huang and Patrick P. C. Lee

The Chinese University of Hong Kong, Hong Kong

INFOCOM’14

Page 2: Motivation

Network traffic: a stream of (key, value) tuples
  • Keys: src IPs, five-tuple flows
  • Values: # of packets, payload bytes

Heavy keys: classical anomalies in network traffic
  • Heavy hitters: keys with large volume in one period
    • e.g., SLA violations
  • Heavy changers: keys with large volume change across two periods
    • e.g., DoS attacks, component failures

Goal:
  • Identify heavy keys in real time

Page 3: Challenges

Enormous key space
  • e.g., 5-tuple IPv4 flows are drawn from a key domain of size $2^{104}$
  • Per-key tracking is infeasible

Line-rate processing
  • A single machine fails to keep pace with the line rate

Seamless distributed detection
  • Apply single-machine detection in a distributed architecture
  • Open issue: how to achieve both scalability and accuracy?

Page 4: Related Work

Counter-based techniques
  • Misra-Gries algorithm [Misra & Gries 82]; Lossy Counting [Manku et al. 02]; Space Saving [Metwally et al. 05]; Probabilistic Lossy Counting [Dimitropoulos et al. 08]
  • Only address heavy hitter detection on a single machine

Sketch-based techniques
  • Multi-stage filter [Estan et al. 03]; CGT [Cormode et al. 04]; Reversible Sketch [Schweller et al. 06]; SeqHash [Tian et al. 07]; Fast Sketch [Liu et al. 12]
  • Only work on a single machine

Distributed detection
  • [Cormode et al. 2005]
  • [Manjhi et al. 2005]
  • [Yi et al. 2009]
  • Only address heavy hitter detection

Page 5: Our Work

LD-Sketch: a new sketching design for heavy key detection in a distributed architecture

A sketch technique for local detection
  • High accuracy
  • High speed
  • Low space complexity

A distributed detection scheme that not only achieves scalability but also improves accuracy

Experiments on real-world traces

Page 6: Problem Formulation

Perform detection in each time period (epoch)

Input data: a stream of (key, value) tuples

True sum $S(x)$:
  • sum of the values of key $x$ in the time period

True change $D(x)$:
  • absolute value of the difference of $S(x)$ between the current and last epochs

Heavy hitters: all keys $x$ whose true sum $S(x)$ exceeds a threshold $\phi$

Heavy changers: all keys $x$ whose true change $D(x)$ exceeds a threshold $\phi$

Problem: it is infeasible to track $S(x)$ and $D(x)$ for every key in real time with limited memory
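
As a point of reference, the definitions above can be sketched as exact per-key tracking, which is precisely the approach that does not scale to a huge key space. The function names, the sample epochs, and the strict ">" comparison against the threshold are assumptions made for illustration.

```python
from collections import Counter

def true_sums(stream):
    """Exact per-key sum of values in one epoch (one counter per key)."""
    sums = Counter()
    for key, value in stream:
        sums[key] += value
    return sums

def heavy_hitters(curr, phi):
    """Keys whose true sum in the current epoch exceeds phi."""
    return {k for k, s in curr.items() if s > phi}

def heavy_changers(prev, curr, phi):
    """Keys whose absolute change between two epochs exceeds phi."""
    keys = set(prev) | set(curr)
    return {k for k in keys if abs(curr.get(k, 0) - prev.get(k, 0)) > phi}

# Two toy epochs of (key, value) tuples.
epoch1 = [("a", 5), ("b", 1), ("a", 4)]
epoch2 = [("a", 1), ("b", 2), ("c", 9)]
s1, s2 = true_sums(epoch1), true_sums(epoch2)
hh = heavy_hitters(s2, phi=6)        # only "c" has sum 9 > 6
hc = heavy_changers(s1, s2, phi=6)   # "a" drops by 8, "c" rises by 9
```

Memory here grows with the number of distinct keys, which is what motivates a sketch with a bounded number of buckets.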

Page 7: Architecture

[Figure: data sources feed remote sites, which forward data to workers. Each worker performs local detection; the local detection results are combined by distributed detection into the final detection results.]

Page 8: Local Detection

Structure of LD-Sketch: $r$ rows, with $w$ buckets each; row $i$ has its own hash function $h_i$ ($h_1, h_2, \ldots, h_r$)

Update phase
  • For each data item $(x, v_x)$:
    • select a bucket for each row $i$ by hashing key $x$ with function $h_i$
    • update the bucket with the data item

Detection phase
  • Examine the buckets and report heavy keys
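
The update phase can be sketched as follows, with each bucket reduced to a single running total so that only the row and bucket selection is shown. The class name, the salted BLAKE2 hash standing in for the per-row hash functions, and the sample keys are all assumptions; the slides do not specify the hash family.

```python
import hashlib

def row_hash(i, key, w):
    """Hypothetical h_i: salt the key with the row index, map to one of w buckets."""
    digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % w

class RowsOfBuckets:
    """r rows of w buckets; each bucket is simplified to a total sum."""

    def __init__(self, r, w):
        self.r, self.w = r, w
        self.total = [[0] * w for _ in range(r)]

    def update(self, key, value):
        # One bucket per row is selected and updated for every data item.
        for i in range(self.r):
            j = row_hash(i, key, self.w)
            self.total[i][j] += value

sk = RowsOfBuckets(r=3, w=8)
for key, value in [("10.0.0.1", 5), ("10.0.0.2", 2), ("10.0.0.1", 4)]:
    sk.update(key, value)
row_sums = [sum(row) for row in sk.total]  # every row absorbs the full stream
```

Because each item touches exactly one bucket per row, every row's totals add up to the total stream volume.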

Page 9: Inside a Bucket

Bucket $(i, j)$
  • Total sum: $V_{i,j}$
  • Array $A_{i,j}$ of (key, value) slots (key 1, key 2, ..., empty), with length $l_{i,j}$
  • Error: $e_{i,j}$
  • Expansion parameter: $T$

Basic ideas
  • Track significant keys in a bucket with array $A_{i,j}$
  • Increment length $l_{i,j}$ based on total sum $V_{i,j}$ and parameter $T$
  • Record error $e_{i,j}$ due to dropping insignificant keys

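Page 10: Update Bucket $(i, j)$ with $(x, v_x)$

Four cases
  • Case 1: $x$ is in $A_{i,j}$
    • Update directly: $A_{i,j}[x] \leftarrow A_{i,j}[x] + v_x$
  • Case 2: $x$ is not in $A_{i,j}$, but $A_{i,j}$ has empty slots
    • Insert key $x$ into $A_{i,j}$, and set $A_{i,j}[x] = v_x$
  • Cases 3 & 4: $x$ is not in $A_{i,j}$, and $A_{i,j}$ is full
    • Expansion number $k = \lfloor V_{i,j} / T \rfloor$, where $T$ is the expansion parameter
    • Based on $k$ and $l_{i,j}$:
      • Case 3: decrement keys in $A_{i,j}$
      • Case 4: expand $A_{i,j}$ dynamically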
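
The four cases can be sketched in one update routine. This is a minimal sketch under stated assumptions, not the authors' code: the expansion number is taken as floor(V / T), the target length as (k+1)(k+2)-1 (which matches the 5 to 11 expansion shown later in the deck), the decrement value as the minimum of the new value and the smallest tracked counter, and the bucket starts with length 0.

```python
class Bucket:
    def __init__(self, T):
        self.T = T    # expansion parameter
        self.V = 0    # total sum of all values hashed to this bucket
        self.A = {}   # tracked keys -> counters
        self.l = 0    # current capacity of A (assumed to start at 0)
        self.e = 0    # accumulated error from decrements

    def update(self, x, v):
        """Apply (x, v) and return which of the four cases fired."""
        self.V += v
        if x in self.A:                      # Case 1: key already tracked
            self.A[x] += v
            return 1
        if len(self.A) < self.l:             # Case 2: empty slot available
            self.A[x] = v
            return 2
        k = self.V // self.T                 # expansion number (assumed formula)
        target = (k + 1) * (k + 2) - 1       # assumed target length
        if target <= self.l:                 # Case 3: decrement keys
            dec = min([v] + list(self.A.values()))
            self.e += dec
            self.A = {y: c - dec for y, c in self.A.items() if c > dec}
            if v > dec:
                self.A[x] = v - dec
            return 3
        self.l = target                      # Case 4: expand, then insert
        self.A[x] = v
        return 4

b = Bucket(T=10)
cases = [b.update(x, v) for x, v in
         [("a", 4), ("a", 3), ("b", 2), ("c", 6), ("d", 1)]]
```

In this run the small key "b" is absorbed into the error term, while the growing total sum later triggers an expansion that makes room for "c" and "d".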

Page 11: Decrement Keys

Case 3: Example
  • Bucket: $A_{i,j} = \{y: 5\}$, $l_{i,j} = 1$, $e_{i,j} = 2$
  • New data item $(x, v_x)$

Procedure
  • Step 1: calculate decrement value $\hat{e}$

$$\hat{e} = \begin{cases} 3, & \text{if } v_x = 3 \\ 5, & \text{if } v_x = 5 \\ 5, & \text{if } v_x = 8 \end{cases}$$

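Page 12: Decrement Keys (cont.)

Procedure (cont.)
  • Step 2: Update $e_{i,j}$ by adding the decrement value $\hat{e}$

$$e_{i,j} = \begin{cases} 5, & \text{if } v_x = 3 \\ 7, & \text{if } v_x = 5 \\ 7, & \text{if } v_x = 8 \end{cases}$$

  • Step 3: Update $A_{i,j}$
    • $A_{i,j}[y] \leftarrow A_{i,j}[y] - \hat{e}$, for all $y$ in $A_{i,j}$
    • Remove all $y$ with $A_{i,j}[y] \le 0$
    • Insert key $x$ with $A_{i,j}[x] = v_x - \hat{e}$ if $v_x > \hat{e}$

$A_{i,j}$ before: $\{y: 5\}$

$A_{i,j}$ after:
  • $v_x = 3$: $\{y: 2\}$
  • $v_x = 5$: empty
  • $v_x = 8$: $\{x: 3\}$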
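
The three steps can be replayed on the example bucket. The rule used for Step 1 (the minimum of the new value and the smallest tracked counter) is an assumption; it is chosen because it reproduces the decrement values 3, 5, and 5 given in the example. The function name is hypothetical.

```python
def decrement(A, e, x, v_x):
    """Return (new_A, new_e, dec) after the Case 3 decrement of a full bucket."""
    dec = min([v_x] + list(A.values()))                     # Step 1 (assumed rule)
    new_e = e + dec                                         # Step 2: grow the error
    new_A = {y: c - dec for y, c in A.items() if c > dec}   # Step 3: decrement, drop <= 0
    if v_x > dec:                                           # Step 3: keep x only if it survives
        new_A[x] = v_x - dec
    return new_A, new_e, dec

# Bucket from the example: A = {y: 5}, e = 2, new key x with three candidate values.
results = {v: decrement({"y": 5}, 2, "x", v) for v in (3, 5, 8)}
```

For the three values the array ends up as {y: 2}, empty, and {x: 3} respectively, with the error rising to 5, 7, and 7.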

Page 13: Dynamic Expansion

Case 4:
  • Add new counters to $A_{i,j}$
  • Set $l_{i,j} = (k+1)(k+2) - 1$, where $k$ is the expansion number
  • Insert key $x$ with $A_{i,j}[x] = v_x$

Example
  • Before: $l_{i,j} = 5$; $A_{i,j}$ holds $y_1, y_2, y_3, y_4, y_5$
  • After: $l_{i,j} = 11$; $A_{i,j}$ holds $y_1, y_2, y_3, y_4, y_5$ and $x$
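
The jump from length 5 to length 11 in the example is consistent with a target length of (k+1)(k+2)-1 for expansion number k = floor(V / T). Both formulas and the value T = 10 are assumptions used only to illustrate the arithmetic.

```python
def expanded_length(V, T):
    """Assumed target length of the array for total sum V and expansion parameter T."""
    k = V // T                       # expansion number
    return (k + 1) * (k + 2) - 1

# With T = 10, total sums in [0, 10), [10, 20), and [20, 30) give k = 0, 1, 2.
lengths = [expanded_length(V, T=10) for V in (5, 15, 25)]
```

Under these assumptions the length grows quadratically in k, so a bucket only gains many counters once its total sum is large.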

Page 14: Estimate True Sum or Change

Estimate the true sum of key $x$ in bucket $(i, j)$: a pair of values
  • a lower estimate and an upper estimate

Estimate the true change of key $x$ in bucket $(i, j)$
  • Estimate change: combine the pair from the bucket at the 1st epoch (lower and upper estimates) with the pair from the bucket at the 2nd epoch (lower and upper estimates)
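
One way to realize such estimate pairs is sketched below. The formulas are assumptions, since the slide's expressions did not survive: the lower estimate is the tracked counter (0 if the key is not tracked), the upper estimate adds the bucket's error, and the change estimate is the largest difference the two pairs allow.

```python
def estimate_sum(bucket, x):
    """Return (lower, upper) estimates of the sum of key x in one bucket."""
    tracked = bucket["A"].get(x, 0)
    return tracked, tracked + bucket["e"]

def estimate_change(bucket_prev, bucket_curr, x):
    """Largest change of x consistent with the two (lower, upper) pairs."""
    low1, up1 = estimate_sum(bucket_prev, x)
    low2, up2 = estimate_sum(bucket_curr, x)
    return max(up1 - low2, up2 - low1)

# Same bucket position in two epochs; x is tracked only in the first.
epoch1_bucket = {"A": {"x": 10}, "e": 2}
epoch2_bucket = {"A": {"y": 4}, "e": 3}
pair1 = estimate_sum(epoch1_bucket, "x")
pair2 = estimate_sum(epoch2_bucket, "x")
change = estimate_change(epoch1_bucket, epoch2_bucket, "x")
```

Taking the maximum of the two cross differences keeps the change estimate on the safe side, which fits a design that avoids false negatives.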

Page 15: Identify Heavy Key

Key point: consider only keys tracked by buckets

Enumerate all buckets $(i, j)$
  • If $V_{i,j} \ge \phi$, check each key $x$ in the bucket

Check key $x$ for heavy hitters
  • the estimated sum of $x$ must reach the threshold in the hashed bucket of every row $i$

Check key $x$ for heavy changers
  • the estimated change of $x$ must reach the threshold in the hashed bucket of every row $i$

Page 16: Analysis

Let $H$ be the maximum number of heavy keys

On accuracy
  • Zero false negative rate
  • Upper bound on the false positive rate

On complexity (bounds stated in terms of $H$)
  • Time complexity to update a data item
  • Time complexity to identify heavy keys
  • Space complexity

Page 17: Distributed Detection

[Figure: remote sites send data to workers; each worker performs local detection, and the local detection results are aggregated into the final detection results.]

Goal
  • Scalability: reduce complexity
  • Accuracy: reduce false positive rate

Remote site
  • How to partition data streams

Final results
  • How to aggregate local detection results

Page 18: Remote Sites

Two-step partitioning of each data item $(x, v_x)$
  • Step 1: select $d$ workers based on key $x$
  • Step 2: select one from the $d$ workers uniformly

For the same $x$, the same $d$ workers are selected in all remote sites
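
The two steps can be sketched as a deterministic choice followed by a random one. Picking d consecutive worker ids starting from a hash of the key is a hypothetical scheme, as are the function names and the symbols q (total workers) and d (workers per key); the only property taken from the slide is that step 1 depends on the key alone, so every remote site derives the same workers. The sketch assumes d <= q.

```python
import hashlib
import random

def candidate_workers(key, q, d):
    """Step 1: pick d of q workers from the key alone (same at every remote site)."""
    digest = hashlib.blake2b(str(key).encode(), digest_size=8).digest()
    start = int.from_bytes(digest, "big") % q
    return [(start + n) % q for n in range(d)]

def pick_worker(key, q, d, rng=random):
    """Step 2: choose one of the d candidates uniformly at random."""
    return rng.choice(candidate_workers(key, q, d))

# Two remote sites with independent random state agree on the candidate set.
site_a = candidate_workers("10.0.0.1", q=5, d=2)
site_b = candidate_workers("10.0.0.1", q=5, d=2)
chosen = pick_worker("10.0.0.1", q=5, d=2, rng=random.Random(0))
```

Spreading each key over d workers splits its load, while the fixed candidate set keeps all of a key's traffic within workers that can later be consulted together.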

Page 19: Detection and Aggregation

Detection in workers
  • For key $x$, each of the $d$ workers selected for $x$ expects to receive $1/d$ of the traffic of $x$
  • Perform local detection in each worker with a correspondingly reduced threshold

Aggregate results
  • For key $x$: if all workers selected for $x$ report $x$ in local detection, report $x$ as a heavy key
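
The aggregation rule can be sketched as an intersection over each key's selected workers. The data layout (a mapping from worker id to the set of keys it reported) and the fixed candidate table are assumptions for illustration; the per-worker threshold is left out because its exact form is not given here.

```python
def aggregate(local_reports, candidates_of):
    """Report key x only if every worker selected for x reported it locally.

    local_reports: worker id -> set of keys reported by that worker
    candidates_of: function mapping a key to its selected worker ids
    """
    seen = set().union(*local_reports.values())
    return {x for x in seen
            if all(x in local_reports.get(w, set()) for w in candidates_of(x))}

# Keys "a" and "b" are both assigned to workers 0 and 1 in this toy setup.
assignment = {"a": [0, 1], "b": [0, 1]}
reports = {0: {"a", "b"}, 1: {"a"}, 2: set()}
final = aggregate(reports, lambda x: assignment[x])
```

Requiring agreement from all selected workers filters out keys that look heavy at only one worker, which is how aggregation can lower the false positive rate.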

Page 20: Analysis

Let
  • $H$ be the maximum number of heavy keys
  • $q$ be the total number of workers

On accuracy
  • Reduces the false positive rate
  • Introduces a small false negative rate due to unfair partitioning

On complexity (bounds stated in terms of the quantities above)
  • Time complexity to update a data item
  • Time complexity to identify heavy keys
  • Space complexity

Page 21: Experimental Results

Trace
  • 3G UMTS network in mainland China in December 2010
  • 1.1 billion packets, 600 GB of traffic

Approach
  • Local detection: compare LD-Sketch with CGT, SeqHash, and Fast Sketch, all of which are allocated the same amount of memory
  • Distributed detection: vary the value of $d$ (the number of workers selected per key)

Metrics
  • Recall: (# of returned true heavy keys) / (# of true heavy keys)
  • Precision: (# of returned true heavy keys) / (# of returned keys)
  • Update throughput
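
The two accuracy metrics can be computed directly from the returned and true key sets. The function names, the sample sets, and the convention of returning 1.0 for an empty denominator are assumptions.

```python
def recall(returned, truth):
    """(# of returned true heavy keys) / (# of true heavy keys)."""
    return len(returned & truth) / len(truth) if truth else 1.0

def precision(returned, truth):
    """(# of returned true heavy keys) / (# of returned keys)."""
    return len(returned & truth) / len(returned) if returned else 1.0

returned = {"a", "b", "c"}       # keys reported by a detector
truth = {"a", "b", "d", "e"}     # keys that are actually heavy
r = recall(returned, truth)      # 2 of 4 true heavy keys found
p = precision(returned, truth)   # 2 of 3 reported keys are correct
```

A detector with zero false negatives has recall 1.0, so its false positives show up only as reduced precision.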

Page 22: Accuracy of Local Detection: Heavy Changer

LD-Sketch achieves 100% recall

LD-Sketch has slightly lower precision than CGT and SeqHash, but this can be improved with distributed detection

Page 23: Accuracy of Distributed Detection: Heavy Changer

In one parameter setting, the precision is similar to that of local detection

In the other, the precision increases significantly at the cost of a little recall

Page 24: Throughput

[Figure panels: Local detection; Distributed detection]

LD-Sketch has slightly lower throughput than CGT and Fast Sketch in local detection

LD-Sketch can scale linearly in distributed detection

Page 25: Conclusions

Propose LD-Sketch, a sketching approach for real-time heavy key detection in a distributed architecture
  • Composed of local detection and distributed detection

Propose a sketch structure for local detection
  • High accuracy
  • Low complexity in space and time
  • Seamlessly deployed in a distributed architecture

Propose a distributed detection scheme
  • Reduces complexity
  • Improves accuracy