Page 1:

http://aqualab.cs.northwestern.edu

David Choffnes (Northwestern University*), Fabián Bustamante (Northwestern University), Zihui Ge (AT&T Labs)

*currently at U. Washington

Crowdsourcing Service-Level Network Event Detection

Page 2:

Internet driven by services

Internet activity increasingly driven by services
– VoIP, video streaming, games, file downloads

User experience as a key benchmark
– Largely determined by frequency, duration and severity of network problems (events)

To minimize impact on users
– Identify problems affecting end-to-end performance
– Online, reliable detection and isolation, potentially across networks

Page 3:

Detecting network events

Variety of existing detection approaches
– Internal link/router monitoring
• Limited to a single administrative domain
– BGP monitoring
• Identifies only control-plane issues
– Distributed active probing
• Overhead/cost scales with the size of the monitored network

Limited or no visibility into end-to-end performance
– Particularly problematic for edge networks

Need a scalable solution that captures what the user sees

Page 4:

Crowdsourcing Event Monitoring (CEM)

Push monitoring to the edge systems themselves
– Monitor from inside network-intensive applications
– Detect drops in performance

If enough hosts see
• the same performance problem,
• at the same time,
• in the same network…
the problem is likely to be the network

Page 5:

Outline

Crowdsourcing network monitoring
– General approach
– Case study using confirmed network problems
– Wide-area evaluation

System implementation
– BitTorrent extension (NEWS) installed by >48k users

Conclusion

Page 6:

System Requirements

Scalability
– Use passive monitoring, fully distribute detection

Localization in time and space
– Online detection
– Isolation to network regions

Privacy
– Use only network location, not identity

Reliability from uncontrolled hosts
– Probability analysis to identify likely real events

Adoption
– Build inside popular applications and/or use incentives

Page 7:

Approach and architecture

[Architecture diagram: performance signals (e.g., upload/download rate) feed Local Event Detection inside the distributed system]

Page 8:

Approach and architecture

Local detection
– Passively monitor local performance information (signals)
• General (e.g., transfer rates) and application-specific (e.g., content availability in BitTorrent)
– Detect drops in performance
• E.g., dropped video frame, sudden drop in throughput
• Filter out cases that are normal application behavior
– E.g., BitTorrent peer finishes downloading but still seeds
– Publish information only about these suspected local events

Page 9:

System Architecture

[Architecture diagram: performance signals feed Local Event Detection, which publishes local events to Distributed Storage within the distributed system]

Page 10:

Approach and architecture

Group corroboration
– Gather information about local events in the same network
– Identify synchronous problems that are unlikely to occur by chance
– Likelihood ratio to distinguish network events from coincidence

Who can identify network events?
– Each user can detect events separately
– Any third party with access to the distributed storage can do the same (e.g., network operators)

Page 11:

System Architecture

[Architecture diagram: performance signals feed Local Event Detection; confirmed local events go to Distributed Storage; Group Corroboration combines them with remote events read from the distributed storage, which an ISP operator can also tap]

Page 12:

Evaluating the approach

Participatory monitoring challenges
– Needs large-scale adoption, active users, distributed storage
– Edge traces are rare

P2P applications are a natural fit
– Used worldwide, generate diverse flows
– BitTorrent is one of the most popular
• Consumes large amounts of bandwidth
• Vuze client allows extensibility, piggyback on existing users
• Built-in distributed storage (DHT)

Ono dataset for traces
– Installed by more than 1 million BitTorrent users
– Network and BitTorrent-specific information from hundreds of thousands of users worldwide

Page 13:

CEM Case Study

Evaluate effectiveness of our approach using BitTorrent

How (well) does it work?
– Case study: British Telecom (BT) network
– Provides confirmed events through a Web interface
– 27 April 2009, 3:54 PM
• "We are aware of a network problem which may be affecting access to the internet in certain areas. Our engineers are working to resolve the problem. We apologize for any inconvenience this may cause."
• Resolved: 27 April 2009, 8:50 PM
– Similar to events seen in other networks

"Enough users complained about the network being slow and we're looking into it."
"As of 9PM, we're pretty sure we fixed the problem so we marked it resolved."

Page 14:

Local detection in BitTorrent

Peers monitor multiple performance signals
– General, e.g., transfer rates, number of connected peers
– Protocol-specific, e.g., torrent availability

Detect drops in throughput as local events

Individual signals
– Noisy
– Uncontrolled duration
– Wide range of values

Page 15:

Moving-average smoothing reveals events

[Smoothed throughput for the BT case study: performance drops around 10:54, drops further at 14:50, final recovery at ~17:30]

Page 16:

Group corroboration

Given locally detected events, why would they occur at the same time?
1. Service-specific problems (e.g., no seeder)
• Use application-level information
2. Coincidence (e.g., noisy local detection)
• Union probability
3. Problem isolated to one or more networks
• Group hosts according to network location

Page 17:

Coincidence in network events

Coincidence assumes that local events occur independently

Calculate this using the union probability (Pu)
– P(Lh): probability of host h seeing a local event
(see the formula below)

For large n, the likelihood of coincidence is very small

Page 18:

Likelihood ratio

Are detected network problems occurring more often than chance?

Comparing probabilities
– Coincidence (Pu)
– Network (Pe)
• Measures how often synchronized events occur in the same network

Likelihood ratio: LR = Pe/Pu

LR > 1: events more likely due to the network than chance
– Empirically derive a stricter LR threshold
– Use LR as a tuning knob to control the rate of event detection
(see the sketch below)
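A minimal sketch of this corroboration step, assuming a hypothetical boolean matrix flags[t, h] that records whether host h flagged a local event in time window t. Pe is the observed rate of windows with at least min_hosts concurrent detections; Pu is the rate expected by chance under independence, estimated here by Monte Carlo from each host's own detection rate. This illustrates the idea only; it is not the deployed implementation.

import numpy as np

def likelihood_ratio(flags: np.ndarray, min_hosts: int = 3,
                     n_sim: int = 100_000, seed: int = 0) -> float:
    # flags[t, h] is True if host h flagged a local event in window t.
    rng = np.random.default_rng(seed)

    # Observed probability of synchronized events in this network (Pe).
    pe = np.mean(flags.sum(axis=1) >= min_hosts)

    # Per-host empirical detection rates P(L_h).
    p_host = flags.mean(axis=0)

    # Chance probability under independence (Pu), estimated by simulation.
    sim = rng.random((n_sim, flags.shape[1])) < p_host
    pu = np.mean(sim.sum(axis=1) >= min_hosts)

    return float("inf") if pu == 0 else pe / pu

LR values well above 1 indicate that concurrent detections cluster more than independent noise would predict.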

Page 19:

Likelihood ratios in BT Yahoo

[Likelihood ratios over time in BT Yahoo for two detection settings (W=10, σ=1.5 and W=20, σ=2.0): most events are no more likely than chance, all LR>1 events correspond to actual network problems, and one congestion event appears after recovery]

Page 20:

Wide-area evaluation

Gold standard: false positives/negatives…
– Almost no ISPs want to publish records of network problems
– Existing approaches do not target service-level events
– In short, there is no "ground truth"
– Affects all research in this domain

What we can do
– Find ISPs reporting network events via public interfaces
– Work with ISPs under NDAs

Compare our approach with ISP information
– Only works where we have coverage (users)

Page 21:

Evaluation criteria

Coverage
– Confirmed events
– Number of networks covered worldwide
– Cross-network events

Efficiency
– Event detection rate
– Overhead

Page 22:

Effectiveness – BT Yahoo

One month of data from BT Yahoo
– Detected: 181 events
– 54 occur during confirmed problems
– Remaining events are not necessarily false positives
• Even if they were, that is about 4 events per day

Page 23:

Effectiveness – North American ISP

One month of outage data
– Coverage varies with the number of subscribers (S) in each region
– S > 10,000
• We detect 50% (38% more may be detected, but we do not have enough users to confirm)
– 10,000 > S > 1,000
• 67% may be detected, but without sufficient corroboration

Page 24:

Robustness to parameters

Robust to various detection settings and populations
– Number of users in a network is not strongly correlated with the number of events detected
– Network problems are detected only 2% of the time for small MA deviations, 0.75% of the time for large ones
• Can be filtered with the likelihood-ratio threshold

[Plots: sensitive local detection (MA settings: 1.5σ, w=10) vs. less sensitive detection (MA settings: 2.2σ, w=20)]

Page 25:

Summary

Service-level monitoring through crowdsourcing
– Push monitoring to applications at the edge
– Scalable, distributed approach to detection
– Evaluation using large-scale P2P trace data

Page 26:

NEWS implementation and deployment

Plugin for the Vuze BitTorrent client
– More than 48,000 installs
– Core classes for event detection are only ~1,000 LOC
– Much more code for UI and user notifications

Event detection
– Local detection based on 15-second samples of performance information
• Transfer rates
• Torrent state (leech/seed)
• Content availability
– Group corroboration and localization (see the sketch below)
• Publishes event information to the built-in DHT
• Uses BGP prefix and ASN information for group corroboration (already collected by Vuze)
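A rough sketch of how publishing and corroboration could fit together, assuming a hypothetical key-value interface to the DHT (dht_put/dht_get) and a key derived from network location and time window. The real NEWS plugin uses Vuze's built-in DHT and its own record format, so the names and signatures here are illustrative only.

import json
import time

WINDOW = 15  # seconds per sample, as in NEWS

def network_key(asn: int, prefix: str, window_start: int) -> str:
    # Key events by network location and time window, never by user identity.
    return f"news:{asn}:{prefix}:{window_start}"

def publish_local_event(dht_put, asn: int, prefix: str, signal: str) -> None:
    # Announce a suspected local event (e.g., a throughput drop) for this window.
    window_start = int(time.time()) // WINDOW * WINDOW
    record = json.dumps({"signal": signal, "window": window_start})
    dht_put(network_key(asn, prefix, window_start), record)

def corroborate(dht_get, asn: int, prefix: str, window_start: int,
                min_hosts: int = 3) -> bool:
    # Read all records other peers published for this network and window.
    # A full implementation would also apply the likelihood-ratio test.
    records = dht_get(network_key(asn, prefix, window_start)) or []
    return len(records) >= min_hosts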

Page 27:

Food for thought

Open issues
– Which network groupings are best?
• Whois listings, topologies, ISP-specific…
– Where is the ground truth?
• Crowdsourcing event labeling (Newsight)
– Can we apply these principles to other services?
• VoIP, video streaming, CDNs

Questions?

Page 28:

Questions?


Page 29:

Backups


Page 30:

Related work

Crowdsourcing
– Human computation [von Ahn]
– Intrusion detection [Dash et al.]

Events detected
– Layer-3 and below [Lakhina et al., Mahajan et al.]
– End-to-end [Madhyastha et al., Zhang et al.]

Monitoring location
– In network (NetFlow)
– Distributed probing [Feamster et al., Katz-Bassett et al.]
– Edge systems [Shavitt et al., Simpson et al.]

Measurement technique
– Active [Andersen et al.]
– Passive [Zhang et al., Casado et al., …]

Page 31:

Snapshot of what we collect

Started (proper) collection in December 2007

Daily stats (approximate)
– 3 to 4 GB of compressed data
– About 10 to 20 GB of raw data
– 2.5-3M traceroutes
– 100-150M connection samples

[Plots of daily collection volume over time: per-connection samples, per-download samples, and traceroutes]

Page 32:

Wide area events

Detected problems in the US, Europe and Asia

Identified potential cross-network events
– Use ISP relationships and correlate per-ISP events
– Detected cases in seven countries

Page 33:

Robustness to parameters

Robust to various detection settings and populations
– Number of users in a network is not strongly correlated with the number of events detected

[Plots: networks ordered by # users and by # events]

Page 34:

How much does it cost?

Simulate using 3 stddev thresholds and 2 window sizes in parallel
– Allows NEWS to detect multiple types of events
– Model caching in the DHT

Count the number of DHT operations at any time

Goals
– Low cost (does not affect the user's transfers)
– Privacy preserving
– No reliance on infrastructure

Page 35:

Events at each time step

Reads and writes are clustered
– Expected for a system that detects events
– Diurnal pattern

Page 36:

How much does it cumulatively cost?

One read every 10 seconds, one write every two minutes (spread over hundreds of users)

Reasonable load on a DHT
– Kademlia caches values close to the host
– 38 bytes per read/write, about 4 B/s overhead (arithmetic below)
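A quick back-of-the-envelope check of the quoted per-host overhead, taking the 38 bytes per DHT operation from the slide:

\frac{38\ \text{B}}{10\ \text{s}} + \frac{38\ \text{B}}{120\ \text{s}} \approx 3.8 + 0.3 \approx 4\ \text{B/s}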

Page 37:

Strawman cost

Decentralized approach not using summaries
– 13 signals x 4 bytes = 52 bytes, every 15 seconds
– Sharing incurs O(N²) cost
– 1,000 hosts: 34.6 MB/s

Centralized approach to collecting data
– 13 signals x 4 bytes = 52 bytes, every 15 seconds
– 1,000 hosts: 4 KB/s
– Plus CPU/memory costs for processing this data
– Ignores important issues
• Who hosts this?
• Privacy?
• Cross-network events?

Page 38:

NEWS UI

Notifications through non-intrusive interface

List of events for historical information


Page 39:

NEWS Usage


Page 40:

The view from 1 million users

[World map of Ono users by region, with counts including 547,000; 231,000; 35,000; and 1,096]

Page 41:

Extension to multiple signals

Leverage independent signals at each host
– For example, upload rate and download rate
– It is even more unlikely that both signals are affected at the same time by coincidence (see the formula below)
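Why independence helps, in one line: if the upload-rate and download-rate detectors fire by chance in a given window with probabilities p_up and p_dn (both well below 1) and the two signals are independent, then

P(\text{both fire by chance}) \;=\; p_{\text{up}} \cdot p_{\text{dn}} \;\ll\; \min(p_{\text{up}}, p_{\text{dn}}),

so requiring both signals to drop makes coincidental detections much rarer.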

Page 42:

Detecting throughput drops

Moving average
– Low cost
– Few parameters
– Well understood

Key parameters
– Window size, deviation

Approach (see the sketch below)
– Find the mean for each window
– Find how much the next sample deviates from the mean

Key questions
– Can we find good window sizes and threshold deviations?
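A minimal sketch of moving-average drop detection over throughput samples, parameterized by window size and deviation threshold; names and defaults are illustrative, not taken from the NEWS source.

from collections import deque
import statistics

def detect_drops(samples, w=10, threshold=1.5):
    # Flag a suspected local event when a sample falls more than `threshold`
    # standard deviations below the moving average of the previous `w` samples.
    # `samples` is an iterable of throughput values (e.g., bytes/s per 15 s sample).
    window = deque(maxlen=w)
    events = []
    for t, value in enumerate(samples):
        if len(window) == w:
            mean = statistics.fmean(window)
            stdev = statistics.pstdev(window)
            if stdev > 0 and value < mean - threshold * stdev:
                events.append(t)  # index of the suspected event
        window.append(value)
    return events

For example, detect_drops(rates, w=10, threshold=1.5) corresponds to the more sensitive setting shown earlier, and w=20 with threshold=2.0 to the less sensitive one.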

Page 43:

Relative likelihood of coincidence

Simulate different numbers of hosts seeing events independently, with a normally distributed per-host probability (see the sketch below)

The more peers seeing an event at the same time, the less likely it occurs by coincidence

Five orders of magnitude between 3 peers and 9 peers corroborating an event
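A sketch of the kind of simulation described above, with illustrative (assumed) parameters for the per-host detection probability; it shows how quickly the chance of many concurrent detections falls off.

import numpy as np

rng = np.random.default_rng(0)

N_HOSTS = 20                 # hosts in one network (assumed)
N_WINDOWS = 1_000_000        # simulated time windows
MEAN_P, STD_P = 0.05, 0.02   # per-host detection probability, normally distributed (assumed)

# Each host flags local events independently with its own probability.
p = np.clip(rng.normal(MEAN_P, STD_P, size=N_HOSTS), 0.0, 1.0)
flags = rng.random((N_WINDOWS, N_HOSTS)) < p
concurrent = flags.sum(axis=1)

# Relative likelihood that at least n hosts flag an event in the same window by chance.
for n in range(1, 10):
    print(n, (concurrent >= n).mean())

With these parameters the chance of 9 or more concurrent detections is several orders of magnitude smaller than the chance of 3 or more; very rare counts may need more simulated windows to resolve.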

Page 44:

Ground truth hard to come by

Can we crowdsource event labeling?

Make information available to community

Newsight

http://aqualab.cs.northwestern.edu/projects/news/newsight.html