http://aqualab.cs.northwestern.edu
David Choffnes (Northwestern University*), Fabián Bustamante (Northwestern University), Zihui Ge (AT&T Labs)
*currently at U. Washington
Crowdsourcing Service-Level Network Event Detection
David Choffnes 2
Internet driven by services
Internet activity increasingly driven by services
– VoIP, video streaming, games, file downloads

User experience as a key benchmark
– Largely determined by frequency, duration and severity of network problems (events)

To minimize impact on users
– Identify problems affecting end-to-end performance
– Online, reliable detection and isolation, potentially across networks
Crowdsourcing Network Monitoring
Detecting network events
Variety of existing detection approaches
– Internal link/router monitoring
• Limited to a single administrative domain
– BGP monitoring
• Identifies only control-plane issues
– Distributed active probing
• Overhead/cost scales with size of monitored network

Limited or no visibility into end-to-end performance
– Particularly problematic for edge networks
Need a scalable solution that captures what the user sees
Crowdsourcing Event Monitoring (CEM)
Push monitoring to the edge systems themselves
– Monitor from inside network-intensive applications
– Detect drops in performance
If enough hosts see
• the same performance problem,
• at the same time,
• in the same network…
the problem is likely to be the network
Outline
…
Crowdsourcing network monitoring
– General approach
– Case study using confirmed network problems
– Wide-area evaluation

System implementation
– BitTorrent extension (NEWS) installed by >48k users
Conclusion
System Requirements
Scalability
– Use passive monitoring, fully distribute detection

Localization in time and space
– Online detection
– Isolation to network regions

Privacy
– Use only network location, not identity

Reliability from uncontrolled hosts
– Probability analysis to identify likely real events

Adoption
– Build inside popular applications and/or use incentives
Approach and architecture
[Diagram: performance signals (e.g., upload/download rate) feed Local Event Detection inside the distributed system]
Approach and architecture
Local detection
– Passively monitor local performance information (signals)
• General (e.g., transfer rates) and application-specific (e.g., content availability in BitTorrent)
– Detect drops in performance
• E.g., dropped video frame, sudden drop in throughput
• Filter out cases that are normal application behavior
– E.g., a BitTorrent peer finishes downloading but still seeds
– Publish information only about these suspected local events
System Architecture
[Diagram: within the distributed system, performance signals feed Local Event Detection, which publishes local events to Distributed Storage]
Approach and architecture
Group corroboration
– Gather information about local events in the same network
– Identify synchronous problems that are unlikely to occur by chance
– Likelihood ratio to distinguish network events from coincidence

Who can identify network events?
– Each user can detect events separately
– Any third party with access to the distributed storage can do the same (e.g., network operators)
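Corroboration can be sketched as grouping suspected local events by network location and time window and counting distinct reporters. Everything below (the tuple layout, the ASN/prefix grouping key, the window size) is illustrative, not the NEWS implementation.

```python
from collections import defaultdict

def corroborate(events, window=60):
    """Group suspected local events by (ASN, BGP prefix, time bucket) and
    count how many distinct hosts report an event in each group.
    `events` is an iterable of (host_id, asn, prefix, timestamp) tuples."""
    groups = defaultdict(set)
    for host, asn, prefix, ts in events:
        bucket = int(ts // window)  # coarse time bucket in seconds
        groups[(asn, prefix, bucket)].add(host)
    # A group with many distinct reporters is a candidate network event.
    return {key: len(hosts) for key, hosts in groups.items()}

# Hypothetical reports: three hosts in AS100 within one minute,
# plus one unrelated host in AS200.
reports = [
    ("h1", 100, "10.0.0.0/8", 5), ("h2", 100, "10.0.0.0/8", 20),
    ("h3", 100, "10.0.0.0/8", 50), ("h4", 200, "192.0.2.0/24", 30),
]
counts = corroborate(reports)
```

Only the group counts ever need to leave the host, which is what keeps the scheme privacy-preserving: network location, not identity.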
System Architecture
[Diagram: performance signals feed Local Event Detection; confirmed local events are published to Distributed Storage; Group Corroboration reads remote events from the storage; an ISP operator can tap the distributed storage]
Evaluating the approach
Participatory monitoring challenges
– Needs large-scale adoption, active users, distributed storage
– Edge traces are rare

P2P applications are a natural fit
– Used worldwide, generate diverse flows
– BitTorrent is one of the most popular
• Consumes large amounts of bandwidth
• Vuze client allows extensibility, piggyback on existing users
• Built-in distributed storage (DHT)

Ono dataset for traces
– Installed by more than 1 million BitTorrent users
– Network and BitTorrent-specific information from hundreds of thousands of users worldwide
CEM Case Study
Evaluate the effectiveness of our approach using BitTorrent

How (well) does it work?
– Case study: British Telecom (BT) network
– Provides confirmed events through a Web interface
– 27 April 2009, 3:54 PM
• "We are aware of a network problem which may be affecting access to the internet in certain areas. Our engineers are working to resolve the problem. We apologize for any inconvenience this may cause."
• Resolved: 27 April 2009, 8:50 PM
– Similar to events seen in other networks

"Enough users complained about the network being slow and we're looking into it."
"As of 9PM, we're pretty sure we fixed the problem so we marked it resolved."
Local detection in BitTorrent
Peers monitor multiple performance signals
– General: e.g., transfer rates, number of connected peers
– Protocol-specific: torrent availability

Detect drops in throughput as local events

Individual signals
– Noisy
– Uncontrolled duration
– Wide range of values
Moving-average smoothing reveals events
[Plot: throughput over time; performance drops around 10:54, a further drop at 14:50, final recovery at ~17:30]
Group corroboration
Given locally detected events, why would they occur at the same time?
1. Service-specific problems (e.g., no seeder)
– Use application-level information
2. Coincidence (e.g., noisy local detection)
– Union probability
3. Problem isolated to one or more networks
– Group hosts according to network location
Coincidence in network events
Coincidence assumes that local events occur independently

Calculate this using the union probability (Pu)
– P(Lh): probability of host h seeing a local event

For large n, the likelihood of coincidence is very small
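As a minimal illustration (assuming fully independent hosts; the exact union-probability formulation in the paper may differ), the chance that a given set of n hosts all flag a local event in the same window is the product of their per-host rates P(Lh), which vanishes quickly as n grows:

```python
import math

def coincidence_probability(p_local):
    """Probability that ALL the given hosts see a local event in the same
    window purely by chance, assuming independent hosts.
    p_local: list of per-host local-event probabilities P(L_h)."""
    return math.prod(p_local)

# Hypothetical per-host event rates of 5% per window:
three_hosts = coincidence_probability([0.05] * 3)  # ~1.25e-4
nine_hosts = coincidence_probability([0.05] * 9)   # far smaller still
```

The more hosts corroborate, the harder it is to explain the observation as coincidence, which is what the likelihood ratio on the next slide exploits.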
Likelihood ratio
Are detected network problems occurring more often than chance?

Comparing probabilities
– Coincidence (Pu)
– Network (Pe)
• Measure how often synchronized events occur in the same network

Likelihood ratio: LR = Pe/Pu

LR > 1: events more likely due to the network than chance
– Empirically derive a stricter LR threshold
– Use LR as a tuning knob to control the rate of event detection
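The decision rule itself is a one-liner; a sketch with a tunable threshold (the example probabilities below are hypothetical):

```python
def likelihood_ratio(p_network, p_union):
    """LR = Pe / Pu: how much more often synchronized events occur in this
    network (empirical Pe) than independence would predict (Pu)."""
    return p_network / p_union

def is_network_event(p_network, p_union, threshold=1.0):
    # LR > threshold: more likely a network problem than coincidence.
    # A stricter, empirically derived threshold trades recall for precision.
    return likelihood_ratio(p_network, p_union) > threshold

flagged = is_network_event(0.02, 0.001)            # LR = 20, flagged
ignored = is_network_event(0.001, 0.002)           # LR = 0.5, ignored
strict = is_network_event(0.004, 0.001, threshold=5)  # LR = 4 < 5, ignored
```

Raising the threshold is the "tuning knob": it suppresses borderline detections at the cost of missing weaker events.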
Likelihood ratios in BT Yahoo
[Plot: likelihood ratios over time for W=10, σ=1.5 and W=20, σ=2.0; a congestion event appears after recovery]

Most events are no more likely than chance

All LR>1 correspond to actual network events!
Wide-area evaluation
Gold standard: false positives/negatives…
– Almost no ISPs want to publish records of network problems
– Existing approaches do not target service-level events
– In short, there is no "ground truth"
– Affects all research in this domain

What we can do
– Find ISPs reporting network events via public interfaces
– Work with ISPs under NDAs

Compare our approach with ISP information
– Only works where we have coverage (users)
Evaluation criteria
Coverage
– Confirmed events
– Number of networks covered worldwide
– Cross-network events

Efficiency
– Event detection rate
– Overhead
Effectiveness – BT Yahoo
One month from BT Yahoo
– Detected: 181 events
– 54 occur during confirmed problems
– The remaining are not necessarily false positives
• Even if so, about 4 events per day
Effectiveness – North American ISP
One month of outage data
– Varies according to the number of subscribers (S) in each region
– S > 10,000
• We detect 50% (38% more may be detected, but we don't have enough users to confirm)
– 10,000 > S > 1,000
• 67% may be detected, but without sufficient corroboration
Robustness to parameters
Robust to various detection settings and populations
– Number of users in a network is not strongly correlated with number of events detected
– Network problems detected only 2% of the time for small MA deviations, 0.75% of the time for large ones
• Can be filtered with the likelihood-ratio threshold

Sensitive local detection (MA settings: 1.5σ, w=10)
Less sensitive detection (MA settings: 2.2σ, w=20)
Summary
Service-level monitoring through crowdsourcing
– Push monitoring to applications at the edge
– Scalable, distributed approach to detection
– Evaluation using large-scale P2P trace data
NEWS implementation and deployment
Plugin for the Vuze BitTorrent client
– More than 48,000 installs
– Core classes for event detection are only ~1,000 LOC
– Lots more code for UI, user notifications

Event detection
– Local detection based on 15-second samples of performance information
• Transfer rates
• Torrent state (leech/seed)
• Content availability
– Group corroboration and localization
• Publishes event information to the built-in DHT
• Uses BGP prefix and ASN information for group corroboration (already collected by Vuze)
Food for thought
Open issues
– Which network groupings are best?
• Whois listings, topologies, ISP-specific…
– Where is the ground truth?
• Crowdsourcing event labeling (Newsight)
– Can we apply these principles to other services?
• VoIP, video streaming, CDNs
Questions?
Backups
Related work
Crowdsourcing
– Human computation [von Ahn]
– Intrusion detection [Dash et al.]

Events detected
– Layer-3 and below [Lakhina et al., Mahajan et al.]
– End-to-end [Madhyastha et al., Zhang et al.]

Monitoring location
– In network (NetFlow)
– Distributed probing [Feamster et al., Katz-Bassett et al.]
– Edge systems [Shavitt et al., Simpson et al.]

Measurement technique
– Active [Andersen et al.]
– Passive [Zhang et al., Casado et al., …]
Snapshot of what we collect
Started (proper) collection in December 2007

Daily stats (approximate)
– 3 to 4 GB of compressed data
– About 10 to 20 GB of raw data
– 2.5–3M traceroutes
– 100–150M connection samples
[Charts: per-connection samples, per-download samples, and traceroutes collected per day]
Wide area events
Detected problems in the US, Europe, and Asia

Identified potential cross-network events
– Use ISP relationships and correlate per-ISP events
– Detected cases in seven countries
Robustness to parameters
Robust to various detection settings and populations
– Number of users in a network is not strongly correlated with number of events detected

[Plots: networks ordered by # users and ordered by # events]
How much does it cost?
Simulate using 3 stddev thresholds and 2 window sizes in parallel
– Allows NEWS to detect multiple types of events
– Model caching in the DHT

Count the number of DHT operations at any time

Goals
– Low cost (does not affect the user's transfers)
– Privacy preserving
– No reliance on infrastructure
Events at each time step
Reads and writes are clustered
– Expected for a system that detects events
– Diurnal pattern
How much does it cumulatively cost?
One read every 10 seconds, one write every two minutes (spread over hundreds of users)

Reasonable load on a DHT
– Kademlia caches values close to the host
– 38 bytes per read/write, about 4 B/s overhead
Strawman cost
Decentralized approach not using summaries
– 13 signals × 4 bytes = 52 bytes, every 15 seconds
– Sharing incurs O(N²) cost
– 1,000 hosts: 34.6 MB/s

Centralized approach to collecting data
– 13 signals × 4 bytes = 52 bytes, every 15 seconds
– 1,000 hosts: 4 KB/s
– Plus CPU/memory costs for processing this data
– Ignores important issues
• Who hosts this?
• Privacy?
• Cross-network events?
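A back-of-the-envelope check of the centralized figure, under the bare 52-byte-payload assumption (the decentralized total also depends on per-message protocol overhead, which this sketch ignores):

```python
SIGNALS, BYTES_PER_SIGNAL, INTERVAL_S, HOSTS = 13, 4, 15, 1000
sample_bytes = SIGNALS * BYTES_PER_SIGNAL  # 52 bytes per sample

# Centralized: every host sends one sample per interval to one collector.
central_bps = HOSTS * sample_bytes / INTERVAL_S  # ~3.5 KB/s, i.e. the "4 KB/s"

# Decentralized without summaries: each host shares its sample with every
# other host, so aggregate traffic grows as O(N^2).
decentral_bps = HOSTS * (HOSTS - 1) * sample_bytes / INTERVAL_S
```

Either way, the quadratic blow-up is what makes raw-signal sharing untenable and motivates publishing only compact suspected-event summaries to the DHT.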
NEWS UI
Notifications through a non-intrusive interface

List of events for historical information
NEWS Usage
The view from 1 million users
[Map: approximate user counts by region: 231,000; 547,000; 35,000; 1,096]
Extension to multiple signals
Leverage independent signals at each host
– For example, upload rate and download rate
– Even more unlikely that both signals are affected at the same time by coincidence
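Under the independence assumption, requiring several signals to drop together multiplies their false-alarm rates; a minimal sketch (the 5% rate is hypothetical):

```python
def joint_coincidence(p_signals):
    """Probability that ALL of a host's independent signals (e.g., upload
    rate AND download rate) drop in the same window purely by chance."""
    p = 1.0
    for p_s in p_signals:
        p *= p_s
    return p

# One noisy signal firing 5% of the time vs. two independent signals:
one = joint_coincidence([0.05])        # 0.05
two = joint_coincidence([0.05, 0.05])  # 0.0025
```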
Detecting throughput drops
Moving average
– Low cost
– Few parameters
– Well understood

Key parameters
– Window size, deviation

Approach
– Find the mean for each window
– Find how much the next sample deviates from the mean

Key question
– Can we find good window sizes and threshold deviations?
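The window-mean/deviation steps above can be sketched as follows (a minimal version; the actual NEWS smoothing and edge handling may differ, and the trace below is hypothetical):

```python
def detect_drops(samples, window=10, sigma=1.5):
    """Flag a sample as a suspected local event when it falls more than
    `sigma` standard deviations below the mean of the previous `window`
    samples. `window` and `sigma` mirror the (w, sigma) settings above."""
    events = []
    for i in range(window, len(samples)):
        hist = samples[i - window:i]
        mean = sum(hist) / window
        std = (sum((x - mean) ** 2 for x in hist) / window) ** 0.5
        if std > 0 and samples[i] < mean - sigma * std:
            events.append(i)
    return events

# Throughput trace (KB/s) with a sudden drop at index 12:
trace = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 100, 40]
drops = detect_drops(trace)  # -> [12]
```

Small windows and small sigma react quickly but flag more ordinary noise; larger settings are less sensitive, which is exactly the trade-off the likelihood-ratio threshold then filters.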
Relative likelihood of coincidence

Simulate different numbers of hosts seeing events independently, with a normally distributed probability

The more peers seeing an event at the same time, the less likely it occurs by coincidence
– Five orders of magnitude between 3 peers and 9 peers corroborating an event
Ground truth hard to come by
Can we crowdsource event labeling?
Make information available to community
Newsight
http://aqualab.cs.northwestern.edu/projects/news/newsight.html