Inference, monitoring and recovery of large scale networks
CSE Department, Penn State University
Institute for Networking and Security Research
Faculty: Thomas La Porta
Post-Doc: Simone Silvestri
Ph.D. Students: Srikar Tati, Brett Holbert, Michael Lin
Problems and challenges in large scale networks
INSR Industry Day 2014
Research problems
  Inferencing
  Monitoring
  Recovery
Challenges
  Large scale
  Partial information
  Interdependent networks
  Constraints (time, cost, ...)
This research is sponsored by:
  Defense Threat Reduction Agency (DTRA)
  Army Research Lab and UK Ministry of Defence - ITA Program
Figure: Internet router-level topology (Merlin tool)
Inferencing: motivation
The lack of global knowledge of the Internet topology:
  Hinders network diagnostics (losses, failures, bottlenecks)
  Inflates IP path lengths
  Reduces accuracy of models
  Encourages overlay networks to ignore the underlay
Network operators rarely publish their topologies
Current inference approaches rely on tools such as Traceroute
Traceroute provides only partial information
The network is only partially observable
Previous approaches fail or perform poorly
Our problem: infer the routing topology in the presence of partial information
Inferencing: our approach - iTop
iTop algorithm:
  Fills unobservable parts of the network with virtual links/routers
  Analyzes the traces to determine properties of the real topology
  Iteratively merges links to infer the real network
Figure: iTop workflow (ground truth topology, virtual topology, trace analysis, merging algorithm, inferred topology)
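The first iTop step, filling unobservable parts of the network with virtual routers, can be illustrated with a minimal sketch. This is not the iTop implementation: the trace format (a list of hop names, with None for a hop that did not answer) and the function name build_virtual_topology are assumptions made only for illustration, and the merging phase is not shown.

```python
from itertools import count


def build_virtual_topology(traces):
    """Return a set of undirected links; each unobserved hop (None)
    is replaced by a fresh virtual router such as 'v1', 'v2', ...
    (hypothetical naming, assumed for this sketch)."""
    fresh = count(1)
    edges = set()
    for trace in traces:
        # Give every unobserved hop its own placeholder node.
        hops = [h if h is not None else f"v{next(fresh)}" for h in trace]
        # Consecutive hops in a trace are joined by a link.
        for a, b in zip(hops, hops[1:]):
            edges.add(frozenset((a, b)))
    return edges


# Two traces from A that cross an unobserved router.
traces = [
    ["A", None, "C"],
    ["A", None, "D"],
]
virtual_edges = build_virtual_topology(traces)
```

Here the two unobserved hops become distinct virtual routers v1 and v2. Whether they are in fact the same physical router is the kind of question a later merging step would have to decide.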
Inferencing: our approach - Results
We compare our approach to state-of-the-art inferencing approaches:
  X. Jin, W.-P. Yiu, S.-H. Chan, and Y. Wang, “Network topology inference based on end-to-end measurements,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 12, pp. 2182–2195, 2006.
  B. Yao, R. Viswanathan, F. Chang, and D. Waddington, “Topology inference in the presence of anonymous routers,” IEEE Infocom, 2003.
We consider realistic networks
We also show how iTop improves the performance of failure diagnosis algorithms in the presence of partial information
Monitoring: motivation (1)
Accurate knowledge of the internal network state enables:
  Performance diagnosis
  Resource allocation
  Efficient routing
  Congestion control
Monitoring large scale networks may incur high overhead
Network tomography
Infer internal network from end-to-end measurements
Solve a linear system
Enables efficient monitoring by probing only a basis of the system
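The tomography idea above can be sketched on a toy example. The assumptions here are not from the slides: an additive link metric (delay), noise-free measurements, a 0/1 routing matrix R with one row per path and one column per link, and the helper names select_basis and solve_square. The sketch keeps only linearly independent paths and solves the resulting linear system for the link delays.

```python
from fractions import Fraction


def select_basis(rows):
    """Return indices of rows that are linearly independent (greedy scan)."""
    basis_idx, reduced = [], []  # reduced holds (pivot column, reduced row)
    for i, row in enumerate(rows):
        r = [Fraction(v) for v in row]
        for pivot, b in reduced:
            if r[pivot] != 0:
                f = r[pivot] / b[pivot]
                r = [x - f * y for x, y in zip(r, b)]
        pivot = next((j for j, v in enumerate(r) if v != 0), None)
        if pivot is not None:  # row adds new information
            reduced.append((pivot, r))
            basis_idx.append(i)
    return basis_idx


def solve_square(A, y):
    """Solve A x = y by Gauss-Jordan elimination (A square and invertible)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(t)] for row, t in zip(A, y)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


# Four candidate paths over three links; the last path is redundant.
R = [[1, 1, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
true_delay = [2, 3, 5]  # hidden link delays, used only to simulate probes
basis = select_basis(R)
y = [sum(a * d for a, d in zip(R[i], true_delay)) for i in basis]
estimated_delay = solve_square([R[i] for i in basis], y)
```

Only three of the four paths need to be probed, since the fourth row is a linear combination of the others. With real, noisy measurements a least-squares solve would be the more natural choice.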
Monitoring: motivation (2)
Failures are common events in modern networks
Failures can significantly affect the performance of network tomography
Probing incurs a cost, and often only a maximum budget is available
Our problem: select a set of probing paths to maximize the performance of network tomography under failures with a limited budget
Monitoring: our approach
We translate the problem into the maximization of a submodular function under a budget constraint
We propose the algorithm RoMe:
  Makes use of recent advances in submodular maximization theory
  Has an approximation factor of (1-1/e)/2
  Is optimal under the additional constraint of linear independence
  Assumes knowledge of the failure distribution
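To make the budgeted submodular formulation concrete, here is a generic greedy sketch. It is not RoMe, and no approximation guarantee is claimed for it. The objective (number of distinct links covered by the selected paths) is a stand-in submodular function chosen for illustration, and the path names, costs, and function names are all hypothetical. The sketch runs a cost-benefit greedy and a plain greedy and keeps the better result.

```python
def coverage(selected, paths):
    """Stand-in objective: number of distinct links covered (submodular)."""
    covered = set()
    for p in selected:
        covered |= paths[p]
    return len(covered)


def greedy(paths, cost, budget, use_cost):
    """Add the path with the best marginal gain (optionally per unit cost)
    until no affordable path improves the objective."""
    chosen, spent = [], 0.0
    remaining = set(paths)
    while remaining:
        base = coverage(chosen, paths)
        best, best_score = None, 0.0
        for p in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + cost[p] > budget:
                continue
            gain = coverage(chosen + [p], paths) - base
            score = gain / cost[p] if use_cost else gain
            if score > best_score:
                best, best_score = p, score
        if best is None:
            break
        chosen.append(best)
        spent += cost[best]
        remaining.remove(best)
    return chosen


def budgeted_select(paths, cost, budget):
    """Keep the better of the cost-benefit greedy and the plain greedy."""
    candidates = [
        greedy(paths, cost, budget, use_cost=True),
        greedy(paths, cost, budget, use_cost=False),
    ]
    return max(candidates, key=lambda s: coverage(s, paths))


paths = {"p1": {"a", "b", "c", "d"}, "p2": {"a"}, "p3": {"b"}}
cost = {"p1": 5, "p2": 1, "p3": 1}
selected = budgeted_select(paths, cost, budget=5)
```

In this example the cost-benefit greedy picks the two cheap paths and can no longer afford p1, while the plain greedy picks p1 directly and covers more links. Taking the better of the two runs avoids that failure mode.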
We consider the case of unknown failure distribution
We propose the algorithm LSR (Learning with Submodular Rewards):
  Reinforcement learning approach
  Learns path availabilities
  Performance guarantees
Figure: LSR loop (init, select paths, collect measurements, update path availabilities)
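The loop above (init, select paths, collect measurements, update path availabilities) can be sketched with a simple optimistic estimate per path. This is not LSR: the reward here is just whether a probed path was available, not a submodular reward, and the UCB-style bonus, the simulated availabilities, and the function name learn_availabilities are assumptions for illustration only.

```python
import math
import random


def learn_availabilities(true_avail, k, rounds, seed=0):
    """Probe k paths per round and learn empirical path availabilities."""
    rng = random.Random(seed)
    paths = sorted(true_avail)
    # Init: no probes and no successes recorded yet.
    probes = {p: 0 for p in paths}
    successes = {p: 0 for p in paths}

    for t in range(1, rounds + 1):
        def optimistic(p):
            # Unprobed paths come first; otherwise mean plus exploration bonus.
            if probes[p] == 0:
                return float("inf")
            mean = successes[p] / probes[p]
            return mean + math.sqrt(2 * math.log(t) / probes[p])

        # Select paths.
        chosen = sorted(paths, key=optimistic, reverse=True)[:k]
        # Collect measurements (simulated) and update path availabilities.
        for p in chosen:
            probes[p] += 1
            if rng.random() < true_avail[p]:
                successes[p] += 1

    estimates = {
        p: (successes[p] / probes[p] if probes[p] else None) for p in paths
    }
    return estimates, probes


estimates, probes = learn_availabilities(
    {"p1": 0.9, "p2": 0.5, "p3": 0.1}, k=2, rounds=200
)
```

The true availabilities are used only to simulate probe outcomes; the learner sees nothing but successes and failures of the paths it chose to probe.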
Monitoring: results
We compare our approach to state-of-the-art path selection algorithms:
  Y. Chen, D. Bindel, H. Song, and R. H. Katz, “An algebraic approach to practical and scalable overlay network monitoring,” ACM SIGCOMM Comp. Com. Rev., 2004.
We consider realistic topologies and failure models
Recovery: motivation
Modern networks are highly interdependent:
  The Internet and the smart grid
  Water supply, transportation, fuel and power stations are coupled together
Interdependent networks are extremely sensitive to failures
Failures may create performance degradation
Degradation can also propagate in the surviving network
Example: the electrical blackout that occurred in Italy in September 2003
Recovery: research problems (1)
Recovery algorithms for overlay networks
Two networks sharing the same infrastructure
Failures occur in the underlay network and affect the overlay
Models an emergency urban communication network after a weapon of mass destruction attack
We aim to restore the functionality of the overlay network by repairing the underlay
Objectives & constraints:
  Bandwidth
  Time
  Cost
  Utility
Recovery: research problems (2)
Models for temporal propagation of failures
Two general interdependent networks
Failures propagate over time:
  Backup batteries/generators
  Local solar plant supply
Given the initial failure, our model will:
  Estimate the probability that one element fails at a given time
  Estimate the expected time at which one element fails
  Estimate the expected number of failed elements at a given time
This information will be used to design recovery strategies
These models will be mapped and validated with real interdependent networks
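The kind of estimates listed above can be illustrated with a small Monte Carlo sketch. The propagation rule is an assumption, not the model from this work: an element fails once any of its suppliers has failed and its own backup, with an exponentially distributed lifetime, is exhausted. The element names, the dependency map, and the function names are hypothetical.

```python
import math
import random


def simulate_once(depends_on, initial_failed, mean_backup, rng):
    """One random run: return the failure time of every element."""
    # Backup lifetime (battery, generator) drawn once per element.
    backup = {n: rng.expovariate(1.0 / mean_backup) for n in depends_on}
    fail_time = {
        n: (0.0 if n in initial_failed else math.inf) for n in depends_on
    }
    changed = True
    while changed:  # relax until no failure time can be lowered
        changed = False
        for n, suppliers in depends_on.items():
            for s in suppliers:
                candidate = fail_time[s] + backup[n]
                if candidate < fail_time[n]:
                    fail_time[n] = candidate
                    changed = True
    return fail_time


def estimate(depends_on, initial_failed, mean_backup, t, runs=2000, seed=0):
    """Estimate P(element failed by time t) and the expected number failed."""
    rng = random.Random(seed)
    failed_by_t = {n: 0 for n in depends_on}
    for _ in range(runs):
        times = simulate_once(depends_on, initial_failed, mean_backup, rng)
        for n, ft in times.items():
            if ft <= t:
                failed_by_t[n] += 1
    prob = {n: c / runs for n, c in failed_by_t.items()}
    return prob, sum(prob.values())


# Toy chain: the router depends on power, the control system on the router.
depends_on = {"power": [], "router": ["power"], "scada": ["router"]}
prob, expected_failed = estimate(
    depends_on, initial_failed={"power"}, mean_backup=4.0, t=4.0
)
```

Elements further down the dependency chain fail later in every run, so their estimated failure probability at a given time can never exceed that of their supplier.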
Recovery: research problems (3)
Improve network robustness:
Re-design existing networks
Design new networks less prone to cascading effects
Models and recovery strategies for performance degradation over time
Partial knowledge
Partial control
Multiple interdependent networks
Thank you! Any questions?