
Inference, monitoring and recovery of large scale networks

CSE Department, PennState University

Institute for Networking and Security Research

Faculty: Thomas La Porta

Post-Doc: Simone Silvestri

Ph.D. Students: Srikar Tati, Brett Holbert, Michael Lin


Problems and challenges in large scale networks


INSR Industry Day 2014

Research problems

Inferencing
Monitoring
Recovery

Challenges

Large scale
Partial information
Interdependent networks
Constraints (time, cost, ...)

This research is sponsored by:
Defense Threat Reduction Agency (DTRA)
Army Research Lab and UK Ministry of Defence - ITA Program

Figure: Internet router level topology, Merlin Tool


Inferencing: motivation


The lack of global knowledge of the Internet topology:
Hinders network diagnostics (losses, failures, bottlenecks)
Inflates IP path lengths
Reduces accuracy of models
Encourages overlay networks to ignore the underlay

Network operators rarely publish their topologies

Current inference approaches rely on tools such as Traceroute

Traceroute provides only partial information
The network is only partially observable

Previous approaches fail or perform poorly

Our problem: infer the routing topology in the presence of partial information


Inferencing: our approach - iTop


iTop algorithm:

Fills unobservable parts of the network with virtual links/routers

Analyzes the traces to determine properties of the real topology

Iteratively merges links to infer the real network

Figure: iTop overview. Labels: ground truth topology, virtual topology, trace analysis, merging algorithm, inferred topology.
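The fill-then-merge idea can be illustrated with a toy sketch. This is not the iTop algorithm: the slides do not give its trace analysis or merging rules, so the rule used here (merge two virtual routers when they have exactly the same neighbors) is a simplifying assumption. The trace format (lists of router names with `None` for hops that did not answer) and the names `build_virtual_topology`, `merge_virtual_routers`, and `VIRTUAL_PREFIX` are hypothetical.

```python
from itertools import count

VIRTUAL_PREFIX = "virt#"  # hypothetical naming scheme for placeholder routers


def build_virtual_topology(traces):
    """Turn traces into undirected edges, one fresh virtual router per unobserved hop."""
    fresh = count()
    edges = set()
    for trace in traces:
        hops = [h if h is not None else f"{VIRTUAL_PREFIX}{next(fresh)}" for h in trace]
        for a, b in zip(hops, hops[1:]):
            edges.add(frozenset((a, b)))
    return edges


def neighbors(edges, node):
    """Set of nodes adjacent to `node`."""
    return {n for e in edges if node in e for n in e if n != node}


def merge_virtual_routers(edges):
    """Repeatedly merge two virtual routers that have identical neighbor sets.

    Assumed toy rule for illustration only; the real merging conditions
    are derived from trace analysis and are not stated on the slides.
    """
    edges = set(edges)
    while True:
        virtual = sorted({n for e in edges for n in e if n.startswith(VIRTUAL_PREFIX)})
        pair = next(
            ((a, b)
             for i, a in enumerate(virtual)
             for b in virtual[i + 1:]
             if neighbors(edges, a) == neighbors(edges, b)),
            None,
        )
        if pair is None:
            return edges
        keep, drop = pair
        # Rewire every edge of `drop` onto `keep`; the set removes duplicates.
        edges = {frozenset(keep if n == drop else n for n in e) for e in edges}


# Hypothetical traces: the hop between B and D never answers.
traces = [
    ["A", "B", None, "D"],
    ["C", "B", None, "D"],
    ["A", "B", "E", "F"],
]
virtual_topology = build_virtual_topology(traces)
inferred_topology = merge_virtual_routers(virtual_topology)
```

In this example the two placeholder routers between B and D collapse into one, so the inferred topology has fewer links than the virtual one.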


Inferencing: our approach - Results


We compare our approach to state-of-the-art inferencing approaches:
X. Jin, W.-P. Yiu, S.-H. Chan, and Y. Wang, “Network topology inference based on end-to-end measurements,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 12, pp. 2182–2195, 2006.
B. Yao, R. Viswanathan, F. Chang, and D. Waddington, “Topology inference in the presence of anonymous routers,” IEEE Infocom, 2003.

We consider realistic networks

We also show how iTop improves the performance of failure diagnosis algorithms in the presence of partial information


Monitoring: motivation (1)


Accurate knowledge of the internal network state enables:
Performance diagnosis
Resource allocation
Efficient routing
Congestion control

Monitoring large scale networks may incur high overhead

Network tomography

Infer the internal network state from end-to-end measurements

Solve a linear system

Enables efficient monitoring by probing only a basis of the system
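A minimal sketch of the linear-system view, assuming an additive link metric such as delay, so that end-to-end measurements satisfy y = R x, where R is a 0/1 routing matrix (rows are paths, columns are links) and x holds the unknown link values. The matrix, the delays, and the helper names `select_basis` and `solve_square` are made up for illustration. Exact arithmetic with `fractions.Fraction` avoids floating-point rank issues in this small example.

```python
from fractions import Fraction


def select_basis(routing_matrix):
    """Return indices of a maximal set of linearly independent paths (rows)."""
    reduced = []  # (pivot column, reduced row), kept sorted by pivot column
    chosen = []
    for idx, row in enumerate(routing_matrix):
        r = [Fraction(v) for v in row]
        for pivot, b in reduced:
            if r[pivot] != 0:
                f = r[pivot] / b[pivot]
                r = [x - f * y for x, y in zip(r, b)]
        pivot = next((j for j, v in enumerate(r) if v != 0), None)
        if pivot is not None:  # row adds new information, keep the path
            reduced.append((pivot, r))
            reduced.sort(key=lambda item: item[0])
            chosen.append(idx)
    return chosen


def solve_square(A, y):
    """Solve A x = y by Gauss-Jordan elimination (A square and full rank)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(A, y)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[-1] for row in M]


# Hypothetical network: 3 links, 4 candidate probing paths.
R = [
    [1, 1, 0],  # path 0 uses links 0 and 1
    [0, 1, 1],  # path 1 uses links 1 and 2
    [1, 1, 1],  # path 2 uses all links
    [1, 0, 1],  # path 3 uses links 0 and 2 (redundant given paths 0-2)
]
true_delay = [2, 3, 5]  # unknown in practice, used here to simulate measurements
y = [sum(r * x for r, x in zip(row, true_delay)) for row in R]

basis = select_basis(R)  # only these paths need to be probed
link_delay = solve_square([R[i] for i in basis], [y[i] for i in basis])
```

Only the basis paths are probed; the fourth path carries no additional information because its row is a linear combination of the others.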



Monitoring: motivation (2)


Failures are common events in modern networks

Failures can significantly affect the performance of network tomography

Probing incurs a cost, and often only a maximum budget is available

Our problem: select a set of probing paths that maximizes the performance of network tomography under failures, within a limited budget


Monitoring: our approach


We translate the problem into the maximization of a submodular function under a budget constraint

We propose the algorithm RoMe:
Makes use of recent advances in submodular maximization theory
Has an approximation factor of (1 - 1/e)/2
Is optimal under the additional constraint of linear independence

Assumes knowledge of the failure distribution
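To make the budgeted submodular formulation concrete, here is a generic cost-benefit greedy sketch. It is not RoMe: the objective used here (expected number of links covered by at least one working path, with independent path failures) is a stand-in chosen only because it is monotone and submodular, and the paths, availabilities, costs, and function names are all hypothetical. The final comparison against the best single affordable path is a common safeguard in budgeted greedy schemes.

```python
def expected_coverage(selected, paths, availability):
    """Expected number of links probed by at least one available path."""
    links = {link for p in selected for link in paths[p]}
    total = 0.0
    for link in links:
        miss = 1.0  # probability that every selected path over this link is down
        for p in selected:
            if link in paths[p]:
                miss *= 1.0 - availability[p]
        total += 1.0 - miss
    return total


def budgeted_greedy(paths, availability, cost, budget):
    """Pick paths by best marginal gain per unit cost while the budget allows."""
    selected, spent = [], 0.0
    remaining = set(paths)
    while True:
        base = expected_coverage(selected, paths, availability)
        best, best_ratio = None, 0.0
        for p in sorted(remaining):
            if spent + cost[p] > budget:
                continue
            gain = expected_coverage(selected + [p], paths, availability) - base
            ratio = gain / cost[p]
            if ratio > best_ratio:
                best, best_ratio = p, ratio
        if best is None:
            break
        selected.append(best)
        spent += cost[best]
        remaining.remove(best)
    # Safeguard: a single expensive path may beat the greedy set.
    affordable = [p for p in sorted(paths) if cost[p] <= budget]
    if affordable:
        single = max(affordable, key=lambda p: expected_coverage([p], paths, availability))
        if (expected_coverage([single], paths, availability)
                > expected_coverage(selected, paths, availability)):
            return [single]
    return selected


# Hypothetical candidate paths (sets of links), availabilities, and probing costs.
paths = {"p1": {"a", "b"}, "p2": {"b", "c"}, "p3": {"a", "b", "c", "d"}}
availability = {"p1": 0.9, "p2": 0.9, "p3": 0.5}
cost = {"p1": 1.0, "p2": 1.0, "p3": 3.0}

chosen = budgeted_greedy(paths, availability, cost, budget=3.0)
```

With these numbers the two cheap, reliable paths are preferred over the long but unreliable one, because together they cover more links in expectation for less cost.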

We consider the case of unknown failure distribution

We propose the algorithm LSR (Learning with Submodular Rewards):
Reinforcement learning approach

Learns path availabilities

Performance guarantees

Figure: LSR flow diagram. Steps: Init, Select paths, Collect measurements, Update path availabilities.
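The select, measure, update loop can be sketched with a standard bandit-style learner. This is not LSR itself: it uses a UCB-style optimism bonus as a stand-in for whatever selection rule LSR applies, assumes unit probing costs (so the budget is simply the number of paths probed per round), and simulates failures as independent Bernoulli draws. The function `learn_availabilities` and all parameter values are hypothetical.

```python
import math
import random


def learn_availabilities(true_availability, budget, rounds, seed=0):
    """Estimate path availabilities online while probing `budget` paths per round."""
    rng = random.Random(seed)
    paths = sorted(true_availability)
    probes = {p: 0 for p in paths}     # Init: no knowledge of any path
    successes = {p: 0 for p in paths}

    def optimistic_index(p, t):
        # Unprobed paths are tried first; otherwise empirical mean plus a bonus
        # that shrinks as the path is probed more often.
        if probes[p] == 0:
            return float("inf")
        mean = successes[p] / probes[p]
        return mean + math.sqrt(2.0 * math.log(t) / probes[p])

    for t in range(1, rounds + 1):
        # Select paths
        chosen = sorted(paths, key=lambda p: optimistic_index(p, t), reverse=True)[:budget]
        for p in chosen:
            # Collect measurements (simulated: the probe succeeds if the path is up)
            up = rng.random() < true_availability[p]
            # Update path availabilities
            probes[p] += 1
            successes[p] += int(up)

    estimates = {p: (successes[p] / probes[p] if probes[p] else None) for p in paths}
    return estimates, probes


# Hypothetical ground truth, unknown to the learner.
truth = {"p1": 0.9, "p2": 0.8, "p3": 0.2}
estimates, probes = learn_availabilities(truth, budget=2, rounds=2000)
```

Over time the learner concentrates its probes on the paths that are usually available, while still occasionally re-checking the unreliable one.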


Monitoring: results


We compare our approach to state-of-the-art path selection algorithms:
Y. Chen, D. Bindel, H. Song, and R. H. Katz, “An algebraic approach to practical and scalable overlay network monitoring,” ACM SIGCOMM Comp. Com. Rev., 2004.

We consider realistic topologies and failure models


Recovery: motivation


Modern networks are highly interdependent:
The Internet and the smart grid
Water supply, transportation, fuel and power stations are coupled together

Interdependent networks are extremely sensitive to failures

Failures may create performance degradation

Degradation can also propagate in the surviving network

Example: the electrical blackout that occurred in Italy in September 2003


Recovery: research problems (1)


Recovery algorithms for overlay networks

Two networks sharing the same infrastructure

Failures occur in the underlay network and affect the overlay

Models an emergency urban communication network after a weapon of mass destruction attack

We aim to restore the functionality of the overlay network by repairing the underlay

Objectives & constraints:
Bandwidth
Time
Cost
Utility


Recovery: research problems (2)


Models for temporal propagation of failures

Two general interdependent networks

Failures propagate over time:
Backup batteries/generators
Local solar plant supply

Given the initial failure, our model will:
Estimate the probability that one element fails at a given time
Estimate the expected time at which one element fails
Estimate the expected number of failed elements at a given time

This information will be used to design recovery strategies

These models will be mapped to and validated with real interdependent networks
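The three estimates listed above can be approximated by Monte Carlo simulation on a toy model. The model below is an assumption, not the one proposed in this work: an element fails once any element it depends on has failed and its own backup supply (battery, generator) has run out, with the backup duration drawn uniformly from a given range. "Fails at a given time" is read here as "has failed by that time". All element names, ranges, and function names are hypothetical.

```python
import heapq
import random


def simulate_once(dependents, backup_range, initial, rng):
    """Sample one propagation; return the failure time of every failed element."""
    backup = {n: rng.uniform(lo, hi) for n, (lo, hi) in backup_range.items()}
    fail_time = {}
    heap = [(0.0, n) for n in initial]  # initial failures happen at time 0
    heapq.heapify(heap)
    while heap:
        t, node = heapq.heappop(heap)
        if node in fail_time:
            continue  # already failed earlier through another dependency
        fail_time[node] = t
        for dep in dependents.get(node, ()):
            if dep not in fail_time:
                heapq.heappush(heap, (t + backup[dep], dep))
    return fail_time


def estimate(dependents, backup_range, initial, horizon, runs=10000, seed=0):
    """Monte Carlo estimates of failure probability, mean failure time, and failed count."""
    rng = random.Random(seed)
    nodes = sorted(set(initial) | set(backup_range))
    failed_by_horizon = dict.fromkeys(nodes, 0)
    time_sum = dict.fromkeys(nodes, 0.0)
    fail_runs = dict.fromkeys(nodes, 0)
    for _ in range(runs):
        for node, t in simulate_once(dependents, backup_range, initial, rng).items():
            time_sum[node] += t
            fail_runs[node] += 1
            if t <= horizon:
                failed_by_horizon[node] += 1
    prob = {n: failed_by_horizon[n] / runs for n in nodes}
    mean_time = {n: (time_sum[n] / fail_runs[n] if fail_runs[n] else None) for n in nodes}
    expected_failed = sum(prob.values())  # expected number of failed elements at horizon
    return prob, mean_time, expected_failed


# Hypothetical coupled system: the pump and router depend on the grid,
# the tower depends on the pump. Backup durations are in hours.
dependents = {"grid": ["pump", "router"], "pump": ["tower"]}
backup_range = {"pump": (2.0, 4.0), "router": (0.0, 8.0), "tower": (1.0, 3.0)}

prob, mean_time, expected_failed = estimate(dependents, backup_range, ["grid"], horizon=4.0)
```

Every element other than the initially failed ones needs an entry in `backup_range`. The heap processes failures in time order, so an element with several dependencies fails as soon as the earliest one triggers it.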


Recovery: research problems (3)


Improve network robustness:

Re-design existing networks

Design new networks less prone to cascading effects

Models and recovery strategies for performance degradation over time

Partial knowledge

Partial control

Multiple interdependent networks


Thank you! Any questions?
