A Network-State Management Service · 2014-12-10 · A Network-State Management Service Peng Sun...

A Network-State Management Service

Peng Sun, Ratul Mahajan, Jennifer Rexford, Lihua Yuan, Ming Zhang, Ahsan Arefin

Princeton & Microsoft

Complex Infrastructure

Microsoft Azure

Number of           2010           2014
Data Center         A few          10s
Network Device      1,000s         10s of 1,000s
Network Capacity    10s of Tbps    Pbps

Variety of vendors/models/time

Management Applications

• Traffic Engineering
• Load Balancing
• Link Corruption Mitigation
• Device Firmware Upgrade
• ……

Our Question

How to safely run multiple management applications on shared infrastructure?

Naïve Solution

• Run independently
• It does not work due to 2 problems

[Diagram: Traffic Engineering, Link Corruption Mitigation, and Firmware Upgrade, each acting on Network Devices]

Problem #1: Conflict

[Diagram: topology with ToRs, AggA, AggB, Core1, Core2]

Link-corruption-mitigation adjusts traffic away from Core1

TE tunes traffic among links to Core1, 2

Problem #2: Safety Violation

[Diagram: topology with ToRs, AggA, AggB, Core1, Core2]

Link-corruption-mitigation shuts down faulty Agg A

Firmware-upgrade schedules Agg B to upgrade

Potential Solution #1

• One monolithic application
• Central control of all actions

[Diagram: Traffic Engineering, Firmware Upgrade, Link Corruption Mitigation]

Too Complex to Build

• Difficult to develop
  • Combine all applications that are already individually complicated
• High maintenance cost
  • for such huge software in practice

Potential Solution #2

• Explicit coordination among applications
• Consensus over network changes

[Diagram: Traffic Engineering, Firmware Upgrade, Link Corruption Mitigation]

Still Too Complex

• Hard to understand each other
  • Diverse network interactions

[Table: rows Traffic Engineering and Firmware upgrade; columns Application, Routing, Device Config; cell contents not recoverable]

Main Enemy: Complexity

• Application development
• Application coordination

[Diagram: Simple-to-Complex axes with labels Independent, Explicitly coordinate, Monolithic]

What We Advocate

• Loose coupling of applications
• Design principle:
  • Simplicity with safety guarantees
• Forgo joint optimization
  • Worthwhile tradeoff for simplicity
  • Applications could do it out-of-band

Overview of Statesman

• Network operating system for safe multi-application operation
• Uses network state abstraction
  • Three views of network state
  • Dependency model of states

The “State” in Statesman

• Complexity of dealing with devices
  • Heterogeneity
  • Device-specific commands

[Diagram: Network State placed above Network Devices]

State Variable Examples

State Variable           Value
Device Power Status      Up, down
Device Firmware          Version number
Device SDN Agent Boot    Up, down
Device Routing State     Routing rules
Link Admin Status        Up, down
Link Control Plane       BGP, OpenFlow, …
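
The table above suggests that network state can be held as entity-variable-value rows. A minimal sketch of that idea follows; the class name, the entity names, and the firmware version string are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateVariable:
    entity: str    # device or link name (hypothetical, e.g. "AggA")
    variable: str  # state variable name, e.g. "DeviceFirmware"
    value: object  # "up"/"down", a version string, routing rules, ...


def make_view(rows):
    """Index a list of rows by (entity, variable) so lookups are direct."""
    return {(r.entity, r.variable): r.value for r in rows}


# Illustrative rows only; the values are made up.
observed = make_view([
    StateVariable("AggA", "DevicePowerStatus", "up"),
    StateVariable("AggA", "DeviceFirmware", "6.0"),
    StateVariable("AggA-Core1", "LinkAdminStatus", "up"),
])

print(observed[("AggA", "DeviceFirmware")])
```

Keying on (entity, variable) keeps device-specific detail out of the application: it only names what it wants to read or change.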

Simplify Device Interaction

Past: Application works with Network Devices directly (Device Statistics, Device-specific cmds)
Now: Application does Read and Write on Network State, which sits in front of Network Devices
Device protocols shown: SNMP, OF, vendor API, …

Views of Network State

• Observed State: Actual state of the whole network
• Target State: Desired state to be updated on the whole network

[Diagram: Applications, Observed State, Target State, Network Devices]

Two Views Are Not Enough

One More View

• Proposed State: A group of entity-variable-values desired by an application

[Diagram: Applications, Proposed State, Observed State, Target State, Network Devices]

How Merging Works

• Combine multiple proposed states into a safe target state
• Conflict resolution
  • Last-writer-wins
  • Priority-based locking
  • Sufficient for current deployment
• Safety invariant checking
  • Partial rejection & Skip update
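
The merging step can be sketched roughly as follows, assuming each proposal carries a timestamp and the invariant is a plain predicate over the whole state. The function names, the state keys, and the "at least one Agg up" invariant are illustrative assumptions, not Statesman's actual interface.

```python
def merge(observed, proposals, is_safe):
    """Fold proposed states into a target state.

    proposals: list of (timestamp, app_name, {key: value}).
    is_safe:   predicate over a full state dict (the safety invariant).
    """
    target = dict(observed)
    rejected = []
    # Oldest first, so a later write to the same key overrides an earlier one
    # (last-writer-wins).
    for _, app, changes in sorted(proposals, key=lambda p: p[0]):
        for key, value in changes.items():
            candidate = {**target, key: value}
            if is_safe(candidate):
                target = candidate
            else:
                # Partial rejection: skip only the offending variable.
                rejected.append((app, key))
    return target, rejected


def one_agg_up(state):
    """Hypothetical invariant: never take both Aggs down together."""
    return any(state[(d, "power")] == "up" for d in ("AggA", "AggB"))


observed = {
    ("AggA", "power"): "up",
    ("AggB", "power"): "up",
    ("Core1", "routing"): "w0",
}
proposals = [
    (2, "firmware-upgrade", {("AggB", "power"): "down"}),
    (1, "link-corruption-mitigation", {("AggA", "power"): "down"}),
    (1, "te", {("Core1", "routing"): "w1"}),
    (3, "te", {("Core1", "routing"): "w2"}),
]
target, rejected = merge(observed, proposals, one_agg_up)
```

Here the mitigation application's shutdown of AggA is accepted, the later request to power down AggB is skipped because it would break the invariant, and the newest routing value wins.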

Choose Safety Invariants

• Our current choice
  • Connectivity: Every pair of ToRs in one DC is connected
  • Capacity: 99% of ToR pairs have at least 50% capacity

[Diagram: axis from Tight to Loose, annotated "Hinder application too frequently" and "Cannot protect network operation"]
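
The two invariants above can be sketched as simple checks. This is a simplified illustration: connectivity is tested with a breadth-first search over the links that are up, and the capacity check assumes the per-pair remaining-capacity fractions have already been computed elsewhere. All names are hypothetical.

```python
from collections import deque
from itertools import combinations


def connected(tors, up_links):
    """True if every ToR can reach every other ToR over links that are up."""
    if not tors:
        return True
    adj = {}
    for a, b in up_links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, queue = {tors[0]}, deque([tors[0]])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return all(t in seen for t in tors)


def capacity_ok(pair_fraction, pair_share=0.99, min_fraction=0.5):
    """pair_fraction maps a ToR pair to remaining capacity / full capacity."""
    good = sum(1 for f in pair_fraction.values() if f >= min_fraction)
    return good >= pair_share * len(pair_fraction)


tors = ["T1", "T2"]
all_up = [("T1", "AggA"), ("T2", "AggA"), ("T1", "AggB"), ("T2", "AggB")]
only_one = [("T1", "AggB")]

pairs = list(combinations(["T1", "T2", "T3", "T4", "T5"], 2))
healthy = {p: 1.0 for p in pairs}
degraded = dict(healthy)
degraded[pairs[0]] = 0.25
degraded[pairs[1]] = 0.25
```

The thresholds are parameters because the slide frames them as a choice between tight and loose invariants.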

Recap of Three-View Model

• Simplify network management

[Diagram: Statesman holds Observed State, Proposed State, and Target State between Applications and the network, annotated "What we see from the network", "What we want the network to be", and "What can be actually done on the network"]

Yet Another Problem

• What’s in Proposed State
  • Small number of state variables that the application cares about
• Implicit conflicts arise
  • Caused by state dependency

Implicit Conflict

[Diagram: topology with nodes A, B, C, D]

TE writes new value of routing state of B for tunneling traffic

Firmware-upgrade writes new value of firmware state of B

Dependency Relations

[Diagram: dependencies among state variables, grouped by entity; one build also shows the labels bgpd and SDN]

Device: PowerState, FirmwareVersion, ConfigurationState, RoutingState
Link: AdminState, ConfigurationState
PathState

Build in Dependency Model

• Statesman calculates it internally
• Only exposes the result for each state variable
  • Whether the variable is controllable
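
One way to picture the "controllable" result is a walk down a dependency chain. The sketch below assumes a linear device chain in the order the state variables appear on the dependency slide; that ordering, the `healthy` flags, and the function name are all assumptions made for illustration.

```python
# Assumed dependency order for a device, lowest layer first.
DEVICE_CHAIN = [
    "PowerState",
    "FirmwareVersion",
    "ConfigurationState",
    "RoutingState",
]


def controllable(variable, healthy):
    """A variable is controllable only if everything it depends on is usable.

    healthy: {variable_name: bool}, True when that layer is in a usable state.
    """
    idx = DEVICE_CHAIN.index(variable)
    return all(healthy.get(v, False) for v in DEVICE_CHAIN[:idx])


# Device B while its firmware is being replaced: power is fine, firmware is not.
b_during_upgrade = {
    "PowerState": True,
    "FirmwareVersion": False,
    "ConfigurationState": True,
    "RoutingState": True,
}

routing_ok = controllable("RoutingState", b_during_upgrade)
firmware_ok = controllable("FirmwareVersion", b_during_upgrade)
```

Under these assumptions the routing state of B is reported as not controllable during the upgrade, which is how the implicit conflict between TE and firmware-upgrade would surface to TE without either application knowing about the other.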

Statesman System

[Diagram: Storage Service holding Observed State, Proposed State, and Target State, with Monitor, Checker, and Updater around it]
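
A rough sketch of how these components could fit together is shown below. The roles are inferred from the component names (Monitor fills the observed state, Checker produces the target state, Updater pushes the difference to devices); the classes, functions, and firmware values are hypothetical.

```python
class Storage:
    """Stand-in for the storage service that holds the three views."""

    def __init__(self):
        self.observed = {}
        self.proposed = []  # list of {key: value} dicts from applications
        self.target = {}


def monitor(storage, devices):
    """Copy what the devices report into the observed state."""
    storage.observed = dict(devices)


def checker(storage, is_safe):
    """Merge proposals into a target state, skipping unsafe ones."""
    target = dict(storage.observed)
    for changes in storage.proposed:
        candidate = {**target, **changes}
        if is_safe(candidate):
            target = candidate
    storage.proposed = []
    storage.target = target


def updater(storage, devices):
    """Apply only the variables whose target differs from what is observed."""
    diff = {k: v for k, v in storage.target.items()
            if storage.observed.get(k) != v}
    devices.update(diff)
    return diff


# Fake devices, modeled as a dict; a real system would talk to hardware.
devices = {("BR1", "firmware"): "1.0", ("BR2", "firmware"): "1.0"}
storage = Storage()

monitor(storage, devices)
storage.proposed.append({("BR1", "firmware"): "2.0"})
checker(storage, is_safe=lambda state: True)
applied = updater(storage, devices)
```

Because all three components communicate only through the stored views, each can be reasoned about separately.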

Deployment Overview

• Operational in Microsoft Azure for 12 months

• Cover 10 DCs of 20K devices


Production Applications

• 3 diverse applications built
  • Device firmware upgrade
  • Link corruption mitigation
  • Traffic engineering
• Finished within months
• Only thousands of lines of code

Case #1: Resolve Conflict

Inter-DC TE & Firmware-upgrade

[Diagram: four data centers with border routers: DC 1 (BR 1, BR 2), DC 2 (BR 3, BR 4), DC 3 (BR 5, BR 6), DC 4 (BR 7, BR 8). DC = Data Center, BR = Border Router]

[Plot: timeline annotated with the events below]

Firmware-upgrade acquires lock of BR1

TE fails to acquire lock, and moves traffic away

BR1 firmware upgrade starts

BR1 firmware upgrade ends. Lock released.

TE re-acquires lock, and moves traffic back

Case #1 Summary

• Each application:
  • Simple logic
  • Unaware of the other
• Statesman enables:
  • Conflict resolution
  • Necessary coordination
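
The lock sequence in Case #1 can be sketched with a small priority-based lock table. The preemption rule (a strictly higher priority takes the lock) and the numeric priorities are assumptions made for illustration; the talk only states that locking is priority-based.

```python
class LockTable:
    """Per-entity locks; a request wins if the lock is free, already held by
    the same application, or held at a strictly lower priority (assumed)."""

    def __init__(self):
        self.holders = {}  # entity -> (app, priority)

    def acquire(self, entity, app, priority):
        holder = self.holders.get(entity)
        if holder is None or holder[0] == app or priority > holder[1]:
            self.holders[entity] = (app, priority)
            return True
        return False

    def release(self, entity, app):
        holder = self.holders.get(entity)
        if holder is not None and holder[0] == app:
            del self.holders[entity]


locks = LockTable()

# Hypothetical priorities: firmware-upgrade outranks TE.
upgrade_got = locks.acquire("BR1", "firmware-upgrade", priority=2)
te_first_try = locks.acquire("BR1", "te", priority=1)  # TE moves traffic away

locks.release("BR1", "firmware-upgrade")               # upgrade ends
te_second_try = locks.acquire("BR1", "te", priority=1)  # TE moves traffic back
```

Neither application inspects the other; each only sees whether its own lock request succeeded.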

Case #2: Maintain Capacity Invariant

Firmware-upgrade & Link-corruption-mitigation

[Diagram: Core, Agg, and ToR layers; Pods 1 through 10, each with Aggs 1 to 4 and ToRs 1 to n; Cores 1 to 4; a link is marked as corrupting packets]

[Plot: timeline annotated with the events below]

Upgrade proceeds at normal speed in Pod 3 and 5

Upgrade in Pod 4 is slowed down by checker due to lost capacity

Case #2 Summary

• Statesman:
  • Automatically adjusts application progress
  • Keeps the network within safety requirements
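
The slowdown in Case #2 can be illustrated with a toy admission rule. The sketch assumes, as a deliberate simplification, that a pod's capacity is proportional to the number of Aggs that are up; the invariant in the talk is defined over ToR pairs, and the function name and numbers here are hypothetical.

```python
def admit_upgrades(total_aggs, already_down, requested, min_fraction=0.5):
    """Return how many of the requested Agg upgrades may start right now.

    Each admitted upgrade takes one more Agg offline; admission stops as soon
    as the remaining fraction of Aggs would drop below min_fraction.
    """
    admitted = 0
    for _ in range(requested):
        up_after = total_aggs - already_down - admitted - 1
        if up_after / total_aggs >= min_fraction:
            admitted += 1
        else:
            break
    return admitted


# A healthy pod with 4 Aggs can upgrade two at a time.
healthy_pod = admit_upgrades(total_aggs=4, already_down=0, requested=2)

# A pod that already lost one Agg's worth of capacity gets only one slot.
degraded_pod = admit_upgrades(total_aggs=4, already_down=1, requested=2)
```

The upgrade application submits the same request in both pods; the check alone decides that the degraded pod proceeds more slowly.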

Conclusion

• Need network operating system for multiple management applications
• Statesman
  • Loose coupling of applications
  • Network state abstraction
• Deployed and operational in Azure

Thanks!

Questions?

Check paper for related works
