Study on Self-Healing and Self-Optimization in Software Defined Networking for Heterogeneous Networks. Kevin Andrea. Image: Birmingham Rail & Locomotive, http://www.bhamrail.com/frogswitch/turnouts.asp




Page 1

Study on Self-Healing and Self-Optimization in Software Defined Networking for Heterogeneous Networks

Kevin Andrea

Image: Birmingham Rail & Locomotive, http://www.bhamrail.com/frogswitch/turnouts.asp

Page 2

Overview
• Introduction
• Background
• Software Defined Networking
  • Overview
  • Current Applications
  • Issues
• New Networking Landscape
  • IoT, VANET, Smart Grid
  • Issues
• Solutions
  • Self-Healing
  • Self-Optimization
• Conclusion
• References

Image: Institute for Communication Systems, University of Surrey, UK.
http://www.surrey.ac.uk/ics/activity/facilities/futureinternet/

Page 3

Introduction

• Problem Statement
  • Current networking infrastructure is complex
  • Managing it requires Autonomic Systems
  • IBM:
    • Self-Configuration
    • Self-Protection
    • Self-Healing
    • Self-Optimization
• Exploration of new networking technologies
  • Software Defined Networks

Image: Ivan Pepelnjak, NIL Data Communications, TechTarget
http://searchtelecom.techtarget.com/feature/BGP-essentials-The-protocol-that-makes-the-Internet-work

The programmability feature of SDN can be used to achieve the self-* attributes of autonomic systems. The combination of an autonomic system and SDN can be used to control and manage the network infrastructure. The purpose of both technologies is to overcome the growing complexity of the network. Applying autonomic properties to SDN can unleash the true potential of future networks, and the reliability of SDN can be improved by applying the self-healing principle. SDN is described by the ONF [1] as an architecture in which the control and data planes are decoupled, network intelligence and state are logically centralized, and the underlying network infrastructure is abstracted from the applications. Figure 1 illustrates the layered architecture of SDN.

Figure 1: Architecture of SDN. (Layers, top to bottom: Application Layer with Network Applications; Control Plane Layer with the SDN controller / Network Operating System; Data Plane Layer with Network Switches. The control and data planes communicate via OpenFlow.)

SDN is a layered architecture that separates the data plane from the control plane. Separating the two planes and centralizing the intelligence simplifies the management of the network, but it also poses reliability issues. A network failure can hamper traffic forwarding, resulting in packet loss and service unavailability. In an SDN-based network, faults can be categorized into three areas: (1) the data plane, where a switch or a link between switches fails; (2) the control plane, where a link connecting the controller and a switch fails; and (3) the controller, where the controller machine itself fails. Much research explores the services and functionality that SDN can provide to leverage the network, but fault management in SDN is still not well explored. Some solutions have been proposed for handling failures in SDN, but they are not practical in an actual network given the enormous traffic.

In this paper, we present the existing work in the fault management domain for SDN and discuss its limitations in terms of applicability to present networks. We also propose an optimized self-healing (OSH) framework for SDN which ensures an optimal state and continuous availability of the network after recovering from a failure. We then present the functionality of the rapid failure recovery scheme. A performance analysis based on an analytical model of the rapid failure recovery is given, considering factors such as the failure recovery time and the memory size required for the backup flow rules.

2. PREVIOUS WORK

There has been some research in the area of fault management for SDN. Most of it relies on the traditional, well-known approaches to failure recovery: restoration and protection. In restoration, alternative paths are established after a failure occurs. In protection, the alternative paths are established before a failure occurs in the network. Most of the existing schemes [7] [8] [9] [11] [12] rely on adding flow entries to install a backup path for each of the disrupted flows on the failed link. In Andrea S. et al. [13], for each new flow the controller installs a backup path for each link that is part of the primary path. This solution is not practical for a network with thousands of flows. According to [18] [19], in a modern data center with 100,000+ compute nodes, the number of flows in the network will be in the millions. In such a case, installing a backup flow rule for each flow may overload the centralized controller and create processing bottlenecks. With present switch hardware, storing millions of flow rules is impractical. Additional TCAM memory can be used to store the OpenFlow rules, but because of its cost, commercial switches do not support more than 5,000 to 6,000 TCAM-based flow rules [14].
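To put the scale argument in concrete terms, the short Python sketch below compares per-flow backup rules against a single aggregated backup rule per link under the TCAM budget cited above. The flow and link counts are illustrative assumptions, not measurements from [14], [18], or [19].

```python
# Back-of-the-envelope comparison of backup-rule overhead on one switch.
# All counts below are illustrative assumptions for this sketch.

TCAM_RULE_BUDGET = 6000   # upper end of the TCAM capacity cited in [14]
active_flows = 50_000     # assumed number of flows traversing the switch
switch_links = 16         # assumed number of links attached to the switch

# Per-flow protection: one pre-installed backup entry for every flow.
backup_rules_per_flow_scheme = active_flows

# Aggregated protection: one backup entry per outgoing link, shared by
# every flow that uses that link.
backup_rules_per_link_scheme = switch_links

for name, rules in (("per-flow backup", backup_rules_per_flow_scheme),
                    ("per-link backup", backup_rules_per_link_scheme)):
    status = "within" if rules <= TCAM_RULE_BUDGET else "over"
    print(f"{name}: {rules:6d} extra rules ({status} a {TCAM_RULE_BUDGET}-rule TCAM budget)")
```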

In a path protection scheme, immediate local recovery is not possible, because after the link failure the switch that reroutes traffic from the primary path to a backup path must first receive the failure notification. The link failure notification time adds to the total recovery time, which results in delayed recovery and ultimately higher packet loss. After switching to the backup path due to a failure, the flow entries for the disrupted flows become obsolete and require extra controller involvement to explicitly remove them from the flow table. The authors of [10] [13] rely on a restoration mechanism for failure recovery. A restoration mechanism requires more time to recover than a protection mechanism because the controller has to install flow rules in all the switches that are part of the alternate path. It also puts a significant load on the controller, which further delays the recovery process.

CORONET [11] relies on LLDP messages for detecting the changes in topology caused by a failure, but processing the LLDP monitoring messages overloads the controller. Yang Y. et al. [12] calculate an optimal backup path after a link failure is detected and then use a restoration mechanism for link recovery. The path calculation and installation process delays recovery. Considering all the issues in existing research, we propose an OSH approach for failure recovery in SDN. It addresses the issues present in existing schemes and provides an approach to optimally recover a network from failures. Our proposed RR (rapid recovery) scheme does not need full-state controller intervention when a failure occurs, which reduces the load on the controller. Thus, it makes the failure recovery process faster by eliminating the overhead of communication between switches and the controller. Because of the reduced communication, the overall congestion in the network also decreases significantly.

3. PROPOSED DESIGN ARCHITECTURE AND MECHANISM

3.1 Autonomic Self-healing Architecture for SDN

We propose a self-healing system for SDN which is capable of optimally handling failures in SDN. Our proposed architecture of the OSH mechanism is shown in Figure 2. The architecture is divided into a data plane and a control plane.

Image: [2]

Page 4

Introduction

• Problem Statement
  • Current networking infrastructure is complex
  • Managing it requires Autonomic Systems
  • IBM:
    • Self-Configuration
    • Self-Protection
    • Self-Healing
    • Self-Optimization
• Exploration of new networking technologies
  • Software Defined Networks

Image: Ivan Pepelnjak, NIL Data Communications, TechTarget
http://searchtelecom.techtarget.com/feature/BGP-essentials-The-protocol-that-makes-the-Internet-work


Page 5

Background – Traditional Networks

• Network Flow
  • Sequence of Packets
  • Source to Destination
• Routing Protocols
  • Link State Routing Protocols
    • OSPF (Dijkstra Algorithm; see the sketch below)
    • Connectivity Graph of Routers
  • Distance Vector Routing Protocol
    • RIP (Bellman-Ford Algorithm)
    • Neighbor Graph of Routers
  • Exterior Gateway Protocol
    • BGP
    • Routing Between Autonomous Systems
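As a reminder of what a link-state protocol such as OSPF computes over the router connectivity graph, here is a minimal Dijkstra sketch in Python. The toy topology and link costs are illustrative only and do not come from the slides.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source over a weighted connectivity graph."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbor, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist

# Toy router connectivity graph; link costs are made up for illustration.
topology = {
    "R1": {"R2": 1, "R3": 4},
    "R2": {"R1": 1, "R3": 2},
    "R3": {"R1": 4, "R2": 2},
}
print(dijkstra(topology, "R1"))   # {'R1': 0, 'R2': 1, 'R3': 3}
```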

[Diagram: flow entries for A -> C and C -> A installed across Switch A, Switch B, and Switch C]

Page 6

Background – Traditional Networks

• Traditional Network Router
  • Maintains a mapping between Destination Address and Port
• Encapsulates Two Functions
  • Control
    • Experts configure router
    • Protocols build the Routing Table
  • Data Transfer
    • Uses the Routing Table
    • Forwards Flows (Packets)

Image: Computer Desktop Encyclopedia, The Computer Language Company, Inc.
http://homepages.uel.ac.uk/u0116401/RouterDefinition.htm

Page 7

Background – Traditional Networks

• Traditional Network Router
  • Maintains a mapping between Destination Address and Port
• Encapsulates Two Functions
  • Control
    • Experts configure router
    • Protocols build the Routing Table
  • Data Transfer
    • Uses the Routing Table
    • Forwards Flows (Packets)

Image: Birmingham Rail & Locomotive, http://www.bhamrail.com/frogswitch/turnouts.asp
(Image annotations: Control; Data Line; Data Line; Data Transfer)

Page 8

Software Defined Network (SDN)

• Spearheaded by Sun in 1995
  • Designed to allow software switching for networks.
  • Provided as a means for experimenting with new algorithms and protocols.
• Decouples Control and Data
  • All control logic is moved into a centralized controller.
  • Hardware is replaced with Software.
  • Operators have a global view of the network state.

Image: [2]


Page 9

Software Defined Network (SDN)

• Ease of Configuration
  • Autonomic Systems are able to change any network node's forwarding rules.
  • Without having to configure individual switches.
• Fundamentally new Architecture
  • If an entry exists in the Data Plane Flow Table, the packet is forwarded.
  • If no entry, the packet is sent to the Control Plane, which generates a new rule (see the sketch below).
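A minimal Python sketch of that table-miss behavior; the match fields, the stand-in controller logic, and the port numbers are simplified assumptions, not OpenFlow API calls.

```python
# Simplified model of an SDN switch's flow-table lookup (illustrative only).

flow_table = {}   # match key (src, dst) -> output port

def controller_compute_rule(packet):
    """Stand-in for the control plane: decide a rule for an unmatched packet."""
    # A real controller would consult topology and policy; here we always pick port 1.
    return (packet["src"], packet["dst"]), 1

def handle_packet(packet):
    match = (packet["src"], packet["dst"])
    if match in flow_table:                        # table hit: forward in the data plane
        return f"forward out port {flow_table[match]}"
    match, port = controller_compute_rule(packet)  # table miss: punt to the controller
    flow_table[match] = port                       # controller installs the new rule
    return f"new rule installed, forward out port {port}"

print(handle_packet({"src": "10.0.0.1", "dst": "10.0.0.2"}))  # miss, rule installed
print(handle_packet({"src": "10.0.0.1", "dst": "10.0.0.2"}))  # hit, forwarded locally
```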

Image: [2]


Page 10

Software Defined Network (SDN)

• Features
  • Fast convergence times when powered on.
  • Centralized controller provides fine-grained control for managing complex infrastructures.
  • Simplifies network devices.
    • Any device can now be a network device.
    • Simple packet forwarders.

Image: [2]


Page 11

Software Defined Network (SDN)

• Current Applications
  • Google uses SDN as its internal backbone interconnecting their Data Center Networks (DCNs)
  • 93% of mobile providers expect Mobile SDN to be globally implemented within 5 years (by 2019)
  • Current papers have focused on DCN applications and optimizations.

Image: [2]


Page 12

Software Defined Network (SDN)

• Motivating Concerns
  • Existing SDN solutions for DCNs assume TCP, Anycast traffic that is loosely correlated.
  • Existing SDN work assumes static, homogeneous devices.
  • SDN also poses reliability concerns.
    • Faults may occur at the Controller Machine, the Control Plane, or the Data Plane
    • Fault Management research has not been well explored.

Image: [2]


Page 13

Software Defined Network (SDN)

• Resilience
  • SDNs currently use existing solutions: Protection and Restoration
  • Existing schemes add backup paths for each flow entry.
    • For non-DCN operations, this is simply not practical and would overload the controller.
  • After a failure, restoration techniques assess the routers, and the controller must install new rules on each switch.
    • Overloads the controller and delays recovery.

Image: [2]


Page 14

Modern Networking Landscape

• Modern Networking
  • Wireless
  • Mobile
  • Heterogeneous
  • WiFi, Bluetooth, NFC, Ethernet, USB, 802.15.4, ZigBee, 5G, mmWave, 802.11p VANET
• Applications
  • Vehicle Communications
    • Vehicle-to-Vehicle Updates
    • DoT-to-Vehicle Updates
    • Media Streaming to Vehicles
    • Vehicle-to-DoT Sensor Data
    • Cell-to-Vehicle Communications

Image: Institute for Communication Systems, University of Surrey, UK.
http://www.surrey.ac.uk/ics/activity/facilities/futureinternet/

Page 15

Modern Networking Landscape

• Applications
  • Internet of Things (IoT)
    • Field built on heterogeneous wireless devices
    • Very difficult to provision resources in this environment
    • Devices deployed in an uncoordinated manner
• Multi-Objective Optimization
  • QoS in a DCN focuses on single optimizations
  • QoS for IoT adds in delay, jitter, packet loss, throughput
    • User perceivable

Image: Smart Home Energy http://smarthomeenergy.co.uk/what-smart-home

Page 16

SDN for Modern Networking

• Benefits
  • Can differentiate flow scheduling over ad-hoc, heterogeneous paths.
  • Allows for opportunistic exchanges over the best networking interface.
    • Vehicle-Vehicle over 802.11p
    • Vehicle-DoT over cellular LTE
  • Can route flows based on priority or other categorizations (a sketch of such a policy follows below).
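As a toy illustration of priority- and category-based flow scheduling over heterogeneous interfaces, the following self-contained Python sketch picks an egress interface for a flow. It is not taken from any of the cited works; all interface names, categories, and thresholds are hypothetical assumptions.

```python
# Illustrative sketch only: pick an egress interface for a flow based on its
# category and priority, in the spirit of SDN flow differentiation over
# heterogeneous links. Names, categories, and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class Flow:
    src: str
    dst: str
    category: str   # e.g. "v2v-safety", "dot-update", "media"
    priority: int   # higher = more important

# Assumed mapping from flow category to preferred heterogeneous interfaces.
PREFERRED_IFACES = {
    "v2v-safety": ["wlan-80211p", "lte"],   # prefer direct 802.11p, fall back to LTE
    "dot-update": ["lte", "wlan-80211p"],
    "media":      ["lte"],
}

def select_interface(flow: Flow, up_interfaces: set[str]) -> str | None:
    """Return the first preferred interface that is currently up, letting
    high-priority flows opportunistically use any available link."""
    for iface in PREFERRED_IFACES.get(flow.category, []):
        if iface in up_interfaces:
            return iface
    if flow.priority >= 5 and up_interfaces:
        return sorted(up_interfaces)[0]
    return None

if __name__ == "__main__":
    f = Flow("veh-17", "veh-42", "v2v-safety", priority=7)
    print(select_interface(f, {"lte"}))  # 802.11p down, so it falls back to "lte"
```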

Image: Smart Home Energy http://smarthomeenergy.co.uk/what-smart-home

Page 17

Problems with SDN for IoT: Self-Healing

• Motivating Application
  • Smart Grid
    • Packet flows control relays at power stations.
    • Fast recovery is critical after a communication disturbance.
      • Short downtimes could lead to overloading nearby power stations, causing cascade failures.
  • 2003 Blackout
    • Alarm did not sound, operators failed to redistribute power, 10 million people affected.

Image: "Map of North America, blackout 2003" by Lokal_Profil. Licensed under CC BY-SA 2.5 via Commons -

https://commons.wikimedia.org/wiki/File:Map_of_North_America,_blackout_2003.svg#/media/File:Map_of_North_America,_blackout_2003.svg

Page 18

Problems with SDN for IoT: Self-Healing

• Rapid Recovery [2]
  • Problem
    • Recovery takes too long using the restoration technique.
    • Not feasible to store backup rules for each flow with the protection technique.
  • Solution
    • Treat all flows through the same link with a single rule.
    • Pre-allocate a single backup rule.
    • On failure notification, immediately switch all flows over that link to the backup.

The control interfaces communicate with the data plane based on the demanded service. The network statistics module provides the interface to monitor network flow information and statistics. The topology discovery and management module manages information related to the network topology. The policy module describes how a network element should act based on the defined policies. The aim of the load balancing module is to estimate the load on the network and provide input to the OSH module. The routing module calculates the shortest path for each backup link. The OSH module queries the network management modules to collect network information. Based on the information collected, it calculates the optimal path for achieving the improved recovery.

[Figure 2: Proposed architecture of the optimized self-healing mechanism. The control plane hosts the SDN controller (NOX) with the Rapid Recovery module, Optimized Self-healing Management, and switch-level management (Flow, Action, and Resource Management), supported by network management applications (Topology Discovery & Management, Policy Management, Load Balancing, Routing Management, Network Statistics Management). The data plane contains the Forwarding Information Base, OpenFlow, and the Notification Module.]

When a new component is added to the network, the modules at the switch-level management handle the job of installing flow entries and configuring it. The OSH module is invoked by a failure notification from the notification module of the data plane. By coordinating with other network applications, it finds the optimal path for all the flows affected by the link failure. The optimal path is constructed to provide a prescribed level of QoS for all existing services after a failure occurs within the network. However, the OSH module requires some time to compute a new optimal path. Therefore, to achieve fast failure recovery, the RR module is utilized. The RR module is capable of autonomously handling a link failure without much intervention of the controller by using a link protection scheme. Once the network has quickly recovered from the failure using the RR mechanism of the data plane, the OSH module tries to optimize the recovery.

3.2 Rapid Failure Recovery in OpenFlow Networks

For RR, we used a link protection scheme, which overcomes the challenges of path protection schemes such as deferred fault recovery, packet loss, and increased controller involvement in handling the failure. In link protection, when a failure occurs, the switch connected to the failed link routes the connection around the failed link to a neighboring node that is part of a pre-computed shortest backup path. Therefore, the switches directly connected to the failed link perform the immediate local recovery, which results in minimum recovery time and less packet loss. To implement the link protection scheme in a centralized OpenFlow [6] enabled network, we applied the group table concept [17]. The ability of a flow entry to point to a group table provides additional ways of forwarding. A group entry in a group table is associated with multiple action buckets, where each action bucket contains a set of actions to execute and its associated parameters.

The flow entries in a flow table point to a group with a unique group identifier. A group identifier uniquely identifies the group. Each group entry consists of a group identifier, a group type, a counter, and a number of action buckets. The counter field counts the packets processed by the group. An action bucket contains a set of actions to execute and associated parameters. The group type determines the way in which the action buckets are executed. For the implementation of link protection, we use the fast failover group type. A group entry of the fast failover group executes a set of actions based on the alive-status value of the port [16]. The fast failover group eliminates the need for controller involvement in performing RR. In a fast failover group, if the action bucket alive-status value is 0xffffffffL, the bucket is declared unavailable. In this case, the group table executes the next available action bucket. The status of the action bucket depends on the port status.

[Figure 3: Group table concept. Flow table: rule 1 matches IP destination 192.168.1.1, ingress port 1, EtherType 0x0800 and forwards packets to Group #1; rule 2 matches IP destination 192.168.1.2, ingress port 2, EtherType 0x0800 and also forwards to Group #1. Group table: Group 1 is of type Fast Failover with action buckets B1 (output to port 1, the primary link) and B2 (push a VLAN tag and output to port 2, the backup path).]

When a link failure occurs, the fast failover group executes the next available action bucket, which outputs the packet to an intermediate switch of the backup path. Therefore, the switch autonomously performs immediate link recovery without any intervention of the controller. The functionality of the fast failover group is illustrated in Figure 3. On receipt of a packet, an OpenFlow switch extracts its match fields and starts the flow table lookup. If the packet's IP destination address, ingress port, and EtherType field match flow rule 1, the packet is forwarded to group 1. Similarly, a flow of packets with destination IP address 192.168.1.2, ingress port 2, and EtherType 0x0800 matches flow rule 2 and is forwarded to group 1. The group table executes action bucket B1 and outputs the incoming packets forwarded by flow rules 1 and 2 to output port 1. In case of a link failure, the status of bucket B1 becomes unavailable. The group table detects the changed status of action bucket B1 and executes the next available action bucket B2. Action bucket B2 executes its associated actions, which push a VLAN tag onto the packet and forward it to output port 2.
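The behaviour described above maps directly onto OpenFlow 1.3 fast-failover groups. The following minimal sketch (not code from [2]) uses the Ryu controller framework and assumes a `datapath` handle is available; the port numbers, group id, VLAN id, and IP address are hypothetical values chosen to mirror Figure 3.

```python
# Illustrative sketch (not from [2]): install an OpenFlow 1.3 fast-failover
# group and a flow entry pointing at it, using the Ryu framework.
# Port numbers, group id, VLAN id, and IP address are hypothetical assumptions.

def install_link_protection(datapath):
    ofp = datapath.ofproto            # OpenFlow 1.3 constants
    parser = datapath.ofproto_parser

    primary_port, backup_port, backup_vlan = 1, 2, 2   # assumed values

    # Bucket B1: forward on the primary link; used while primary_port is alive.
    b1 = parser.OFPBucket(
        watch_port=primary_port,
        actions=[parser.OFPActionOutput(primary_port)])

    # Bucket B2: push a VLAN tag marking the backup path, then use the backup port.
    b2 = parser.OFPBucket(
        watch_port=backup_port,
        actions=[parser.OFPActionPushVlan(0x8100),
                 parser.OFPActionSetField(vlan_vid=(0x1000 | backup_vlan)),
                 parser.OFPActionOutput(backup_port)])

    # Fast-failover group: the switch itself falls back to B2 when B1's port goes down.
    datapath.send_msg(parser.OFPGroupMod(
        datapath, ofp.OFPGC_ADD, ofp.OFPGT_FF, group_id=1, buckets=[b1, b2]))

    # Flow rule: IPv4 traffic to 192.168.1.1 arriving on port 1 is sent to group 1.
    match = parser.OFPMatch(in_port=1, eth_type=0x0800, ipv4_dst="192.168.1.1")
    inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS,
                                         [parser.OFPActionGroup(1)])]
    datapath.send_msg(parser.OFPFlowMod(
        datapath=datapath, priority=100, match=match, instructions=inst))
```

Because the failover is evaluated entirely on the switch, no controller round trip is needed when the primary port's status changes.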

Image: [2]

Page 19

Problems with SDN for IoT: Self-Healing

• Optimized Self-Healing (OSH) [2]
  • Topology Discovery manages the routing topology in place.
  • Load Balancing module estimates the network load.
  • Routing Module calculates shortest paths per flow.
  • On Failure (a sketch of this control loop follows below)
    • Receive notification from the Data Plane.
    • Calculate a new optimal path.
    • Validate the path with respect to QoS.
    • Send new flow routing information.
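To make the On-Failure sequence concrete, here is a minimal, self-contained Python sketch (not code from [2]) of the controller-side OSH reaction: it is invoked by a data-plane failure notification, recomputes a shortest path that excludes the failed link, checks an assumed QoS bound, and returns the flow updates to push. The graph representation, per-link latencies, and QoS threshold are all assumptions.

```python
# Illustrative sketch (not from [2]): controller-side OSH reaction to a
# link-failure notification. Topology, latencies and QoS bound are assumed.
import heapq

def shortest_path(links, src, dst, excluded=frozenset()):
    """Dijkstra over links = {(u, v): latency_ms}, skipping excluded links."""
    adj = {}
    for (u, v), w in links.items():
        if (u, v) in excluded or (v, u) in excluded:
            continue
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist, heap = {src: (0.0, [src])}, [(0.0, src, [src])]
    while heap:
        d, node, path = heapq.heappop(heap)
        if node == dst:
            return d, path
        for nxt, w in adj.get(node, []):
            nd = d + w
            if nxt not in dist or nd < dist[nxt][0]:
                dist[nxt] = (nd, path + [nxt])
                heapq.heappush(heap, (nd, nxt, path + [nxt]))
    return float("inf"), None

def on_link_failure(failed_link, flows, links, qos_latency_ms=50.0):
    """Recompute a path for each flow affected by the failed link, keeping
    only paths that satisfy the (assumed) latency bound."""
    updates = []
    for flow_id, (src, dst, old_path) in flows.items():
        if failed_link not in zip(old_path, old_path[1:]):
            continue                      # flow not affected by this failure
        latency, new_path = shortest_path(links, src, dst, excluded={failed_link})
        if new_path is not None and latency <= qos_latency_ms:
            updates.append((flow_id, new_path))   # would become FlowMod messages
    return updates

if __name__ == "__main__":
    links = {("B", "C"): 1.0, ("B", "E"): 2.0, ("E", "F"): 2.0, ("F", "C"): 2.0}
    flows = {"flow1": ("B", "C", ["B", "C"])}
    print(on_link_failure(("B", "C"), flows, links))  # [('flow1', ['B', 'E', 'F', 'C'])]
```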


Image: [2]

Page 20

Problems with SDN for IoT: Self-Healing

• Evaluation of Self-Healing
  • Model Evaluation
    • 99% reduction in the number of backup flows that need to be stored.
    • Immediate restoration of service with a possibly sub-optimal backup path.
    • Optimal backup path is pushed after the controller calculates it.

[Figure 4: Link protection mechanism. Example topology with switches A-F: flows 1, 2, and 3 traverse the link between B and C (Link ID #1 CB / #2 BC), with E and F forming the backup path. Flow tables 0 and 1 and group tables for switches B and C are shown: flow entries match on IP source/destination and EtherType 0x0800 and forward packets to fast-failover Group #1, whose bucket B1 outputs on the primary link and whose bucket B2 pushes a VLAN tag (1 or 2) and outputs on the backup path; flow table 0 strips the VLAN tag of backup-path traffic and forwards it to flow table 1.]

In reactive link protection, the controller can intervene to remove the outdated flow rules to speed up recovery. In this case, when a failure is detected (TFD), a failure notification is sent to the controller. The controller searches for the disrupted flow (TDFS) and sends a flow modification message (TFM) to the switch for modifying the outdated flow rules. This process is repeated for each disrupted flow. When the switch receives the flow modification message, all matching rules in the flow table are modified in time TUPDATE. The propagation time (TPROP) of the failure notification message from the switch to the controller also contributes to the recovery process. The recovery time (TR) taken by this scheme is expressed by Equation 2.

T_R = T_{FD} + \sum_{f=0}^{N} (T_{DFS,f} + T_{FM,f} + T_{UPDATE,f}) + T_{PROP}    (2)

Our proposed scheme for RR is reactive in nature. After failure detection, the affected switch handles the flow rerouting without any controller intervention. Therefore, the time complexity of our proposed RR protection scheme depends on the time a switch takes to detect a failure (TFD) and the time to change the alive status (TAS) of the group entries that correspond to the failed link. According to Sharma et al. [8], a switch takes approximately 5.8 microseconds to modify the alive status of one group entry. The recovery time (TR) taken by our RR scheme is calculated by Equation 3.

T_R = T_{FD} + T_{AS}    (3)

4.2 Calculating Memory Size Requirement

In the traditional protection schemes, the controller pre-installs the flow rules for the backup path. For each disrupted flow of a primary link, a backup flow rule must be present in the switches along the backup path. Therefore, the number of flow rules in a switch of the backup path (NBF) is equal to the number of disrupted flows (NDF), i.e., NBF = NDF. This approach is not suitable for a network with thousands of flows because of the memory constraints of switch hardware. For 100+ disrupted flows, our RR module reduces the backup path's flow entries in a switch by more than 99 percent and saves switch memory. This contributes to a smaller flow table and faster table lookup. The RR module compresses all flows having the same output port corresponding to the failed link into one wildcard flow rule. Therefore, the number of flow rules in an intermediate switch of the backup path (NBF) is equal to 1 (NBF = 1). Figure 5 graphically shows the compression achieved by the RR module (Nbf-RR) over the traditional scheme (Nbf); the graph is plotted using the above observation.
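As a worked illustration of Equations (2) and (3) and of the memory comparison above, here is a small self-contained Python sketch. The 5.8 microsecond alive-status update time is the figure cited from [8]; every other timing value is a hypothetical assumption.

```python
# Worked illustration of Equations (2) and (3) and the backup-rule counts.
# The 5.8 microsecond alive-status update time is the figure cited from [8];
# every other timing value below is a hypothetical assumption.

def recovery_time_reactive(t_fd, per_flow_times, t_prop):
    """Equation (2): T_R = T_FD + sum over disrupted flows of
    (T_DFS,f + T_FM,f + T_UPDATE,f) + T_PROP (all times in seconds)."""
    return t_fd + sum(t_dfs + t_fm + t_update
                      for (t_dfs, t_fm, t_update) in per_flow_times) + t_prop

def recovery_time_rr(t_fd, t_as=5.8e-6):
    """Equation (3): T_R = T_FD + T_AS, with T_AS ~= 5.8 us per [8]."""
    return t_fd + t_as

if __name__ == "__main__":
    n_disrupted = 1000                       # assumed number of disrupted flows
    t_fd = 1e-3                              # assumed 1 ms failure detection time
    per_flow = [(50e-6, 100e-6, 20e-6)] * n_disrupted   # assumed per-flow costs
    t_prop = 0.5e-3                          # assumed 0.5 ms notification propagation

    print("Reactive link protection T_R: %.3f ms" %
          (1e3 * recovery_time_reactive(t_fd, per_flow, t_prop)))
    print("Rapid Recovery (RR) T_R:      %.3f ms" % (1e3 * recovery_time_rr(t_fd)))

    # Backup flow entries per backup-path switch: N_BF = N_DF traditionally,
    # versus a single wildcard rule with RR (the >99% reduction for 100+ flows).
    n_bf_traditional, n_bf_rr = n_disrupted, 1
    print("Backup rules: traditional =", n_bf_traditional, " RR =", n_bf_rr)
```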

5. CONCLUSION AND FUTURE WORK

An OSH model for SDN is proposed in this paper. The proposed model is based on the autonomic principles of autonomous systems. The aim of our model is to achieve optimal failure recovery. We presented the analytical model of our RR scheme and showed how it can achieve a quick recovery with low disruption time while reducing the backup flow entries per switch. The backup-path flow aggregation enabled the 99 percent reduction in flow entries per switch.

Recovery time for standard reactive link protection = failure detection time + propagation time + the sum, over all disrupted flows, of (the time to find the disrupted flow, the time to send the modification message, and the time to update the flow table).


Recovery time for Rapid Recovery = failure detection time + time to change the alive status.

Changing the alive status takes approximately 5.8 microseconds.

Page 21


Problems with SDN for IoT: Self-Healing

Image: [2]

Page 22

Problems with SDN for IoT: Self-Healing

• Evaluation of Self-Healing
  • Smart Grid Approach [5]
    • Requires coordination between the Data Plane and Control Plane post-failure for backup route installation.
    • Would benefit from [2].
• Novel Additions for Self-Healing
  • Multiple QoS Levels
    • New flows are added to links based on the ability to ensure QoS for all current flows.
    • Lower-priority flows may be moved to different links.
    • Dedicated flows get their own links.

[Fig. 5: Scenario 2a: Load Management at Switches 2 and 3. Data rate [Mbps] over time [s] at Switch 2 (top) and Switch 3 (bottom) for MMS traffic, background data transfer, background real-time traffic, and total traffic; annotations mark the re-routing of background data traffic and the minimum data rate guarantee for Smart Grid traffic.]

to enqueue traffic flows using these queues, with the QoS requirements of the different traffic classes mapped to priorities. The traffic class of an arriving packet is identified by matching against specified rules, such as the packet's application-layer protocol. Table I contains the traffic classes along with the assumptions on minimum data rate and latency requirements considered for this experiment. Both of the following QoS approaches have been evaluated by successively inserting the traffic flows detailed hereafter into the network (cf. Figure 2): background data transfer from Server 2 to Client 1 (TCP/FTP-based), background real-time traffic from Server 1 to Client 1 (UDP-based), and MMS reports from Server 2 to Client 1. In addition, the network is set to a 100 Mbps maximum data rate, of which 5 Mbps are reserved for network control at all times, leaving an effective data rate of 95 Mbps.

Scenario 2a: Reserved Data Rate for Time-Critical Services

This QoS approach aims at reserving data rate for time-critical services by guaranteeing minimum data rates for each traffic flow and re-routing traffic flows with lower priority whenever the available data rate of a link might be exceeded. Therefore the SDN-Controller keeps track of all active traffic flows within the network along with their routes and priorities. For this approach, the following algorithm has been developed (a simplified sketch follows the list):

1) Shortest Path Calculation: First, the shortest route is calculated for arriving packets.

2) Data Rate Demand: For each link, the SDN-Controller determines the QoS demand of the new flow and compares its associated minimum data rate with the available data rate of the link.

3) Priority Grouping: If the data rate is sufficient on all links, the flow is added; otherwise the priority of the new flow and of those already active on the respective link is compared. Subsequently, flows are grouped into flows with higher or same priority and those with lower priority.

4) Re-routing of New Flow: If the link capacity is insufficient for the new and all higher/same-priority flows, the second-best route is calculated and tested, restarting at step (2).

[Fig. 6: Scenario 2b: Link Reservation at Switch 3. Data rate [Mbps] over time [s] at Switch 2 (top) and Switch 3 (bottom) for MMS traffic, background data transfer, background real-time traffic, and total traffic; annotations mark the re-routing of background data and real-time traffic and the link reservation for Smart Grid traffic.]

5) Sorting of Lower Priority Flows: Otherwise, lower-priority flows are sorted into a list depending on their overlap with the route of the new flow and their priority, defining an order for re-routing.

6) Determination of Lower Priority Flows for Re-Routing: While link capacity is insufficient for the higher, same, and remaining lower-priority flows, the next lower-priority flow is marked for re-routing and removed from the list.

7) Unmodified Lower Priority Flows: All flows remaining in the list maintain their current route.

8) Calculation of Alternative Routes: New routes are calculated on a virtual topology, excluding the links of the new flow's route, for flows that have to be shifted. Afterwards, for each pair of flow and alternative route, all steps starting from (2) are repeated.

9) Drop of Lower Priority Flows: If no alternate routes are available, flows are dropped.

10) Re-routing of Lower Priority Flows: Otherwise, OFFlowMod messages are prepared for establishing new flow entries at the switches for the affected lower-priority traffic flows.
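The admission logic of steps (1) through (4) can be condensed into a small, self-contained sketch. This is not the controller code from [5]; the topology, flow rates, and priorities are hypothetical, and the lower-priority re-routing of steps (5) through (10) is reduced here to a simple retry over alternative routes.

```python
# Simplified sketch of the Scenario 2a admission check (steps 1-4), not the
# actual SDN-Controller code from [5]. Topology, capacities, minimum data
# rates and priorities are hypothetical assumptions; handling of lower
# priority flows (steps 5-10) is reduced to trying alternative routes only.

LINK_CAPACITY_MBPS = 95.0   # 100 Mbps minus 5 Mbps reserved for control traffic

def links_of(route):
    return list(zip(route, route[1:]))

def used_rate(link, active_flows):
    return sum(f["rate"] for f in active_flows if link in links_of(f["route"]))

def try_admit(new_flow, candidate_routes, active_flows):
    """Steps 1-4: walk candidate routes in order (shortest first) and admit the
    new flow on the first route whose links all have enough spare capacity."""
    for route in candidate_routes:                        # step 1 / step 4 retry
        ok = all(used_rate(link, active_flows) + new_flow["rate"]
                 <= LINK_CAPACITY_MBPS                    # step 2: rate demand
                 for link in links_of(route))
        if ok:
            new_flow["route"] = route
            active_flows.append(new_flow)                 # flow is added
            return route
    return None   # would continue with priority grouping / re-routing (steps 3, 5-10)

if __name__ == "__main__":
    active = [{"name": "background", "rate": 60.0, "prio": 1,
               "route": ["S3", "S4"]}]
    mms = {"name": "MMS", "rate": 40.0, "prio": 3}
    # The shortest route S3-S4 is too loaded, so the second-best route is used.
    print(try_admit(mms, [["S3", "S4"], ["S3", "S1", "S2", "S4"]], active))
```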

Figure 5 shows the active transmissions and the corresponding data rates at Switch 2 (top) and Switch 3 (bottom) when applying this approach. At the beginning, background data traffic is transmitted via Switch 3, exploiting all available data rate. After 30 s, background real-time traffic is added to the network. Since the link between Switch 3 and 4 is used by this transmission as well, the data rate of the background data traffic is reduced accordingly. Next, background data traffic needs to be re-routed via Switch 1 and 2 in order to enable high-priority MMS reports on the same link after 105 s. Otherwise, adding MMS reports would cause the sum of the minimum data rate requirements to exceed the available capacity of the link between Switch 3 and 4, which is prevented by shifting the lowest-priority traffic flow.

Scenario 2b: Dedicated Links for Time-Critical Services

For the second approach, critical services have been specified which are granted dedicated links for data exchange. Accordingly, the previous algorithm has been extended as follows:


Images: [5]

IV. TEST SETUP AND RESULTS OF SDN-ENABLED SMART GRID COMMUNICATION

Starting with the testbed setup, this section reveals extended possibilities of SDN-based networks by means of two different scenarios. Following the scenario descriptions, algorithms for establishing the required capabilities at the SDN-Controller and the analysis results are presented.

A. Testbed Setup

The SDN4SmartGrids testbed comprises four switches, one SDN-controller, two servers for generating traffic, and one client for receiving traffic, as shown in Figure 2. As for the switches, four identical workstations have been used, running Open vSwitch 2.1.0 [4] as a kernel module on the basis of 64-bit Ubuntu 13.04 Server (3.8.0-30-generic kernel). Each switch is equipped with one integrated Intel I217-LM 1000Base-T (IEEE 802.3ab) Ethernet controller for connecting to the control network and one 4-port Intel I350 1000Base-T Ethernet controller for operating within the data network. The SDN-controller has been developed by enhancing the OpenFlow controller Beacon 1.0.4 [12], based on Java JDK 1.7 running on Windows 7 x64. Client and server workstations connect to the data network with onboard 1000Base-T RTL8167 Realtek adapters, also using Windows 7 x64 as the OS. IEC 61850 compliant traffic has been generated using the open-source library libIEC61850 [13], written in standard C, for MMS reports, as well as a C implementation of the SV service for SV messages, developed at the Communication Networks Institute. For testing, both SV and MMS messages are sent at intervals of 1 ms, using packet sizes of 122 Byte and 684 Byte.

B. Scenario 1: Fast Recovery for Smart Grid Communications

This scenario deals with enabling fast recovery after disturbance of a communication link. Providing such functionality is of great importance for ensuring reliable operation of communication networks in critical environments such as substations of power systems. In particular, alternative routes through the network need to be established immediately, guaranteeing the transmission of monitoring and control traffic. Therefore, a proactive algorithm for calculating alternative paths through the network has been developed and integrated into the SDN-Controller. The algorithm's performance has been assessed by measuring the duration of traffic interruption as well as processing times at the controller and switches. First, a brief description of the algorithm, which is applied for dealing with communication link disconnection, is given (a simplified sketch follows the list):

[Fig. 2: Setup of the SDN4SmartGrids testbed with data and control network: an SDN Controller attached via the control network to Switches 1-4, with Server 1, Server 2, and Client 1 attached to the data network.]

[Fig. 3: MMS-TCP flowchart showing different recovery cases (MMS Cases 1-3): server-client data and ACK exchanges, duplicate ACKs, and retransmission time-outs (RTO) around the link disturbance.]

1) Alternative shortest paths: Alternatives are calculated for each pair of route and possible link failure, using alternative topologies which exclude the respective link.

2) Mapping to switch configurations: The alternative paths are converted into switch configurations, preparing corresponding OFFlowMod messages for adding, reconfiguring, or deleting traffic flow entries at the switches.

3) Monitoring of active flows: The SDN-Controller keeps track of active traffic flows and their routes, considering OFFlowRemoved messages received from the switches.

4) Port status notification: In case of a link failure, an OFPortStatus message is issued by the switches connected to the respective link and sent to the SDN-Controller.

5) Re-routing: Pre-calculated alternative routes are looked up for all affected traffic flows and corresponding OFFlowMods are sent out to the switches for re-routing.
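A condensed, self-contained sketch of steps (1), (4), and (5): precomputing per-link alternative routes and looking them up on a port-status event. This is not the Beacon-based controller from [5]; the topology and flows are hypothetical.

```python
# Condensed sketch of the proactive recovery idea (steps 1, 4, 5), not the
# Beacon-based controller from [5]. Topology and flows are hypothetical.
from collections import deque

TOPOLOGY = {  # assumed adjacency of the four-switch testbed
    "S1": ["S2", "S4"], "S2": ["S1", "S3"], "S3": ["S2", "S4"], "S4": ["S1", "S3"],
}
FLOWS = {"MMS": ("S1", "S4")}   # assumed flow from the Server 1 side to Client 1

def bfs_route(adj, src, dst, excluded_link=None):
    """Shortest hop-count route that avoids the excluded link, if any."""
    banned = {excluded_link, tuple(reversed(excluded_link))} if excluded_link else set()
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in adj[path[-1]]:
            if (path[-1], nxt) not in banned and nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def precompute_alternatives(adj, flows):
    """Step 1: for every flow and every link on its primary route, store the
    alternative route calculated on a topology excluding that link."""
    table = {}
    for name, (src, dst) in flows.items():
        primary = bfs_route(adj, src, dst)
        for link in zip(primary, primary[1:]):
            table[(name, link)] = bfs_route(adj, src, dst, excluded_link=link)
    return table

def on_port_status(failed_link, alternatives):
    """Steps 4-5: on a link-down notification, look up the pre-calculated
    routes; the controller would then emit the corresponding OFFlowMods."""
    return {flow: route for (flow, link), route in alternatives.items()
            if link in (failed_link, tuple(reversed(failed_link)))}

if __name__ == "__main__":
    alt = precompute_alternatives(TOPOLOGY, FLOWS)
    print(on_port_status(("S4", "S1"), alt))  # {'MMS': ['S1', 'S2', 'S3', 'S4']}
```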

Applying this algorithm, multiple measurements have been executed to verify and analyse its effect. The experimental design of this scenario can be conceived by means of Figure 2: MMS reports and SV messages, both containing measurement values, are transmitted from Server 1 to Client 1 using either the upper path (via Switch 2) or the lower path (via Switch 3). During transmission, one of the active communication links is disconnected either a) by physical disconnection of an interface or b) by software command. Figure 4 (top) shows the overall recovery times of SV and MMS traffic, in terms of a cumulative distribution function, considering both cases. In the case of SV messages and link disconnection by command, the mean down time of transmission amounts to 87.16 ms (median: 85.17 ms), whereas for physical link disconnection the mean delay increases to 360.64 ms (median: 358.80 ms). Hence, port status detection by the OS induces additional delay in the range of 210 to 305 ms. A more complex behaviour can be observed for TCP-based MMS reports due to reliability mechanisms, which apply acknowledgements (ACKs) and retransmissions. Accordingly, the recovery time depends on the following TCP-specific parameters, which are set to the Windows 7 default configuration: Retransmission Time-Out (RTO) (300 ms), acknowledgement frequency (2 packets), and delayed acknowledgement timer (50 ms). Therefore, Figure 3 distinguishes three different cases which might occur when a link is disconnected during TCP-based communication, explaining the effects encountered in Figure 4. MMS Case 1: In Case 1, the link is disconnected before or during the transmission window's first packet transfer. The packet will not be received by the client and no ACK is issued, wherefore retransmission begins after the RTO elapses. This case applies to MMS reports with recovery times in the range of 300 ms after the link is disconnected by command.


Page 23

Problems with SDN for IoT: Self-Optimization

• Objective
  • Reduce energy usage for network infrastructure. [1]
  • Accounted for 8% of total energy consumption in 2008, projected 14% by 2020.
• Motivating Application
  • Campus Network
    • Existing infrastructure requires 24/7 uptime on all networking hardware, configured to handle peak traffic.
    • Real traffic occurs in patterns.
      • Campus traffic is significantly higher during the day.

Image: Institute for Communication Systems, University of Surrey, UK.

http://www.surrey.ac.uk/ics/activity/facilities/futureinternet/  

Page 24

Problems with SDN for IoT: Self-Optimization

• Determine Minimum Switches to Power
  • NP-Hard, but can be formulated as a MILP
  • Multiple Constraints
• Strategic Greedy Heuristic (a sketch of the ordering strategies follows below)
  • Looks for best paths on active nodes; will activate other nodes if needed.
  • Shortest Shortest Path First (SPF)
  • Longest Shortest Path First (LPF)
  • Smallest Demand First (SDF)
  • Highest Demand First (HDF)
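The following self-contained sketch (not the authors' implementation from [1]) illustrates the Strategic Greedy idea: demands are sorted by one of the four strategies and routed preferentially over already-activated nodes and links, powering on new equipment only when necessary. Topology, demands, and the cost model are hypothetical assumptions.

```python
# Illustrative sketch of the Strategic Greedy heuristic (not the code from [1]).
# Demands are ordered by a chosen strategy (SPF/LPF/SDF/HDF) and routed so that
# already-active nodes are preferred; new gear is switched on only when needed.
# Topology, demands, and edge weights are hypothetical assumptions.
import heapq

def dijkstra(adj, src, dst, active, penalty=10.0):
    """Shortest path where hops onto inactive nodes cost extra, steering
    routes onto already powered-on equipment."""
    heap, best = [(0.0, src, [src])], {src: 0.0}
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        for nxt, w in adj[node]:
            extra = 0.0 if nxt in active else penalty   # activating costs energy
            c = cost + w + extra
            if c < best.get(nxt, float("inf")):
                best[nxt] = c
                heapq.heappush(heap, (c, nxt, path + [nxt]))
    return None

def strategic_greedy(adj, demands, strategy="LPF"):
    """Route all (src, dst, rate) demands in the order given by the strategy
    and return the set of nodes that must stay powered on."""
    def hops(d):
        return len(dijkstra(adj, d[0], d[1], active=set(adj)))
    key = {"SPF": hops, "LPF": lambda d: -hops(d),
           "SDF": lambda d: d[2], "HDF": lambda d: -d[2]}[strategy]
    active, routes = set(), []
    for src, dst, rate in sorted(demands, key=key):
        path = dijkstra(adj, src, dst, active)
        active.update(path)                    # power on everything on the path
        routes.append((src, dst, path))
    return active, routes

if __name__ == "__main__":
    adj = {"A": [("B", 1), ("C", 1)], "B": [("A", 1), ("D", 1)],
           "C": [("A", 1), ("D", 1)], "D": [("B", 1), ("C", 1)]}
    demands = [("A", "D", 20), ("A", "B", 5)]
    on, routes = strategic_greedy(adj, demands, strategy="LPF")
    print(sorted(on), routes)
```

With LPF, the long A-D demand is routed first, and the short A-B demand then reuses the already powered-on nodes, which is exactly why LPF tends to activate fewer devices than SPF.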

Fig. 2. Campus network to be investigated

The campus network is built of four different switch types: core, distribution, WAN, and access switches, denoted by decreasingly sized red dots in this order. Table I displays the energy consumption of the four different types of switches used in the campus network evaluation. The last column, Bandwidth (Gbit/s), corresponds to the bandwidth of the four link types used within the campus network. We bundle multiple real links into one logical link, so that shutting down one of these logical links is equivalent to saving the power usage of one respective line card inserted in the chassis of the switch. The four types can be distinguished by the thickness of the line representing the logical link in Figure 2. Core links interconnect core switches, whereas distribution links connect the core switches to the distribution nodes. WAN links can be found between the distribution layer and the WAN switches in the bottom left and right hand corners.

TABLE I. HARDWARE INFORMATION OF THE CAMPUS NETWORK

Type          Switch e.c. (W)   Line Card e.c. (W)   Bandwidth (Gbit/s)
Core          2000              150                  400
Distribution  1000              80                   150
WAN           700               50                   100
Access        200               -/- (5 W per port)   50

As we can see from Table I, core link bandwidth is not proportional to the line card energy usage of the distribution and WAN switches. The reason is that core line cards usually incorporate optical links, which consume less energy than copper-wired port links.

In contrast to the campus network, our mesh network is built in a homogeneous manner. More specifically, each node consumes the same amount of energy and each link interconnecting two nodes provides 150 Gbit/s of bandwidth.

A. Proof of concept

We randomly generated traffic demands at three levels: low traffic load (nighttime traffic), medium traffic load (average daytime traffic), and high traffic load (yearly peak traffic). Traffic utilization during the day roughly triples compared to the demands at nighttime [16]. Accordingly, we set the medium traffic load to three times the low traffic load, and the peak traffic to five times.

Fig. 3. Optimal network configuration for a low traffic utilization at night

Figure 3 displays an example of the traffic flow and network state in a low traffic utilization scenario. This traffic demand distribution corresponds to low demand as it occurs during nighttime. Yellow nodes display nodes with demands, whereas blue nodes denote switches turned on to forward flows. Blue links denote turned-on line cards and bundles of links of the respective switches. Everything colored grey is currently shut down.

Fig. 4. Optimal network configuration for a mid traffic utilization during daytime

An example of a campus network configuration at daytime traffic load is displayed in Figure 4. In Figures 5 and 6, we can see the potential energy savings in a campus network and a mesh network, respectively. These are the results from the Cplex solver and the heuristic LPF. At night time, we are able to save up to ca. 45% of energy compared to an "always online" network. An important aspect to highlight is that in realistic mesh networks (e.g., WANs) the average network utilization is just around 30% [17], and therefore even higher energy savings are possible.

The Strategic Greedy LPF method results in an optimal solution for low traffic load. In the case of medium and high traffic load, it is ca. 7% worse than the MILP in the mesh network and ca. 4% worse in the campus network.


Image: [1]


Page 26

Problems with SDN for IoT: Self-Optimization

• Evaluation
  • LPF provides the greatest energy savings.
    • When the longest paths are addressed first, a subset of the spanning tree is produced.
  • In SPF, this creates many disconnected fragments, which need to be connected later, resulting in a sub-optimal graph.

Image: [1]

Fig. 5. Performance of optimal solution and heuristic in a campus network

Fig. 6. Performance of optimal solution and heuristic in a mesh network

Furthermore, we observe that the energy saving in the mesh network decreases linearly, while this is not the case for the campus network. This is because of the inhomogeneous energy consumption of switches in the campus network. When the traffic reaches a certain boundary, new core or distribution switches have to be powered on, resulting in a sharp increase in energy consumption.

B. Strategic Greedy Heuristic performance

The performance of the four heuristic algorithms is shown in Figures 7 and 8 for the mesh and campus network, respectively.

Fig. 7. Average energy savings for different strategies in a mesh network

For the mesh network, LPF outperforms the other three by up to 5% more total energy saving. This clearly makes sense if we think about the geometry. When the demands with the longest paths are accommodated first, a subset of a spanning tree is produced. Afterwards, LPF is able to efficiently allocate smaller paths that make use of the already turned-on links and nodes. However, if we allocate demands with the shortest paths first, small distributed path fragments are activated throughout the network. Node pairs with longer paths will not be able to efficiently allocate their paths along these randomly positioned short routes, and therefore the overall performance of SPF suffers. When the network size increases further, this effect is expected to become even more pronounced, as smaller paths will be more distributed and it will be even harder to make use of already turned-on network gear.

Fig. 8. Average energy savings for different strategies in a campus network

For the campus network, all algorithms except SPF perform nearly the same at low traffic load. When traffic increases, LPF tends to outperform the others by up to 7% total energy saving, for the same reason mentioned above.

C. Robustness of heuristic algorithms

This section investigates the robustness of the four heuristic algorithms. Figure 9 shows that all four strategies were always able to find a valid network configuration while routing up to 600 Gbit/s overall in the campus network (mid traffic load). In terms of robustness in high peak-load scenarios, it makes sense to allocate resources for node pairs with short paths first, as Figure 9 shows.

Fig. 9. Robustness of the Strategic Greedy heuristic for different strategies in a campus network

The situation, however, changes in mesh networks, as this topology is not designed in a hierarchical way. Figure 10 shows that SPF performed by far the worst and only found a valid configuration in 40 out of 100 different traffic scenarios.


Page 27

Problems with SDN for IoT: Self-Optimization

• Objective
  • Enhance User Experience. [3]
    • QoS is critical; however, user-perceivable experience is now important for user-centric, mobile networking.
    • Quality of Experience (QoE)
• Motivating Application
  • Smart Home
    • Multiple QoE levels
      • Distance Learning
      • Health Provider Link
      • Online Gaming

entries in the flow table, bandwidth) and accepts or rejects the requisitions according to the availability of resources.

4.2.2 Overview Network Application
Follows the up/down state of the network switches and their ports by listening to the asynchronous messages exchanged between the OpenFlow controller and the switches. Data collected by this application helps the management layer to keep a global view of the network.

4.2.3 Route Calculation Application
Determines the available path(s), calculating the route(s) based on the "rules control" of the Mapping Rules. The adequacy of a route for fulfilling the solicitation is determined by the network topology and by a set of performance metrics, such as latency.

4.2.4 Statistics Counter
Uses a mix of passive and active monitoring for measuring different network metrics at different aggregation levels. It also measures flow latency, error rate, and jitter by inserting probe packets into the network.

4.2.5 Mapping Rules
Used by the management layer for translating high-level network policies into control rules. The controller and the control applications (for instance, routing) use those rules for calculating the entries in the flow table of each switch.

4.3 Data Layer
Composed of network devices with some programmable protocol enabled. OpenFlow [23] was the first open protocol to provide standard APIs for managing the data layer of the switching structure from the controller. The controller uses the OpenFlow API to install forwarding rules in the flow tables of switches, discover the network topology, monitor flow statistics, and observe the up/down status of active devices and flows.

OpenFlow provides support for monitoring and controlling QoS. The packets belonging to a flow may be queued in a particular queue of an output interface. Controllers may consult the configuration parameters and the statistics of queues. Switches may rewrite the ToS field in the IP header. Support for explicit congestion notification is also included in OpenFlow.

ONF [24] created OF-Config to support the configuration of several characteristics of an OpenFlow switch. OF-Config [25] may be used to configure the minimum and maximum transmission rates of a queue in an OpenFlow switch, and an OpenFlow controller may also read those rates from a switch. The most recent additions regarding QoS in OpenFlow are the meter tables and meter bands, which may be used to limit the transmission rate of an output interface.
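To make the meter-band idea concrete, the sketch below installs an OpenFlow 1.3 meter that drops traffic above a configured rate. It uses the Ryu controller framework only because meters are compact to express there (it is not the controller used in the testbed of [3]), and 'dp' is assumed to be a datapath handle obtained inside a running Ryu application.

def install_rate_limit(dp, meter_id=1, rate_kbps=2000):
    ofp, parser = dp.ofproto, dp.ofproto_parser
    # Drop everything above ~2 Mbit/s for flows that reference this meter
    band = parser.OFPMeterBandDrop(rate=rate_kbps, burst_size=10)
    dp.send_msg(parser.OFPMeterMod(datapath=dp,
                                   command=ofp.OFPMC_ADD,
                                   flags=ofp.OFPMF_KBPS,
                                   meter_id=meter_id,
                                   bands=[band]))
    # Flows to be limited are then installed with an OFPInstructionMeter(meter_id)
    # in their instruction list, so matching packets pass through the meter first.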

Regarding monitoring, an OpenFlow controller may query a switch in order to obtain statistics at different levels of aggregation: by table, by flow, by interface, and by queue. In summary, OpenFlow provides extensive support for configuring and monitoring QoS, facilitating the control of complex situations introduced by distributed routing and traffic engineering mechanisms.

5. TESTBED AND EXPERIMENTAL RESULTS. This section presents the scenario, the main elements of the proposed architecture (Figure 4) used in the testbed, and the experiments carried out to show how the QoE/QoS management mechanism reacts after detecting nonconformities in the service provided. An overview of the network topology used in the experiments can be seen in Figure 5.

Figure 5. Topology of the network used in the experiments: wireless or wired clients (laptop, desktop, game machine) behind a Home Gateway in the Home Area Network, connected through the access, metropolitan, and wide area networks to the service provider's remote servers for gaming services, teleconsultation (formal and informal carers), and tele-education.

5.1 Description of the Scenario The scenario consists of three sub-categories of services, provided via an IPTV provider in a HAN (Home Area Network), as follows: Teleconsultation, tele-education and GoD (Game on Demand).

The Teleconsultation sub-service, offered by a third party, is used by Alice, a remotely assisted patient who often needs to consult with health caregivers and send data, vital signs, and monitoring images to a remote service unit. The appointment is held via videoconference between the patient and the caregiver. Vital signs are collected using the eHealth platform [6]. Images of the patient are sent via cameras. The configuration of the sensors and the evaluation of the context parameters in a Ubiquitous Assisted Environment were discussed in our previous research work [22]. The Tele-education VoD (Video on Demand) sub-service is used by Bob, a distance-education undergraduate student. The GoD sub-service is used by another user who is keen on network games.

5.2 Home Area Network Four machines were used in the HAN. eHealth flows from Host 4 (application server) to Host 1 (Alice’s machine). Teleconsultation transits between Host 1 (Alice’s machine) and Host 5 (application server). Biomedical signals are sent from Host 1 to Host 5. Tele-education flows from Host 5 to Host 2 (Bob’s machine). GoD flows from Host 5 to Host 3 (another user’s machine). Host 4 was used to generate background traffic from Host 4 to Host 5.

The users' machines ran an IPTV client application with an interface for accessing the services and a module to capture the QoE impact dimensions. The client was prototyped in Java, using the SAWSDL framework for performing semantic annotations. The information in the terminal reports, including the values of the QoE/QoS performance parameters, is transported in XML using a variation of the GSQR protocol [9]. The module was configured so that, from time to time, the terminal sends the QoE and QoS parameters from these reports to the service provider's QoE/QoS management application.


Image: [3]

Page 28

Problems with SDN for IoT: Self-Optimization

• QoE
  • Subjective evaluation of the user's perceptions and expectations.
  • Need to meet hedonic and pragmatic needs.
• Autonomically adapt the resources on the network based on the user experience.
  • Measured from the client machine
  • Reported to the controller
• User experience information stored in a Knowledge Base
• Compares expected quality to reported quality for QoE feedback.

Image: [3]

Figure 4. Functional Architecture for QoE Management: a Management Layer (QoS/QoE management application with a QoE-aware semantic engine, Knowledge Base, policy base, and subscriber database, plus application servers for eHealth, VoIP, VoD, games, and news, and a user interface for capturing QoE dimensions, QoE evaluation, QoS measurements, and the QoE ontology), a Control Layer (standard SDN controller with application modules for admission control, network overview, route calculation, and statistics counting, and a Mapping Rules base), and a Data Layer (OpenFlow switches), connected through a Northbound REST API and a Southbound OpenFlow API.

4.1 Management Layer. This layer contains the business applications of the organization, which offer network services such as virtualization, firewalls, load balancers, QoS/QoE managers, and others. Any application at this level communicates with the standard SDN controller, or with an application module of the control layer, using the Northbound API. The main contribution in this layer is the semantic engine, which captures information about the QoE dimensions and learns from the user's experience. The engine is made up of the elements described below.

4.1.1 QoE Ontology. Described in Section 3, it is used to unify concepts, easing reasoning about the user's experience when using the service and allowing inferences. The ontology was built in Protégé 4.0.

4.1.2 Module for Capturing QoE Dimensions. This module was implemented using the SAWSDL framework to Extract, Transform and Load (ETL) the information in the reports from the users' terminals. Information in the reports is structured in XML format, referencing the concepts of the QoE ontology. Data transport from the terminal uses the GSQR protocol [9]. When the reports arrive at the QoE management system, they pass through ETL processing and the information is persisted in the Knowledge Base (KB) of the server hosting the corresponding service, as instances in the Web Ontology Language (OWL).

4.1.3 Module for Monitoring and Measuring QoE/QoS Parameters. Responsible for monitoring and measuring the values of qualitative (user) and quantitative (network performance) metrics at the client side. With the knowledge about each user persisted in the KB, this module applies a mapping function over the QoE/QoS parameters for each service (e.g. VoIP, IPTV). For the VoIP service, the QoE/QoS mapping described in our previous work [4] may be used. For the IPTV service, a QoE/QoS mapping function based on linear regression, described in [12], may be used, as shown in Equation (1).

MOS = α Thr + β Jt + γ Plr + ε (1)

Thr is the throughput, Jt the jitter, and Plr the packet loss rate. The coefficients α, β, γ, and ε are calculated separately for each case.
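A small worked illustration of Equation (1) follows; the coefficient values are invented placeholders, since [12] fits them per service and codec through regression.

def mos(throughput_kbps, jitter_ms, plr, alpha=0.004, beta=-0.05, gamma=-8.0, eps=1.0):
    """MOS = alpha*Thr + beta*Jt + gamma*Plr + eps, clamped to the usual 1..5 scale."""
    score = alpha * throughput_kbps + beta * jitter_ms + gamma * plr + eps
    return max(1.0, min(5.0, score))

# e.g. mos(673, 0, 0) ~= 3.7 and mos(123, 15, 0.02) ~= 0.6 -> clamped to 1.0,
# purely with the made-up coefficients above.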

With all information captured about the QoE dimensions, plus user and network status information, the semantic engine learns about the user’s experience using a service and is able to provide information of QoE degradation to the network controller. Table 2 illustrates examples of the KB records for video streaming.

Table 2. KB instances used for QoE learning

Name  | Equipment | Activity  | Physical | Type     | Codec | Resolution | Delay | Jitter | PLR | Bitrate | MOS
Alice | notebook  | work-shy  | Room     | Action   | MPEG4 | 1366x768   | 0     | 0      | 0   | 673202  | 5.0
Alice | mobile    | strolling | Office   | Talkshow | MPEG4 | 1280x720   | 0     | 0      | 0   | 423606  | 4.0
Bob   | notebook  | work-shy  | Room     | Action   | MPEG4 | 1366x768   | 0     | 0      | 0   | 423601  | 4.0
Bob   | mobile    | strolling | Office   | Talkshow | MPEG4 | 1280x720   | 20 ms | 15 ms  | 0   | 123200  | 3.0

(The columns cover the user, the user context, the content, the application- and network-level QoS, and the resulting MOS.)
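For illustration, a KB record such as the first row of Table 2 could be persisted as an OWL/RDF instance roughly as follows; the namespace IRI and property names are invented stand-ins, since the excerpt does not publish the actual vocabulary of the QoE ontology.

from rdflib import Graph, Literal, Namespace, RDF

QOE = Namespace("http://example.org/qoe#")   # placeholder namespace
g = Graph()
rec = QOE.record1
g.add((rec, RDF.type, QOE.VideoStreamingRecord))
g.add((rec, QOE.userName, Literal("Alice")))
g.add((rec, QOE.equipment, Literal("notebook")))
g.add((rec, QOE.resolution, Literal("1366x768")))
g.add((rec, QOE.bitrate, Literal(673202)))
g.add((rec, QOE.mos, Literal(5.0)))
print(g.serialize(format="turtle"))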

4.1.4 Module for Analyzing and Verifying QoE. This module is composed of semantic rules used to analyze and verify QoE degradations. The semantic engine compares the expected value of each metric with the minimum and maximum limits of the policies defined by the administrator (policies ontology). When a degradation is detected, the semantic engine triggers an event for the QoE/QoS management application, which queries its policy database for the actions to be performed for that metric. Table 3, adapted from [1], illustrates some examples of policy adaptations that may be applied to optimize QoE.
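A minimal sketch of the threshold check just described: compare a measured metric against the policy limits and emit an event on violation. The policy structure and the notification callback are hypothetical stand-ins for the policy ontology and the QoE/QoS management application.

def check_qoe(metric, measured, policy, notify):
    low, high = policy[metric]["min"], policy[metric]["max"]
    if not (low <= measured <= high):
        # degradation detected: hand the event to the management application
        notify({"metric": metric, "value": measured, "expected": (low, high)})

policies = {"MOS": {"min": 3.5, "max": 5.0}}
check_qoe("MOS", 2.8, policies, notify=print)   # triggers the degradation event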

Table 3. Examples of policy adaptations for optimizing QoE

QoS metric   | Policy adaptation actions
Packet drops | Change the queue configuration; forward the flow through an alternative route
Throughput   | Change the rate limits of flows saturating the bandwidth; forward the flow through an alternative route
Latency      | Change the switch's queue configuration; plan the transmission of flows through a less congested route with adequate delay
Jitter       | Forward the flow through a less congested route

To apply the actions, the QoE/QoS management application communicates with the control layer using the Northbound API. At this point the high-level rules are converted and persisted as control rules in the Mapping Rules module of the control layer. After the control layer verifies the available resources, the actions are applied on the switch interfaces (queues, routes, flow limiters) using the Southbound API.

4.2 Control Layer. Composed of the SDN controller, which communicates with the switches in the data layer via Southbound APIs, using, for instance, the OpenFlow protocol. In order to allow automatic configuration for optimizing QoE, this layer needs some control applications in addition to a standard SDN controller, such as those proposed in [1] and described hereinafter. Note that the contribution of this paper is not the implementation of those modules, but the mechanisms described in the management layer, which in turn provide the information the control layer modules need to adapt the network policies.

4.2.1 Admission Control Application. Receives requests for provisioning resources from the management layer, analyzes the network resources (i.e. queues,


Page 29

Problems with SDN for IoT: Self-Optimization

• QoE
  • Mean Opinion Score (MOS)
  • Calculated using factors of throughput, jitter, and packet loss rate.
  • KB stores information about each user to build context.

Image: [3]


Page 30

Problems with SDN for IoT: Self-Optimization

• QoE Evaluation
  • Background traffic was only induced in the final test.
  • Flow rates are limited based on MOS to maintain acceptable levels.

Image: [3]

For the Home Gateway, a TP-Link device with 10 Mbps capacity was used, emulating an ADSL/Cable/PON service.

5.3 Network Provider and IPTV Service. In the experiments, the network and service providers are represented by the same entity. The provider supplies both the service and the access network infrastructure.

To simulate the services provided, an application server was implemented and configured on 64-bit Ubuntu Linux 13.04. On the server, videos and games were made available so that users could watch and play over the network. An eHealth application was used to collect data about patients' vital signs. Alice's biomedical data can be requested at any time by the health caregiver before, during, or after a teleconsultation.

On the same physical machine, the QoE/QoS management application was configured with a semantic engine, a KB (Knowledge Base), and a base of network adaptation policies. The management application uses a REST API to communicate with the controller.

The Floodlight controller was installed and used as the OpenFlow controller. It was adopted because it has a set of modules and applications that, together with the OpenFlow API and the OF-Config protocol, allow visualizing the network topology and the status of devices, changing the forwarding tables, and verifying active flows, among other functionalities. For this scenario, the Topology Management, Static Flow Entry Pusher, and Counter Store modules of Floodlight were used, respectively, for verifying the network topology, installing and removing flow entries on a given switch (and thus forwarding the flows), and generating statistics.
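For illustration, a forwarding rule can be pushed to the Static Flow Entry Pusher over Floodlight's REST interface roughly as follows. The REST path and field names differ across Floodlight releases (older builds expose /wm/staticflowentrypusher/json with different key names), and the DPID, match fields, and output port below are placeholders rather than the rules used in these experiments.

import json, requests

CONTROLLER = "http://127.0.0.1:8080"   # assumed controller address

rule = {
    "switch":   "00:00:00:00:00:00:00:01",   # e.g. S1 in Figure 5 (placeholder DPID)
    "name":     "teleconsultation-via-route2",
    "priority": "32768",
    "eth_type": "0x0800",
    "ipv4_dst": "10.0.0.5",                   # placeholder destination
    "active":   "true",
    "actions":  "output=3",                   # egress port towards S3 (Route 2)
}
requests.post(CONTROLLER + "/wm/staticflowpusher/json", data=json.dumps(rule))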

For monitoring the performance parameters of the transport network, Linux tools and the Counter Store module of the Floodlight controller were used. The monitoring output was used as input for the controller.

In order to emulate the transmission network, a mesh of four Open vSwitch (OVS) instances was used, installed on the 64-bit Ubuntu 13.04 operating system. Each OVS runs on a separate physical machine, interconnected using GRE (Generic Routing Encapsulation) tunnels. The Linux tc command was used to set the maximum capacity and the delay of each link. All links were configured with a maximum capacity of 10 Mbps. Two routes were created (Route 1: S1-S2-S4 and Route 2: S1-S3-S4) to send flows through alternative paths. By default, all flows follow Route 1.
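A rough sketch of the kind of tc invocation used to cap link capacity and add delay is shown below; the interface name is a placeholder, the exact netem options vary across iproute2 versions, and the command needs root privileges, so this is an assumption about the shaping setup rather than the authors' exact commands.

import subprocess

def shape_link(dev="eth1", rate="10mbit", delay="5ms"):
    # Attach a netem qdisc that both delays packets and caps the egress rate
    subprocess.run(
        ["tc", "qdisc", "add", "dev", dev, "root", "netem",
         "delay", delay, "rate", rate],
        check=True,
    )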

5.4 Experimental Results. In the experiments, the QoE of the flows in the active sessions of three users was analyzed, using different applications, in a HAN consuming three sub-services from the IPTV provider. At opportune moments, background traffic is generated in order to ascertain the behavior of the proposed mechanism. Four experiments were performed in the laboratory.

In the first experiment, the QoE management control mechanism was not activated, and the applications competed indiscriminately for the total available bandwidth, according to the results measured and plotted in Figure 6.

Figure 6. Throughput results of measurements without the QoE/QoS mechanism (throughput in Mbps over 300 seconds for the Tele-education, Teleconsultation, and Game on Demand flows).

The second experiment evaluated the same flows before and after enabling the QoS management mechanism. The flows with QoS guarantees correspond to 5 Mbps for Teleconsultation, 3 Mbps for Tele-education, and 2 Mbps for GoD. Up to 90 seconds all flows compete for the total available bandwidth; at 90 seconds, after activation of the QoS control mechanism, each flow receives its guaranteed share, as can be seen in Figure 7.

Figure 7. Throughput results of measurements with the QoS control mechanism (throughput in Mbps over 300 seconds for the Tele-education, Teleconsultation, and Game on Demand flows).

The third experiment tested the throughput adaptation policy with an alternate route. QoS guarantees of 4.5 Mbps, 2.5 Mbps, 2 Mbps, and 1.0 Mbps were established, respectively, for Teleconsultation, Tele-education, GoD, and Biomedical Signals. All flows were transported over Route 1. In order to induce QoE degradation, background flows were programmed to transmit data at a rate of 3 Mbps during two different time intervals, from 12 to 30 seconds and from 40 to 60 seconds. In both instances, with such a transmission rate, the bandwidth was saturated and the services started to suffer degradation from 12 to 20 seconds and from 40 to 45 seconds. As the semantic engine was programmed to analyze the level of the service provided at intervals of 5 seconds, detection of the QoE degradation occurred at 15 seconds in the first case and at 40 seconds in the second. In both cases, the time between the detection of the QoE degradation and the restoration of the service level was 5 seconds. The measurements and occurrences of those events are plotted in Figure 8. For the two background-flow transmissions, when the semantic engine identified the violation of the metric, an event was triggered for the QoE/QoS application. The application queried its policy database, verified the kind of action to be performed and, using the Northbound API, communicated with the control layer. The high-level rule was mapped and persisted in the Mapping Rules database, and the Admission Control application was invoked. As this application has an overall view of the network, it verified the availability of alternate routes, and the Static Flow Entry Pusher module of the controller was triggered to change the forwarding tables on the switches and to forward the ill-behaved flows through Route 2 (S1-S3-S4), as shown in Figure 8.


Figure 8. Throughput adaptation policy after switching flows to the alternative route (throughput in Mbps over 60 seconds for Tele-education, Teleconsultation, Game on Demand, Biomedical Signals, and Background Traffic).

The fourth experiment tested the throughput adaptation policy with flow rate limiters and without an alternative route. In this experiment, only Route 1 was kept, to test the system's behavior when accommodating best-effort flows together with QoS-guaranteed flows under bandwidth saturation. QoS guarantees of 4.5 Mbps, 2.5 Mbps, and 2 Mbps were established, respectively, for Teleconsultation, Tele-education, and GoD. Up to 10 seconds the throughput was maintained in all sessions, but new sessions then compromised the bandwidth of the active flows until the problem was detected and fixed, as illustrated in Figure 9.

Figure 9. Throughput adaptation policy after changing the rate limits of flows (throughput in Mbps over 60 seconds for Tele-education, Teleconsultation, Game on Demand, heavy game downloads, and the eHealth software update).

From 11 seconds up to 50 seconds, Bob's terminal downloaded heavy games from the server. Programmed to verify the service level every 5 seconds, the semantic engine detected throughput degradation at 15 seconds, and at 20 seconds the transmission-rate limit for downloading games was changed; the residual bandwidth of 1 Mbps was used for this session until 30 seconds. At this time, Alice's terminal started to update the eHealth software, and from 30 to 60 seconds it updated the eHealth system from the server. For a short time (5 seconds) the services were provided with degradation; at 35 seconds normalization occurred, and flows with QoS guarantees were maintained until the end of the experiment. From 35 to 60 seconds, the residual bandwidth of 1 Mbps was distributed among the new active flows. At 50 seconds, when the game download was completed, the residual 1 Mbps of bandwidth was allocated to the eHealth software update. The semantic engine detected violations in the values of the throughput metric and generated events for the QoE/QoS management application. As the flows are best effort and, in this experiment, there is no alternate route to divert them to, the application searched its policy database and found that the action to be performed consisted of changing the rate limits of the flows that were saturating the bandwidth. The control layer was triggered through the Northbound API; the high-level rule was converted and persisted in the Mapping Rules base; the admission control application verified the available resources; and, using the OpenFlow protocol through the Southbound API, the rate limits of the offending flows (game downloads and the eHealth software update) were changed on the switch interfaces.

6. CONSIDERATIONS AND FUTURE WORK
Many proposals for the provision and delivery of services also address QoE as an extension of QoS, where only network parameters are mapped to predict the level of the user's satisfaction. This QoE management mode contradicts the opinion poll presented in Section II and the main concepts of QoE in the literature. Concern about the quality of service from the user's perspective demands that QoE be conceptualized as a multidimensional construct, which encompasses the human, content, context, and technology dimensions and is therefore based on an interdisciplinary approach. This proposal is positioned under that perspective and differs from others found in the literature by being founded on several areas of knowledge and designed for the Future Internet.
To provide effective and comprehensive management of QoE, we propose a taxonomy of the dimensions that impact the quality perceived by the user, together with ways to model and represent this knowledge using ontologies. Information on the QoE impact dimensions is mapped and persisted in a knowledge base. The proposed semantic engine uses this information to learn the user's experience in the use of a service and, based on policies, can detect QoE degradations. The semantic elements, together with network control applications, were incorporated, prototyped, and tested in an SDN architecture. A usage scenario with IPTV sub-service consumption was presented and exercised. Two of the experiments tested the throughput-metric adaptation policies with and without the use of an alternate route, and with and without saturation of the total available bandwidth. In the experiments it was observed that the components of the SDN network, along with the proposed semantic mechanism, provide support for offering services that are aware of the user experience, and are able to detect, report, and correct QoE degradations without impacting the user-perceived quality. In the third and fourth experiments it was observed that the time to restore the QoE after the degradation is detected is 5 seconds; however, the total QoE restoration time can reach about 9 seconds, since the semantic engine is programmed to perform the inference process every 5 seconds. Our research continues with new experiments to evaluate the performance of the proposed mechanism with multiple flows and multiple users, considering that the testbed was run with a defined number of flows and users and that, in a realistic setting, these elements scale up. In addition, we intend to evaluate other QoS-metric adaptation policies and to combine new dimensions of the proposed taxonomy.

7. REFERENCES
[1] Bari, M.F., Chowdhury, S.R., Ahmed, R. and Boutaba, R. 2013. PolicyCop: An Autonomic QoS Policy Enforcement Framework for Software Defined Networks. 2013 IEEE SDN for Future Networks and Services (SDN4FNS) (Nov. 2013), 1–7.

[2] Bellavista, P., Corradi, A., Fanelli, M. and Foschini, L. 2012. A survey of context data distribution for mobile ubiquitous systems. ACM Computing Surveys. 44, 4 (2012), 1–45.

[3] Cardone, G., Corradi, A., Foschini, L. and Montanari, R. 2012. Socio-technical awareness to support recommendation and efficient delivery of IMS-enabled mobile services. IEEE Communications Magazine. 50, 6 (Jun. 2012), 82–90.


Page 31

Proposed IoT Architecture

• Proposed architecture to address basic IoT concerns [4]
  • Task-Resource Matching Module
    • Maps heterogeneous device resources through semantic modeling.
  • Service Solution Specification Module
    • Maps the characteristics of the devices involved in the proposed solutions to specific requirements for devices.
  • Flow Scheduling Module
    • Accesses the network state information and uses a Genetic Algorithm (GA) to schedule flows.
    • The GA maps naturally onto networking: nodes are genes, mutation and crossover are performed by replacing sub-paths, and the fitness value is the QoS of the flow (a minimal sketch follows below).
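A minimal sketch follows, assuming a networkx topology with hypothetical delay/jitter/capacity edge attributes: a path is a chromosome whose genes are nodes, crossover swaps the sub-paths after a node the two parent paths share, and the fitness scores a path with QoS-style weights (lower is better). This illustrates the idea from [4], not the authors' implementation.

import random
import networkx as nx

def crossover(path_a, path_b):
    """Swap the tails of two parent paths at a shared intermediate node."""
    common = [n for n in path_a[1:-1] if n in path_b[1:-1]]
    if not common:
        return path_a                      # no shared intermediate node: keep the parent
    cut = random.choice(common)
    child = path_a[:path_a.index(cut)] + path_b[path_b.index(cut):]
    return child if len(set(child)) == len(child) else path_a   # reject paths with loops

def fitness(graph, path, w_delay=1.0, w_jitter=0.0, w_thr=0.0):
    """Weighted QoS score of a path; the weights select which metric matters for a flow."""
    links = list(zip(path, path[1:]))
    delay = sum(graph[u][v]["delay"] for u, v in links)
    jitter = sum(graph[u][v]["jitter"] for u, v in links)
    throughput = min(graph[u][v]["capacity"] for u, v in links)
    return w_delay * delay + w_jitter * jitter + w_thr * (10000.0 / throughput)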

Page 32

Proposed IoT Architecture

• Proposed IoT Architecture Evaluation
  • On a campus-wide network
    • Data transfers: 8% throughput increase
    • Tele-audio flows: 51-71% reduction in end-to-end delay
    • Video streaming: 32-67% less jitter

Fig. 8. Performance Comparisons Results: (a) end-to-end throughput in Kbps for file sharing flows, (b) end-to-end delay in seconds for tele audio flows, and (c) end-to-end jitter in seconds for video flows, comparing Bin Packing, Load Balance, and the Proposed Algorithm across flow IDs.

[0.01, 0.1] seconds. Tele audio and video streaming flows are from real traffic traces [26], [27].

In our GA-based flow scheduling algorithm, we initially choose two paths for each flow as parents. Under this specific network topology, we choose the path generated by the load balance algorithm as one of the parents; then, we determine the other parent by exchanging the current core route with the alternative one (we have two core routers). We argue that the file sharing service requires large throughput, the tele audio service requires low delay, while the video streaming service requires low jitter. Since the QoS requirements wd, wj, wt mentioned in Section V-B2 highly depend on the user experience, the audio/video codec, the buffer size in the end devices, etc., we do not set any particular QoS requirement in this simulation-based experiment. Instead, we try to optimize the QoS performance (maximize throughput, minimize delay and jitter) within a predefined number of generations (we set 10 generations here). Hence we slightly change the fitness value in equation (7) to α·xd + β·xj + ω·(10000/xt): for file sharing flows (α, β, ω) = (0, 0, 1), for tele audio flows (α, β, ω) = (1, 0, 0), and for video streaming flows (α, β, ω) = (0, 1, 0).

We have a total of 45 flows (each of the 45 end devices has one flow): flows 1-21 are file sharing, flows 22-36 are tele audio, and flows 37-45 are video streaming. Fig. 8(a) shows the flow throughput comparison. For file sharing flows, the load balance algorithm outperforms the bin packing algorithm, while our proposed algorithm has an average 8% throughput increase compared with the load balance algorithm. The reason is that in wireless links, when link utilization exceeds a threshold, the packet drop rate increases dramatically, as indicated in [33]. Fig. 8(b) shows that for tele audio flows, our proposed algorithm can improve the end-to-end delay performance by 51% and 71%, compared to the load balance and bin packing algorithms respectively. However, the other two types of flows suffer approximately the same delay under these three algorithms. We argue the reason is that tele audio flows have bursty traffic patterns; they might not have a big data volume, but if two flows with a similar bursty pattern are scheduled on the same link, a large delay occurs. That is why tele audio flows have poor delay performance under the bin packing and load balance algorithms. Fig. 8(c) shows that video streaming flows have an average 32% and 67% less jitter with our proposed algorithm than with the other two algorithms. Two observations can be made here: a) video streaming flows have a better overall jitter performance than tele audio ones; b) our proposed algorithm has almost the same throughput and delay performance on video streaming flows as the other two algorithms. The reason is that video streaming flows have variable packet lengths but an almost constant inter-packet interval. Hence, if the interfering flows also have a stable inter-packet interval, the jitter should be low. In fact, our proposed algorithm schedules more video streaming flows together with file sharing flows (more stable inter-packet interval) than with tele audio flows (variable inter-packet interval).

Extra flow-entry message overhead exists at the beginning of the experiments. Since we assume a one-time flow scheduling and that flows are stable once initialized, we do not examine how the extra message overhead affects network performance. However, enabling online scheduling with dynamic flow admission is one of our future work directions.

VII. CONCLUSIONS

In this paper, we have presented an original SDN controller design for IoT Multinetworks whose central, novel feature is a layered architecture that enables flexible, effective, and efficient management of tasks, flows, networks, and resources. We gave a novel vision of tasks and resources in IoT environments, and illustrated how we bridge the gap between abstract high-level tasks and specific low-level network/device resources. A variant of the Network Calculus model is developed to accurately estimate end-to-end flow performance in IoT Multinetworks, which further serves as the foundation of a novel multi-constraint flow scheduling algorithm under heterogeneous traffic patterns and network links. Simulation-based validations have shown that our proposed flow scheduling algorithm performs better than existing ones. We are currently in the process of integrating this layered controller design with our MINA software stack in a large IoT electrical vehicular network testbed [2], and developing more secure, sophisticated tools to assist on-the-fly resource provisioning and network control.

What we have realized is that the layered controller design is critical to the management of heterogeneous IoT Multinetworks. The techniques applied at each layer could be different: in our design, the semantic modeling approach performs resource matching and the GA-based algorithm schedules flows. Those techniques can be viewed as plug-ins and can be adjusted or replaced in different IoT scenarios. We strongly believe that our novel layered controller architecture, which inherently supports heterogeneity and flexibility, is of primary importance for efficiently managing IoT Multinetworks.

Image: [4]

Page 33

Conclusion

• SDN for heterogeneous, wireless, mobile networks is still very new.
  • The papers cited are from 2014/2015.
  • I just got back from the conference at which one of them was presented.
• They all note a lack of prior work in autonomic resiliency and address aspects of the same issue.
  • Combining their work would make a much stronger case for earlier adoption of SDN by IoT.
• The papers all lack thorough evaluations, with several lacking even rudimentary ones.
  • The next step is a common testbed for evaluation and rigorous analysis.

"At  the  South  Pole,  December  1911"  by  Olav  Bjaaland (1863-­‐1961)[1]   -­‐ Cropped  photograph  from  Amundsen,   Roald:  The  South  Pole,  Vol.  II,  first  published   by  John  Murray,  London  1913.  Photo  facing  page  134.  Licensed  under  PD-­‐US  via  Wikipedia   -­‐https://en.wikipedia.org/wiki/File:At_the_South_Pole,_December_1911.jpg#/media/File:At_the_South_Pole,_December_1911.jpg

Page 34

Conclusion


Image: John Walker (Founder of AutoDesk, Co-Author of AutoCAD), http://www.fourmilab.ch/images/antarctica_2013/S015.html

Page 35

Primary References

[1] A. Markiewicz, P. N. Tran, and A. Timm-Giel, "Energy consumption optimization for software defined networks considering dynamic traffic," in Cloud Networking (CloudNet), 2014 IEEE 3rd International Conference on, pp. 155–160, Oct. 2014.

[2] P. Thorat, S. M. Raza, D. T. Nguyen, G. Im, H. Choo, and D. S. Kim, “Optimized self-healing framework for software defined networks,” in Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, IMCOM ’15, (New York, NY, USA), pp. 7:1–7:6, ACM, 2015.

[3] M. P. da Silva, M. A. Dantas, A. L. Gonçalves, and A. R. Pinto, "A managing QoE approach for provisioning user experience aware services using SDN," in Proceedings of the 11th ACM Symposium on QoS and Security for Wireless and Mobile Networks, Q2SWinet '15, (New York, NY, USA), pp. 51–58, ACM, 2015.

[4] Z. Qin, G. Denker, C. Giannelli, P. Bellavista, and N. Venkatasubramanian, “A software defined networking architecture for the internet-of-things,” in Network Operations and Management Symposium (NOMS), 2014 IEEE, pp. 1–9, May 2014.

[5] N. Dorsch, F. Kurtz, H. Georg, C. Hagerling, and C. Wietfeld, “Software-defined networking for smart grid communications: Applications, challenges and advantages,” in Smart Grid Communications (SmartGridComm), 2014 IEEE International Conference on, pp. 422–427, Nov 2014.