Scuola Superiore Sant’Anna QoS-Aware Fault Tolerance in Grid Computing L. Valcarenghi, F. Cugini, F. Paolucci, and P. Castoldi Scuola Superiore Sant’Anna and CNIT Pisa, Italy Workshop on Reliability and Robustness in Grid Computing Systems GGF16 Feb. 13-16, 2006, Athens, Greece



Page 1:

Scuola Superiore Sant’Anna

QoS-Aware Fault Tolerance in Grid Computing

L. Valcarenghi, F. Cugini, F. Paolucci, and P. Castoldi
Scuola Superiore Sant’Anna and CNIT

Pisa, Italy

Workshop on Reliability and Robustness in Grid Computing Systems
GGF16

Feb. 13-16, 2006, Athens, Greece

Page 2:

Outline

• Grid fault tolerance
• Quality of Service (QoS) aware fault tolerance
• Integrated QoS-aware fault tolerance
• Performance evaluation
• Conclusions

Page 3:

Approaches for Grid Fault Tolerance

[Figure: the layered Grid architecture mapped onto the TCP/IP stack and the fault-tolerant schemes available at each layer.]

Layered Grid architecture:
• Application: end-user applications
• Collective: collective resource control
• Resource: resource management
• Connectivity: inter-process communication, protection
• Fabric: basic hardware and software

TCP/IP stack and fault-tolerant schemes:
• Application/Middleware: tasks and data replicas; application-specific fault-tolerant schemes based on middleware fault detection (Condor-G checkpointing, migration, and DAGMan; GT2/GT3 GridFTP, Reliable File Transfer (RFT), Replica Location Service (RLS))
• Transport: Fault Tolerant TCP (FT-TCP)
• Internet/Network and Link: delegated to WAN, MAN, and LAN resilience schemes
• Fabric: delegated to HW, SW, and farm failover schemes

Page 4:

QoS-Aware Fault Tolerance

• Def. QoS-aware fault tolerance:
  – The capability of overcoming both hardware and software failures while maintaining communication QoS guarantees
• Def. QoS-capable layer:
  – A Grid layer capable of guaranteeing communication QoS
• Example:
  – Upon failure of the main data center, bandwidth-guaranteed connectivity must be provided to the replica data center

Page 5:

QoS Awareness in Grid Fault Tolerance

[Figure: QoS awareness of each layer.]
• QoS-unaware: the Grid layers (Application, Collective, Resource, Connectivity, Fabric) and the Application, Middleware, and TCP/IP protocol layers
• QoS-capable: MPLS, SONET/SDH, and the Optical Transport Network (OTN)

Page 6:

Application/Middleware Fault Tolerance

• Advantages
  – Network-layer independent
  – Flexible (e.g., degree of failover dependent on the application)
• Drawbacks
  – Application dependent
  – User driven
  – Need for TCP synchronization
  – Slow reaction to failures
  – Not scalable (e.g., CPU and storage)
  – No communication QoS guarantees

Page 7:

QoS Capable TCP/IP

• Dynamic rerouting with DiffServ
• Advantages
  – Pervasiveness
• Drawbacks
  – No traffic engineering
• Example
  – After rerouting, packets with the same Type of Service (ToS) compete for the same insufficient resources along the shortest path

Page 8:

QoS Aware Fault Tolerance Below Layer 3

• QoS-aware fault tolerance through connection-oriented communication in a QoS-capable layer:
  – Multi-Protocol Label Switching (MPLS): Label Switched Paths (LSPs)
  – SONET/SDH: SONET/SDH paths
  – OTN: lightpaths (i.e., wavelength channels)

Page 9:

Implementing Multi-layer QoS Aware Fault Tolerance

• Assumption
  – Grid computing services requiring communication QoS guarantees (e.g., collaborative visualization)
  – QoS parameter: minimum bandwidth
• Objective
  – Maximize recovered connections and minimize required network resources upon network link failure
• Proposed approach
  – Integrating QoS-unaware layer and QoS-capable layer fault tolerance ⇒ QoS-aware integrated fault tolerance
  – QoS-capable layer fault tolerance: (G)MPLS path restoration
  – Software layer fault tolerance: service replication (server migration)

Page 10:

Network Layer Fault Tolerance (Path Restoration) Issue: Blocking

[Figure: the client reaches the primary video server over a primary LSP; after a link failure the backup LSP cannot be set up due to insufficient capacity, so path restoration is blocked.]

Page 11:

Software Layer Fault Tolerance Issue: Wasted Network Capacity

[Figure: the client is served by the backup video server over a dedicated LSP; pre-provisioning this LSP implies 1:1 path protection and wastes network capacity.]

Page 12:

Non-Integrated Fault Tolerance Issue: Unnecessary Server Migration When a Backup LSP Is Available

[Figure: in the absence of inter-layer coordination, the service migrates to the backup video server even though a backup LSP to the primary video server is available.]

Page 13:

Integrated Fault Tolerance Advantages: Path Restoration + Service Replication

[Figure: the integrated scheme combines path restoration (backup LSP to the primary video server) with service replication (LSP to the backup video server), so each recovery action is used only when needed.]

Page 14:

Simulation Scenario

• Physical network
  – 100 randomly generated connection matrices
  – Bidirectional connection generation
  – Bidirectional connection rerouting
• Expected network blocking probability
  – Average ratio between the number of unrecovered connections and the number of failed connections
• Expected replica utilization ratio
  – Average ratio between the number of locations utilized for service replication and the number of locations allowed for service replication
• Expected path restoration utilization
  – Average number of times the original server location node is utilized as replica location, normalized to the number of replica locations utilized
• Expected connection path length
  – Average length of connection paths to reach the replica location
• Evaluation scenarios
  – Limited number of replicas: per location; per failed connection between an (s,d) pair
  – Limited distance (hops) of allowed replica locations
  – Minimum required replication flow
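The expected-value metrics above are averages of per-trial ratios over the randomly generated connection matrices. A minimal sketch of that bookkeeping (all variable names and sample counts are illustrative, not from the presentation):

```python
# Sketch of the slide's expected-value metrics, averaged over simulation
# trials. Names and sample numbers are illustrative only.

def expected_ratio(pairs):
    """Average of per-trial ratios numerator / denominator."""
    ratios = [num / den for num, den in pairs if den > 0]
    return sum(ratios) / len(ratios)

# One (numerator, denominator) pair per randomly generated connection matrix:
# (unrecovered connections, failed connections)
blocking_trials = [(2, 10), (0, 10), (1, 10)]
# (replica locations utilized, replica locations allowed)
replica_trials = [(3, 5), (4, 5), (2, 5)]

blocking_probability = expected_ratio(blocking_trials)  # ≈ 0.1
replica_utilization = expected_ratio(replica_trials)    # ≈ 0.6

print(blocking_probability, replica_utilization)
```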

Page 15:

Integrated Restoration Performance

• Integrated restoration outperforms OSPF dynamic-rerouting resilience

• Integrated restoration performs as well as service-migration resilience but, by utilizing path restoration, decreases the need for service synchronization and restart

Page 16:

Integrated Multi-layer QoS Aware Fault Tolerance

[Figure: recovery message sequence.]
• Network restoration: connection rerouting
• Integrated server migration:
  – Notification to the ingress router; connection tear-down by the ingress and transit routers
  – Notification to the Service Agent
  – The Service Agent sets up a new connection from another server
  – The client switches over
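The two recovery branches can be sketched as a coordination routine that tries path restoration first and falls back to server migration. This is a minimal sketch of the control flow only; the classes and method names below are hypothetical, not the authors' implementation:

```python
# Sketch of the integrated multi-layer recovery decision. All classes and
# method names are hypothetical illustrations.

class Network:
    def __init__(self, backup_capacity_ok):
        self.backup_capacity_ok = backup_capacity_ok

    def reroute(self, conn):
        # Path restoration succeeds only if the backup LSP has enough capacity.
        return self.backup_capacity_ok

    def tear_down(self, conn):
        print(f"tearing down LSP for {conn}")

class ServiceAgent:
    def setup_from_backup(self, conn):
        # Set up an LSP from the backup server.
        return f"backup-LSP for {conn}"

def recover(conn, network, agent):
    """Try path restoration first; migrate the service only if it fails."""
    if network.reroute(conn):
        return "network restoration"  # backup LSP carries the stream
    network.tear_down(conn)           # ingress/transit routers tear down
    agent.setup_from_backup(conn)     # agent migrates to the backup server
    return "server migration"         # client switches over

print(recover("video-stream", Network(True), ServiceAgent()))
print(recover("video-stream", Network(False), ServiceAgent()))
```

The ordering encodes the slide's point: migration is the fallback, avoiding unnecessary server migration when a backup LSP is available.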

Page 17:

Rerouting after connection preemption

• Stream carried on a guaranteed-bandwidth LSP
• New LSP set up at higher priority
• LSP carrying the stream is preempted and rerouted on a new path
• Throughput at the client falls during LSP rerouting (100 ÷ 900 ms interruption)
• Migration is not needed because LSP rerouting is fast enough

Page 18:

Migration after connection preemption (integrated scheme)

• The agent sets up an LSP from the primary server
• The agent periodically checks the LSP status
• The LSP carrying the stream is preempted
• The preemption is detected by the agent
• The agent sets up an LSP from the backup server
• As soon as the LSP is up, the ingress router notifies the agent
• The agent then notifies the client of the server migration

Recovery takes about 2-4 seconds to complete:
• Less than 1 second (polling period) for fault detection
• 2-3 seconds for network reconfiguration
• Network latency is negligible
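The polling-based detection step bounds fault-detection time by the polling period (under 1 second in the slide). A sketch of that loop, with a hypothetical `lsp_is_up` probe standing in for the agent's actual status check:

```python
import time

# Sketch of the agent's polling-based fault detection. `lsp_is_up` is a
# hypothetical probe; the 1 s default matches the slide's polling period.

def detect_preemption(lsp_is_up, poll_period=1.0, max_polls=10):
    """Poll the LSP status; return the poll count at which failure is seen."""
    for poll in range(1, max_polls + 1):
        if not lsp_is_up():
            return poll          # preemption detected within poll_period of it
        time.sleep(poll_period)
    return None                  # no failure observed within max_polls

# Simulated LSP that goes down after two successful polls.
status = iter([True, True, False])
print(detect_preemption(lambda: next(status), poll_period=0.0))  # 3
```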

Page 19:

Replica Placement Problem

• Objective
  – Utilize service replication to guarantee a high percentage of recovered inter-service connections
  – Limit the number of allowed replica locations to optimize utilized computational resources
  – Utilize simple and efficient heuristics for replica placement
• Proposed approach
  – Place replicas in nodes adjacent to the original service location
  – Form clusters of nodes with the same service replicas (i.e., service islands)
• Expected results
  – Improved service inter-connectivity recovery
  – High replica utilization ratio

Page 20:

HOP Heuristic

• Place a replica in all the nodes reachable within H hops of the original service location

[Figure: H=1 example; the server has nodal degree 4 and replicas are placed at its four adjacent nodes.]
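The HOP heuristic amounts to a breadth-first search bounded at H hops. A minimal sketch; the adjacency-list graph below is an illustrative example, not the topology from the presentation:

```python
from collections import deque

# Sketch of the HOP replica-placement heuristic: place a replica in every
# node reachable within H hops of the original service location.

def hop_placement(graph, server, h):
    """Return the set of replica locations within h hops of server."""
    dist = {server: 0}
    queue = deque([server])
    while queue:
        node = queue.popleft()
        if dist[node] == h:
            continue  # do not expand beyond H hops
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return {n for n, d in dist.items() if 0 < d <= h}

# Server "s" with nodal degree 4, as in the slide's H=1 example.
graph = {
    "s": ["a", "b", "c", "d"],
    "a": ["s", "e"], "b": ["s"], "c": ["s"], "d": ["s"], "e": ["a"],
}
print(sorted(hop_placement(graph, "s", 1)))  # ['a', 'b', 'c', 'd']
```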

Page 21:

Super-Node Degree Heuristic

• Place a replica in all the nodes reachable within H hops of the original service location that increment the nodal degree of the previous step's super-node

[Figure: H=1 example; a neighbor is accepted only if it raises the super-node's nodal degree from 4 to 5, otherwise it is rejected ("NO").]
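One way to read the acceptance rule above: grow a super-node around the server, admitting an H-hop candidate only if merging it increases the super-node's nodal degree. A greedy sketch under that reading (the exact rule in the slides may differ; graph and names are illustrative):

```python
# Greedy sketch of the Super-Node Degree heuristic: accept a candidate only
# if it raises the super-node's nodal degree (the slide's 4 -> 5 step);
# rejected candidates correspond to the slide's "NO" nodes.

def supernode_degree(graph, supernode):
    """Number of edges from the super-node set to nodes outside it."""
    return sum(1 for n in supernode for v in graph[n] if v not in supernode)

def supernode_placement(graph, server, h):
    """Return replica locations accepted within h hops of server."""
    supernode = {server}
    frontier = [server]
    for _ in range(h):
        accepted = []
        for node in frontier:
            for cand in graph[node]:
                if cand in supernode:
                    continue
                if (supernode_degree(graph, supernode | {cand})
                        > supernode_degree(graph, supernode)):
                    supernode.add(cand)   # degree grew: place a replica
                    accepted.append(cand)
                # otherwise the candidate is rejected ("NO")
        frontier = accepted
    return supernode - {server}

# Server "s" has nodal degree 4; only the degree-3 neighbor "a" raises the
# super-node degree to 5, so the leaf nodes "b", "c", "d" are rejected.
topology = {
    "s": ["a", "b", "c", "d"],
    "a": ["s", "e", "f"], "b": ["s"], "c": ["s"], "d": ["s"],
    "e": ["a"], "f": ["a"],
}
print(supernode_placement(topology, "s", 1))  # {'a'}
```

Compared with HOP, this filter is what yields the better replica utilization ratio reported on the next slide: fewer replicas are placed for a similar blocking probability.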

Page 22:

HOP and Super-node Degree

• H=∞ and α=0 yield a lower bound on the expected network blocking probability
• For both α=0 and α=1:
  – The Super-Node Degree and HOP heuristics show similar expected restoration blocking probability
  – The Super-Node Degree heuristic achieves a better expected replica utilization ratio

Parameters:
• H: number of hops from the destination within which candidate service-island nodes are considered
• α=0: no capacity required for replica update (static replica placement)
• α=1: full capacity (same as the client-server connection) required for replica update (dynamic replica placement)

Page 23:

Summary

• Need for QoS-aware fault tolerance in Grid computing
  – Not only recovering connectivity but also maintaining communication QoS parameters
• Proposed implementation
  – Integration of application/middleware fault-tolerant schemes (e.g., replication) with QoS-capable layer (i.e., below layer 3) fault tolerance (e.g., LSP restoration)
• Further work
  – QoS-aware fault tolerance as a grid service
  – Other QoS parameters

Page 24:

E-mail: [email protected] (addresses redacted; domain sssup.it)

Sant’Anna School & CNIT, CNR research area, Via Moruzzi 1, 56124 Pisa, Italy