
Page 1: Surviving Failures in Bandwidth Constrained Datacenters

Surviving Failures in Bandwidth Constrained Datacenters

Authors: Peter Bodik, Ishai Menache, Mosharaf Chowdhury, Pradeepkumar Mani, David A. Maltz, Ion Stoica

Presented by: Sneha Arvind Mani

Page 2: Surviving Failures in Bandwidth Constrained Datacenters

OUTLINE
◦ Introduction
◦ Motivation and Background
◦ Problem Statement
◦ Algorithmic Solutions
◦ Evaluation of the Algorithms
◦ Related Work
◦ Conclusion

Page 3: Surviving Failures in Bandwidth Constrained Datacenters

Introduction

The main goals of this paper:
◦ Improve the fault tolerance of deployed applications.
◦ Reduce bandwidth usage in the network core.

How? By optimizing the allocation of applications to physical machines.
• Both of the above problems are NP-hard.
• The authors therefore formulate a related convex optimization problem that:
◦ Incentivizes spreading the machines of each service across fault domains.
◦ Adds a penalty term for machine reallocations that increase bandwidth usage.

Page 4: Surviving Failures in Bandwidth Constrained Datacenters

Introduction (2)

Their algorithms achieve a 20%-50% reduction in bandwidth usage while improving worst-case survival by 40%-120%.

Improvement in fault tolerance: the fraction of services affected by potential hardware failures is reduced by up to a factor of 14.

The contribution of this paper is three-fold:
◦ Measurement study
◦ Algorithms
◦ Methodology

Page 5: Surviving Failures in Bandwidth Constrained Datacenters

Motivation and Background

Bing.com: a large-scale web application running in multiple datacenters around the world.

Some definitions used in this paper:
◦ Logical machine: the smallest logical component of a web application.
◦ Service: a set of many logical machines executing the same code.
◦ Environment: consists of many services.
◦ Physical machine: a physical server that can run a single logical machine.
◦ Fault domain: a set of physical machines that share a single point of failure.

Page 6: Surviving Failures in Bandwidth Constrained Datacenters

Communication Patterns

Tracing the communication between all pairs of servers, aggregated for each pair of services i and j, shows that the datacenter network core is highly utilized.

The traffic matrix is very sparse: only 2% of service pairs communicate at all.

link utilization                       >50%    >60%    >70%    >80%
aggregate months above utilization    115.7    47.5    18.3     6.2

Page 7: Surviving Failures in Bandwidth Constrained Datacenters

Communication Patterns (2)

The communication pattern is very skewed: 0.1% of the communicating services generate 60% of all traffic, and 4.8% of service pairs generate 99% of the traffic.

Services that do not require a lot of bandwidth can be spread across the datacenter, improving their fault tolerance.

Page 8: Surviving Failures in Bandwidth Constrained Datacenters

Communication Patterns (3)

The majority of the traffic stays local: 45% stays within the same service, 23% leaves the service but stays within the same environment, and 23% crosses environments.

The median service talks to nine other services. Communicating services form both small and large connected components.

Page 9: Surviving Failures in Bandwidth Constrained Datacenters

Failure Characteristics

Networking hardware failures cause significant outages.

Redundancy reduces the impact of failures on lost bytes by only 40%.

Power fault domains create non-trivial failure patterns.

Implications for the optimization framework: it has to consider the complex patterns of the power and networking fault domains, instead of simply spreading the services across several racks, to achieve good fault tolerance.

Page 10: Surviving Failures in Bandwidth Constrained Datacenters

Problem Statement

Metrics:
◦ Bandwidth (BW): the sum of the rates on the core links; the overall measure of bandwidth usage at the core of the network.
◦ Fault Tolerance (FT): the average of the Worst-Case Survival (WCS) across all services.
◦ Number of Moves (NM): the number of servers that have to be re-imaged to get from the initial datacenter allocation to the proposed allocation.

Optimization:
Maximize FT − α·BW
subject to NM ≤ N_0

where α is a tunable positive parameter and N_0 is an upper limit on the number of moves.

Page 11: Surviving Failures in Bandwidth Constrained Datacenters

Algorithmic Solutions

The solution roadmap is as follows:
◦ Cells: subsets of physical machines that belong to exactly the same fault domains. Grouping machines into cells reduces the size of the optimization problem (see the sketch after this list).
◦ Fault Tolerance Cost (FTC) is a convex function, hence minimizing FTC improves FT.
◦ Their method to optimize BW is to perform a minimum k-way cut on the communication graph.
◦ CUT+FT+BW consists of two phases: a minimum k-way cut to compute an initial assignment that minimizes bandwidth at the network core, then iteratively moving machines to improve FT.
◦ FT+BW does not perform a graph cut but starts with the current allocation and improves performance by greedy moves that reduce a weighted sum of BW and FTC.
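A minimal sketch of the cell construction described above, assuming `machines` is a list of physical machines and `fault_domains` a list of machine sets; the names are illustrative, not the paper's code.

```python
from collections import defaultdict

def build_cells(machines, fault_domains):
    """Group physical machines that belong to exactly the same set of
    fault domains into cells, so the optimizer can reason over a few
    cells instead of many thousands of individual machines."""
    cells = defaultdict(list)
    for m in machines:
        # A cell is identified by the exact set of fault domains
        # (power, networking, ...) that the machine belongs to.
        signature = frozenset(j for j, domain in enumerate(fault_domains)
                              if m in domain)
        cells[signature].append(m)
    return list(cells.values())
```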

Page 12: Surviving Failures in Bandwidth Constrained Datacenters

Formal Definitions

I is the indicator function: I(n_1, n_2) = 1 if traffic from machine n_1 to machine n_2 traverses a core link, and I(n_1, n_2) = 0 otherwise.

Bandwidth is given by:

BW = \sum_{n_1, n_2} I(n_1, n_2) \, B_{k_1, k_2}

where B_{k_1, k_2} is the required bandwidth between a pair of machines from services k_1 and k_2.

To define FT, let z_{k,j} be the total number of machines allocated to service k that are affected by fault j, and let m_k be the total number of machines of service k. FT is the average worst-case survival:

FT = \frac{1}{K} \sum_{k=1}^{K} \left( 1 - \frac{\max_j z_{k,j}}{m_k} \right)

where K is the total number of services.
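A minimal Python sketch (not from the paper) of how these two metrics could be computed; the data structures `alloc`, `service`, `traffic`, `crosses_core`, and `fault_domains` are assumptions made for illustration.

```python
# Hypothetical data structures, assumed for illustration:
#   alloc[m]          -> physical machine hosting logical machine m
#   service[m]        -> service that logical machine m belongs to
#   traffic[(k1,k2)]  -> required BW between a machine pair of services k1, k2
#   crosses_core      -> the indicator I(n1, n2) from the slide above
#   fault_domains     -> list of sets of physical machines sharing a failure point

def bandwidth(alloc, service, traffic, crosses_core):
    """BW: sum of required rates over all machine pairs whose traffic
    traverses a core link."""
    return sum(traffic.get((service[m1], service[m2]), 0.0)
               for m1 in alloc for m2 in alloc
               if m1 != m2 and crosses_core(alloc[m1], alloc[m2]))

def fault_tolerance(alloc, service, fault_domains):
    """FT: average worst-case survival (WCS) across all services."""
    sizes = {}                                  # m_k for each service k
    for m in alloc:
        sizes[service[m]] = sizes.get(service[m], 0) + 1
    total = 0.0
    for k, m_k in sizes.items():
        # z_{k,j}: machines of service k hit by fault domain j
        worst = max(sum(1 for m in alloc
                        if service[m] == k and alloc[m] in dom)
                    for dom in fault_domains)
        total += 1.0 - worst / m_k              # WCS of service k
    return total / len(sizes)
```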

Page 13: Surviving Failures in Bandwidth Constrained Datacenters

Formal Definitions (2)

The Fault Tolerance Cost (FTC) is given by:

FTC = \sum_{k=1}^{K} \sum_{j} b_k \, w_j \, z_{k,j}^2

where b_k and w_j are positive weights assigned to services and faults, respectively.

A decrease in FTC should increase FT: squaring the z_{k,j} variables incentivizes keeping their values small, which is achieved by spreading the machine assignment across multiple fault domains.

The minimization of BW is based on a minimum k-way cut, which partitions the logical machines into a given number of clusters.
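A small sketch, with hypothetical unit weights, showing why the squared term rewards spreading: for a fixed number of machines, the sum of squares is smallest when the z_{k,j} values are equal.

```python
def ftc(z, b, w):
    """FTC = sum over services k and faults j of b_k * w_j * z_{k,j}^2."""
    return sum(b[k] * w[j] * z[k][j] ** 2
               for k in range(len(z))
               for j in range(len(z[k])))

# One service, ten machines, five fault domains, all weights 1:
print(ftc([[10, 0, 0, 0, 0]], [1], [1] * 5))   # concentrated -> 100
print(ftc([[2, 2, 2, 2, 2]],  [1], [1] * 5))   # spread out   -> 20
```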

Page 14: Surviving Failures in Bandwidth Constrained Datacenters

Algorithms to improve both BW & FT

CUT+FT: apply CUT in the first phase, then minimize FTC in the second phase using machine swaps.

CUT+FT+BW: as above, but in the second phase a penalty term for bandwidth is added; each swap is scored by ΔFTC + αΔBW, where α is the weighting factor.

NM-aware algorithm:
FT+BW: start with the initial allocation and run only the second phase of CUT+FT+BW.
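A minimal sketch of the shared second phase, assuming hypothetical helpers `delta_ftc` and `delta_bw` that return the change in FTC and core bandwidth caused by swapping the placements of two logical machines; this illustrates the greedy rule above, not the authors' implementation.

```python
def greedy_phase(alloc, candidate_swaps, delta_ftc, delta_bw, alpha,
                 move_budget):
    """Repeatedly apply the machine swap with the most negative
    delta_FTC + alpha * delta_BW; stop at the move budget or when no
    swap improves the weighted objective (steepest descent)."""
    moves = 0
    while moves < move_budget:
        best, best_score = None, 0.0
        for m1, m2 in candidate_swaps(alloc):
            score = delta_ftc(alloc, m1, m2) + alpha * delta_bw(alloc, m1, m2)
            if score < best_score:
                best, best_score = (m1, m2), score
        if best is None:
            break                                # local optimum reached
        m1, m2 = best
        alloc[m1], alloc[m2] = alloc[m2], alloc[m1]
        moves += 2                               # a swap re-images two servers
    return alloc
```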

Page 15: Surviving Failures in Bandwidth Constrained Datacenters

Scaling to Large Datacenters

CUT+RandLow: an algorithm that directly exploits the skewness of the communication matrix. Apply CUT in the first phase; determine the subset of services whose aggregate bandwidth is lower than the others', then randomly permute the machine allocation of all services belonging to that subset.

To scale to large datacenters, a large number of candidate swaps is sampled and the one that most improves FTC is chosen (see the sketch below).

Also, during the graph cut, logical machines of the same service are grouped into a smaller number of representative nodes.
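A sketch of that sampling step, again with the hypothetical `delta_ftc` helper; it scores a fixed number of random machine pairs instead of all possible swaps.

```python
import random

def sampled_best_swap(alloc, delta_ftc, sample_size=10_000):
    """Sample candidate swaps and return the one that most improves FTC."""
    machines = list(alloc)
    best, best_delta = None, 0.0
    for _ in range(sample_size):
        m1, m2 = random.sample(machines, 2)     # a random candidate swap
        d = delta_ftc(alloc, m1, m2)
        if d < best_delta:
            best, best_delta = (m1, m2), d
    return best
```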

Page 16: Surviving Failures in Bandwidth Constrained Datacenters

Evaluation of Algorithms

CUT+FT+BW: when ignoring the limit on server moves, it achieves a 30%-60% reduction in BW usage while improving FT by 40%-120%.

FT+BW is close to CUT+FT+BW: FT+BW performs only steepest-descent moves. It can be used in scenarios where the number of concurrent server moves is limited.

The random allocation in CUT+RandLow works well because many services transfer relatively little data, so they can be spread randomly across the datacenter.

Page 17: Surviving Failures in Bandwidth Constrained Datacenters

Methodology to Evaluate

The following information is needed to perform the evaluation:
◦ Network topology of a cluster.
◦ Services running in the cluster and the list of machines required for each service.
◦ List of fault domains and the machines in each fault domain.
◦ Traffic matrix for the services in the cluster.

The algorithms are compared on their entire achievable tradeoff boundaries rather than at single performance points.

Page 18: Surviving Failures in Bandwidth Constrained Datacenters

Comparing Different Algorithms

The solid circles represent the FT and BW of the starting allocation (at the origin), after BW-only optimization (bottom-left corner), and after FT-only optimization (top-right corner).

Page 19: Surviving Failures in Bandwidth Constrained Datacenters

Optimizing for both BW and FT

◦ Artificially partitioning each service into several subgroups did not lead to satisfactory results.
◦ Augmenting the cut procedure with "spreading" requirements for services did not scale to large applications.

CUT+FT: the graph is plotted for an increasing number of server swaps. By changing the number of swaps, the tradeoff between FT and BW can be controlled.

The FTC formulation is convex, so performing steepest descent until convergence leads to the global minimum with respect to fault tolerance.

Page 20: Surviving Failures in Bandwidth Constrained Datacenters

Optimizing for both BW and FT (2)

CUT+FT+BW: depends on α. The higher the value of α, the more weight is placed on improving BW at the cost of not improving FT. It does not optimize over a convex function, so it is not guaranteed to reach the global optimum.

CUT+RandLow: performs close to CUT+FT+BW, but it optimizes neither the BW of the low-talking services nor the FT of the high-talking ones.

Page 21: Surviving Failures in Bandwidth Constrained Datacenters

These graphs show the tradeoff boundary between FT and BW for the different algorithms across three more datacenters.

Page 22: Surviving Failures in Bandwidth Constrained Datacenters

Optimizing for BW, FT and NM

Significant improvements are achieved by moving just 5% of the cluster. Moving 29% of the cluster achieves results similar to moving most of the machines using CUT+FT+BW.

Page 23: Surviving Failures in Bandwidth Constrained Datacenters

When run until convergence, FT+BW achieves results close to CUT+FT+BW even without the graph cut.

This is significant because it means FT+BW can be used incrementally and still reach performance similar to CUT+FT+BW, which reshuffles the whole datacenter.

Page 24: Surviving Failures in Bandwidth Constrained Datacenters

Improvements in FT & BW

For α = 0.1, FT+BW reduced BW usage by 26% while improving FT by 140%; FT was reduced for only 2.7% of the services, far fewer than for α = 1.0.

For α = 1.0, FT+BW reduced core BW usage by 47% and improved average FT by 121%.

Page 25: Surviving Failures in Bandwidth Constrained Datacenters

Additional Scenarios

◦ Optimization of bandwidth across multiple layers.
◦ Preparing for maintenance and online recovery.
◦ Adapting to changes in traffic patterns.
◦ Hard constraints on fault tolerance and placement.
◦ Multiple logical machines on a server.

Page 26: Surviving Failures in Bandwidth Constrained Datacenters

Related Work

◦ Datacenter traffic analysis
◦ Datacenter resource allocation
◦ Virtual network embedding
◦ High availability in distributed systems
◦ VPN and network testbed allocation

Page 27: Surviving Failures in Bandwidth Constrained Datacenters

Conclusion

The analysis shows that the communication volume between pairs of services has a long tail, with the majority of traffic generated by a small fraction of service pairs.

This allows the optimization algorithms to spread most of the services across fault domains without significantly increasing BW usage in the core.

Page 28: Surviving Failures in Bandwidth Constrained Datacenters

Thank You!