
Optimizing Network Performance in Distributed Machine Learning

Luo Mai, Chuntao Hong, Paolo Costa

Machine Learning

• Successful in many fields
• Online advertisement

• Spam filtering

• Fraud detection

• Image recognition

• …

• One of the most important workloads in data centers


Industry Scale Machine Learning

• More data, higher accuracy

• Scale of industry problems
• 100 billion samples, 1 TB – 1 PB of data
• 10 billion parameters, 1 GB – 1 TB of data
• Distributed execution across 100s – 1,000s of machines


Distributed Machine Learning

[Figure: workers W1–W4, each holding a model replica and a data partition]

Distributed Machine Learning

[Figure: each worker computes gradients on its own data partition, e.g., W1 + 0.1, W2 + 0.2, W3 – 0.3, W4 + 1.2 on one worker and W1 – 0.9, W2 + 0.5, W3 – 0.1, W4 – 0.5 on another]

Distributed Machine Learning

[Figure: workers push to a parameter server]
1. Push gradients
2. Aggregate the gradients for each parameter

Distributed Machine Learning

[Figure: the parameter server updates the model, and the workers pull it]
3. Add the gradients to the parameters: W1 + g1, W2 + g2, W3 + g3, W4 + g4
4. Pull the new parameters
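To make these steps concrete, here is a minimal Python sketch of one synchronous round with a single parameter server, reusing the toy gradient values from the earlier slide; the class and method names are illustrative and are not MLNet's or any particular framework's API.

```python
# Illustrative sketch of one synchronous training round (not a real framework API).

class ParameterServer:
    def __init__(self, weights):
        self.weights = dict(weights)                      # e.g. {"W1": 0.0, ...}
        self.pending = {k: 0.0 for k in weights}

    def push(self, gradients):
        # 2. Aggregate the gradient for each parameter.
        for name, g in gradients.items():
            self.pending[name] += g

    def apply(self):
        # 3. Add the aggregated gradients to the parameters.
        for name, g in self.pending.items():
            self.weights[name] += g
        self.pending = {k: 0.0 for k in self.pending}

    def pull(self):
        # 4. Workers pull the new parameters.
        return dict(self.weights)


ps = ParameterServer({"W1": 0.0, "W2": 0.0, "W3": 0.0, "W4": 0.0})
worker_gradients = [
    {"W1": 0.1, "W2": 0.2, "W3": -0.3, "W4": 1.2},    # worker A
    {"W1": -0.9, "W2": 0.5, "W3": -0.1, "W4": -0.5},  # worker B
]
for g in worker_gradients:        # 1. Each worker pushes its gradients.
    ps.push(g)
ps.apply()
new_weights = ps.pull()           # workers continue with the updated model
```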

Distributed Machine Learning

[Figure: parameters sharded across parameter servers, e.g., W1, W2 on one server and W3, W4 on another]
Use multiple parameter servers to avoid a single bottleneck
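A minimal sketch of how a worker's push can be split across multiple parameter servers, matching the layout in the figure above (W1, W2 on one server, W3, W4 on another); the partitioning function is an assumption made for illustration.

```python
# Illustrative parameter sharding across two parameter servers
# (assumed layout: W1/W2 on server 0, W3/W4 on server 1).

NUM_SERVERS = 2

def shard(name):
    """Return the index of the server that owns this parameter."""
    return 0 if name in ("W1", "W2") else 1

def split_push(gradients):
    """Split one worker's push into one smaller message per server."""
    per_server = [{} for _ in range(NUM_SERVERS)]
    for name, g in gradients.items():
        per_server[shard(name)][name] = g
    return per_server

messages = split_push({"W1": 0.1, "W2": 0.2, "W3": -0.3, "W4": 1.2})
# messages[0] == {"W1": 0.1, "W2": 0.2}, messages[1] == {"W3": -0.3, "W4": 1.2}
# Each server now receives only part of the push/pull traffic.
```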

Distributed Machine Learning

[Figure: workers, parameter servers, and the network bottleneck]

Inbound Congestion

[Figure: network core with inbound congestion]

Outbound Congestion

[Figure: network core with outbound congestion]

Network Core Congestion

[Figure: over-subscribed network core]
Congestion occurs in the core in the case of over-subscribed networks

Existing Approaches


• Over-provisioning the network with fast network H/W (e.g., InfiniBand and RoCE): expensive, limited deployment scale, not available in public clouds

Existing Approaches


• Over-provisioning the network with fast network H/W (e.g., InfiniBand and RoCE): expensive, limited deployment scale, not available in public clouds
• Asynchronous training algorithms: hurt training efficiency, might not converge

Rethinking the Network Design

[Figure: MLNet sits between the training algorithm and the network H/W]

MLNet is a communication layer designed for distributed machine learning systems

Improves communication efficiency

Orthogonal to existing approaches

Rethinking the Network Design


Optimizations:
• Traffic reduction
• Flow prioritization

Traffic Reduction


Aggregate the gradients from 6 workers

$g_1 = g_{11} + g_{12} + g_{13} + g_{14} + g_{15} + g_{16}$

Traffic Reduction: Key Insight

[Figure: six workers push gradients to one parameter server]

Aggregation is commutative and associative


$(g_{11} + g_{12} + g_{13}) + (g_{14} + g_{15} + g_{16})$

Aggregate the gradients from 6 workers

Traffic Reduction: Key Insight

Aggregating gradients incrementally does not change the final result
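A quick numeric check of this insight, with made-up gradient values: summing all six pushes at the PS and first summing two rack-local partial results give the same total.

```python
import math

# Gradients for one parameter from six workers (made-up values).
g = [0.1, 0.2, -0.3, 1.2, -0.9, 0.5]

direct = sum(g)                      # the PS aggregates all six pushes itself
partial = sum(g[:3]) + sum(g[3:])    # two local partial sums, then the PS adds them

assert math.isclose(direct, partial)  # identical up to floating-point rounding
```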

Traffic Reduction: Design


Intercept the push message from the worker to the PS

Traffic Reduction: Design


Redirect the messages to a local worker for partial aggregation

Traffic Reduction: Design


Send the partial results to the PS for final aggregation
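A minimal sketch of these three steps, assuming one designated aggregator per rack and a parameter server that exposes a push() method as in the earlier sketch; the names here are illustrative, not MLNet's interface.

```python
# Illustrative rack-local partial aggregation (not MLNet's actual API).
# One worker per rack acts as the local aggregator; the other workers'
# push messages are redirected to it instead of going straight to the PS.

class LocalAggregator:
    def __init__(self, ps, expected_pushes):
        self.ps = ps                       # remote parameter server (has .push())
        self.expected = expected_pushes    # pushes to wait for per round
        self.partial = {}
        self.received = 0

    def on_push(self, gradients):
        # Step 2: partially aggregate gradients from workers in this rack.
        for name, g in gradients.items():
            self.partial[name] = self.partial.get(name, 0.0) + g
        self.received += 1
        if self.received == self.expected:
            # Step 3: send one combined message to the PS for final aggregation.
            self.ps.push(self.partial)
            self.partial, self.received = {}, 0

def worker_push(gradients, local_aggregator):
    # Step 1: the push that would have gone to the PS is intercepted
    # and redirected to the rack-local aggregator.
    local_aggregator.on_push(gradients)

# Usage: agg = LocalAggregator(ps, expected_pushes=3); each in-rack worker
# then calls worker_push(gradients, agg) instead of pushing to the PS directly.
```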


More details in the paper:
1. Traffic reduction in pull requests
2. Asynchronous communication

Traffic Prioritization


Traffic Prioritization: Key Insight


[Figure: flows from Jobs 1–4 share a bottleneck link]
These four TCP flows share a bottleneck link, and each of them gets 25% of the link's bandwidth

Traffic Prioritization: Key Insight


[Figure: with fair sharing, the flows of Jobs 1–4 (Models 1–4) all complete at time 4; average flow completion time (FCT) is 4]
All flows are delayed! TCP per-flow fairness is harmful in distributed machine learning.

Traffic Prioritization: Key Insight


MLNet prioritizes the competing flows to minimize the average training time


Traffic Prioritization: Key Insight


[Figure: with prioritization, the flows of Jobs 1–4 (Models 1–4) complete one after another; average flow completion time is 2]

Shortening the average FCT can largely improve the average training time
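A back-of-the-envelope sketch of the effect, assuming four equal flows of one unit each on a one-unit-per-second bottleneck link (numbers chosen purely for illustration): fair sharing finishes every flow at the latest possible time, while serving the flows one at a time finishes most of them much earlier without delaying the last one.

```python
# Four equal flows competing for one bottleneck link (illustrative numbers).
flows = [1.0, 1.0, 1.0, 1.0]     # units of data to transfer
link_rate = 1.0                  # units per second

# TCP-style fair sharing: each flow gets 1/4 of the link, so all finish together.
fair_fct = [len(flows) * f / link_rate for f in flows]      # [4.0, 4.0, 4.0, 4.0]

# Strict prioritization: flows are served one after another on the full link.
priority_fct, t = [], 0.0
for f in flows:
    t += f / link_rate
    priority_fct.append(t)                                   # [1.0, 2.0, 3.0, 4.0]

avg_fair = sum(fair_fct) / len(fair_fct)
avg_priority = sum(priority_fct) / len(priority_fct)         # much lower on average
```

Under these assumptions, three of the four jobs get their gradients delivered earlier and the fourth is no worse off; the exact average depends on flow sizes and on how completion is measured, but this is why prioritizing flows shortens average training time.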

Evaluation

• Simulate a common data center network topology
• Classic 10 Gbps, 1024-node data center topology [Fat-Tree, SIGCOMM'08]
• Train large-scale logistic regression
• 65B parameters, 141 TB dataset [Parameter Server, OSDI'14]
• 800 workers [Parameter Server, OSDI'14]
• With a production trace
• Data processing rate: uniform(100, 200) MBps

• Synchronize every 30 seconds


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers (50–400); series: Rack, Baseline; fewer servers is cost-effective, more is expensive; lower training time is better]

Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Baseline]

Rack reduces completion time by 48%

Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Baseline]

Deploying more parameter servers resolves edge network bottlenecks


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Baseline]

Deploying more parameter servers to reduce training time (1) uses more machines and (2) is only possible with non-oversubscribed networks

Traffic Reduction (1:4 Oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Baseline]

MLNet reduces congestion in the network core.

Reduces training time by >70%

Traffic Prioritization

• 20 jobs running in the same cluster


[Chart: CDF of training time (hours) across the 20 jobs; series: Baseline, Prioritization]

Everyone finishes at (almost) the same time

Traffic Prioritization


[Chart: CDF of training time (hours); series: Baseline, Prioritization]

Improves the median by 25%

Delays the tail by 2%


Traffic Prioritization + Traffic Reduction


[Chart: CDF of training time (hours); series: Baseline, Reduction, Prioritization + Reduction]

Improves the median by 60%

Improves the tail by 54%



More details in the paper:
1. Binary tree aggregation
2. More analysis

Summary

• MLNet can significantly improve the network performance of distributed machine learning
• Traffic reduction

• Flow prioritization

• Drop-in solution


Thanks!


Discussion

• Relaxed fault tolerance?
• When a worker fails, drop that portion of the data
• Adaptive communication
• Reduce synchronization when the network is busy?
• Hybrid network infrastructure?
• Some machines with 10 GbE, some with 40 GbE RoCE, etc.
• Degree of the aggregation tree?


Traffic Reduction: Design

Is the local aggregator a new bottleneck?


Example: 15 workers in a rack

Traffic Reduction: Design

Build a balanced aggregation structure such as a binary tree.


Example: 15 workers in a rack, aggregated through a binary tree
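A minimal sketch of the balanced aggregation described above, assuming gradients are plain dictionaries merged pairwise level by level so that no single node has to absorb all 15 push messages; this illustrates the idea and is not the paper's exact scheme.

```python
# Illustrative binary-tree aggregation: gradients are combined pairwise,
# so each node only ever aggregates two messages at a time.

def merge(a, b):
    """Add two gradient dictionaries."""
    out = dict(a)
    for name, g in b.items():
        out[name] = out.get(name, 0.0) + g
    return out

def tree_aggregate(gradients):
    """Reduce a list of per-worker gradient dicts with a balanced tree."""
    level = list(gradients)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(merge(level[i], level[i + 1]))
        if len(level) % 2:                 # an odd node passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Example: 15 workers in a rack, each contributing a gradient for "W1".
worker_grads = [{"W1": 0.01 * i} for i in range(15)]
total = tree_aggregate(worker_grads)       # a single message is finally sent to the PS
```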


Traffic Reduction

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]

Binary Tree and Rack reduce completion time by 78% and 48%, respectively

Traffic Reduction (Non-oversubscribed Net.)


[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]

Deploying more parameter servers resolves edge network bottlenecks


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]

Deploying more parameter servers to reduce training time (1) needs more machines and (2) is only possible with non-oversubscribed networks


Traffic Reduction (1:4 Oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]

Traffic Reduction (1:4 Oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]

MLNet reduces congestion in the network core

Traffic Reduction (1:4 Oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]

Binary consistently consumes more bandwidth than Rack

Example: Training a Neural Network


[Figure: training loop. Randomly initialize the weights W: {w1, w2, w3, w4}; compare predictions against the truth {cat, dog, cat, …} to calculate the error/gradient G: {g1, g2, g3, g4}; update the weights to W': {w1', w2', w3', w4'}]
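A toy version of this loop for a linear model with four weights; the data, loss, and learning rate are assumptions chosen only to illustrate the random-init / gradient / update cycle.

```python
import random

# Toy version of the training loop in the figure: random init -> gradient -> update.
# The linear model, the example, and the learning rate are made up for illustration.

random.seed(0)
W = [random.uniform(-0.1, 0.1) for _ in range(4)]    # W: {w1, w2, w3, w4}, random init

# One labelled example: four features, label 1.0 = "cat", 0.0 = "dog".
x, y = [0.5, -1.0, 2.0, 0.3], 1.0
lr = 0.1

for step in range(100):
    pred = sum(w * xi for w, xi in zip(W, x))         # apply the model
    error = pred - y                                  # compare with the truth
    G = [error * xi for xi in x]                      # G: {g1, g2, g3, g4}
    W = [w - lr * g for w, g in zip(W, G)]            # update the weights: W' = W - lr * G
```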

Example: Neural Network


[Figure: a model with weights W1–W4 is trained, then applied to classify an image as Dog: 99%, Cat: 1%]

Model Training


[Figure: model training refines the weights W1–W4 from a random initialization until they converge to the final model]
