
Optimizing Network Performance in Distributed Machine Learning

Luo Mai, Chuntao Hong, Paolo Costa

Machine Learning

• Successful in many fields
• Online advertisement

• Spam filtering

• Fraud detection

• Image recognition

• …

• One of the most important workloads in data centers


Industry Scale Machine Learning

• More data, higher accuracy

• Scale of industry problems
• 100 billion samples, 1 TB – 1 PB of data
• 10 billion parameters, 1 GB – 1 TB of data
• Distributed execution across 100s – 1,000s of machines


Distributed Machine Learning

[Figure: workers W1–W4, each holding a model replica and a data partition]

Distributed Machine Learning

[Figure: each worker computes gradients on its own data partition, e.g., W1 + 0.1, W2 + 0.2, W3 – 0.3, W4 + 1.2 on one worker and W1 – 0.9, W2 + 0.5, W3 – 0.1, W4 – 0.5 on another]

Distributed Machine Learning

[Figure: workers push to a parameter server]
1. Push gradients
2. Aggregate the gradients for each parameter

Distributed Machine Learning

[Figure: the parameter server updates the model, and the workers pull it]
3. Add the gradients to the parameters: W1 + g1, W2 + g2, W3 + g3, W4 + g4
4. Pull the new parameters
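To make these steps concrete, here is a minimal Python sketch of one synchronous round with a single parameter server, reusing the toy gradient values from the earlier slide; the class and method names are illustrative and are not MLNet's or any particular framework's API.

```python
# Illustrative sketch of one synchronous training round (not a real framework API).

class ParameterServer:
    def __init__(self, weights):
        self.weights = dict(weights)                      # e.g. {"W1": 0.0, ...}
        self.pending = {k: 0.0 for k in weights}

    def push(self, gradients):
        # 2. Aggregate the gradient for each parameter.
        for name, g in gradients.items():
            self.pending[name] += g

    def apply(self):
        # 3. Add the aggregated gradients to the parameters.
        for name, g in self.pending.items():
            self.weights[name] += g
        self.pending = {k: 0.0 for k in self.pending}

    def pull(self):
        # 4. Workers pull the new parameters.
        return dict(self.weights)


ps = ParameterServer({"W1": 0.0, "W2": 0.0, "W3": 0.0, "W4": 0.0})
worker_gradients = [
    {"W1": 0.1, "W2": 0.2, "W3": -0.3, "W4": 1.2},    # worker A
    {"W1": -0.9, "W2": 0.5, "W3": -0.1, "W4": -0.5},  # worker B
]
for g in worker_gradients:        # 1. Each worker pushes its gradients.
    ps.push(g)
ps.apply()
new_weights = ps.pull()           # workers continue with the updated model
```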

Distributed Machine Learning

[Figure: parameters sharded across parameter servers, e.g., W1, W2 on one server and W3, W4 on another]
Use multiple parameter servers to avoid a single bottleneck
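A minimal sketch of how a worker's push can be split across multiple parameter servers, matching the layout in the figure above (W1, W2 on one server, W3, W4 on another); the partitioning function is an assumption made for illustration.

```python
# Illustrative parameter sharding across two parameter servers
# (assumed layout: W1/W2 on server 0, W3/W4 on server 1).

NUM_SERVERS = 2

def shard(name):
    """Return the index of the server that owns this parameter."""
    return 0 if name in ("W1", "W2") else 1

def split_push(gradients):
    """Split one worker's push into one smaller message per server."""
    per_server = [{} for _ in range(NUM_SERVERS)]
    for name, g in gradients.items():
        per_server[shard(name)][name] = g
    return per_server

messages = split_push({"W1": 0.1, "W2": 0.2, "W3": -0.3, "W4": 1.2})
# messages[0] == {"W1": 0.1, "W2": 0.2}, messages[1] == {"W3": -0.3, "W4": 1.2}
# Each server now receives only part of the push/pull traffic.
```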

Distributed Machine Learning

[Figure: workers, parameter servers, and the network bottleneck]

Inbound Congestion

[Figure: network core with inbound congestion]

Outbound Congestion

[Figure: network core with outbound congestion]

Network Core Congestion

[Figure: over-subscribed network core]
Congestion occurs in the core in the case of over-subscribed networks

Existing Approaches


• Over-provisioning the network with fast network H/W (e.g., InfiniBand and RoCE): expensive, limited deployment scale, not available in public clouds

Existing Approaches


• Over-provisioning the network with fast network H/W (e.g., InfiniBand and RoCE): expensive, limited deployment scale, not available in public clouds
• Asynchronous training algorithms: hurt training efficiency, might not converge

Rethinking the Network Design

[Figure: MLNet sits between the training algorithm and the network H/W]

MLNet is a communication layer designed for distributed machine learning systems

Improves communication efficiency

Orthogonal to existing approaches

Rethinking the Network Design


Optimizations:
• Traffic reduction
• Flow prioritization

Traffic Reduction


Aggregate the gradients from 6 workers

$g_1 = g_{11} + g_{12} + g_{13} + g_{14} + g_{15} + g_{16}$

Traffic Reduction: Key Insight

[Figure: six workers push gradients to one parameter server]

Aggregation is commutative and associative


$(g_{11} + g_{12} + g_{13}) + (g_{14} + g_{15} + g_{16})$

Aggregate the gradients from 6 workers

Traffic Reduction: Key Insight

Aggregating gradients incrementally does not change the final result
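A quick numeric check of this insight, with made-up gradient values: summing all six pushes at the PS and first summing two rack-local partial results give the same total.

```python
import math

# Gradients for one parameter from six workers (made-up values).
g = [0.1, 0.2, -0.3, 1.2, -0.9, 0.5]

direct = sum(g)                      # the PS aggregates all six pushes itself
partial = sum(g[:3]) + sum(g[3:])    # two local partial sums, then the PS adds them

assert math.isclose(direct, partial)  # identical up to floating-point rounding
```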

Traffic Reduction: Design


Intercept the push message from the worker to the PS

Traffic Reduction: Design


Redirect the messages to a local worker for partial aggregation

Traffic Reduction: Design


Send the partial results to the PS for final aggregation
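A minimal sketch of these three steps, assuming one designated aggregator per rack and a parameter server that exposes a push() method as in the earlier sketch; the names here are illustrative, not MLNet's interface.

```python
# Illustrative rack-local partial aggregation (not MLNet's actual API).
# One worker per rack acts as the local aggregator; the other workers'
# push messages are redirected to it instead of going straight to the PS.

class LocalAggregator:
    def __init__(self, ps, expected_pushes):
        self.ps = ps                       # remote parameter server (has .push())
        self.expected = expected_pushes    # pushes to wait for per round
        self.partial = {}
        self.received = 0

    def on_push(self, gradients):
        # Step 2: partially aggregate gradients from workers in this rack.
        for name, g in gradients.items():
            self.partial[name] = self.partial.get(name, 0.0) + g
        self.received += 1
        if self.received == self.expected:
            # Step 3: send one combined message to the PS for final aggregation.
            self.ps.push(self.partial)
            self.partial, self.received = {}, 0

def worker_push(gradients, local_aggregator):
    # Step 1: the push that would have gone to the PS is intercepted
    # and redirected to the rack-local aggregator.
    local_aggregator.on_push(gradients)

# Usage: agg = LocalAggregator(ps, expected_pushes=3); each in-rack worker
# then calls worker_push(gradients, agg) instead of pushing to the PS directly.
```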


More details in the paper:
1. Traffic reduction in pull requests
2. Asynchronous communication

Traffic Prioritization


Traffic Prioritization: Key Insight


[Figure: flows from Jobs 1–4 share a bottleneck link]
These four TCP flows share a bottleneck link, and each of them gets 25% of the link's bandwidth

Traffic Prioritization: Key Insight


[Figure: with fair sharing, the flows of Jobs 1–4 (Models 1–4) all complete at time 4; average flow completion time (FCT) is 4]
All flows are delayed! TCP per-flow fairness is harmful in distributed machine learning.

Traffic Prioritization: Key Insight


MLNet prioritizes the competing flows to minimize the average training time


Traffic Prioritization: Key Insight


[Figure: with prioritization, the flows of Jobs 1–4 (Models 1–4) complete one after another; average flow completion time is 2]

Shortening the average FCT can largely improve the average training time
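A back-of-the-envelope sketch of the effect, assuming four equal flows of one unit each on a one-unit-per-second bottleneck link (numbers chosen purely for illustration): fair sharing finishes every flow at the latest possible time, while serving the flows one at a time finishes most of them much earlier without delaying the last one.

```python
# Four equal flows competing for one bottleneck link (illustrative numbers).
flows = [1.0, 1.0, 1.0, 1.0]     # units of data to transfer
link_rate = 1.0                  # units per second

# TCP-style fair sharing: each flow gets 1/4 of the link, so all finish together.
fair_fct = [len(flows) * f / link_rate for f in flows]      # [4.0, 4.0, 4.0, 4.0]

# Strict prioritization: flows are served one after another on the full link.
priority_fct, t = [], 0.0
for f in flows:
    t += f / link_rate
    priority_fct.append(t)                                   # [1.0, 2.0, 3.0, 4.0]

avg_fair = sum(fair_fct) / len(fair_fct)
avg_priority = sum(priority_fct) / len(priority_fct)         # much lower on average
```

Under these assumptions, three of the four jobs get their gradients delivered earlier and the fourth is no worse off; the exact average depends on flow sizes and on how completion is measured, but this is why prioritizing flows shortens average training time.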

Evaluation

• Simulate a common data center network topology
• Classic 10 Gbps, 1024-node data center topology [Fat-Tree, SIGCOMM'08]
• Train large-scale logistic regression
• 65B parameters, 141 TB dataset [Parameter Server, OSDI'14]
• 800 workers [Parameter Server, OSDI'14]
• With a production trace
• Data processing rate: uniform(100, 200) MBps

• Synchronize every 30 seconds


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers (50–400); series: Rack, Baseline; fewer servers is cost-effective, more is expensive; lower training time is better]

Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Baseline]

Rack reduces completion time by 48%

Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Baseline]

Deploying more parameter servers resolves edge network bottlenecks


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Baseline]

Deploying more parameter servers to reduce training time (1) uses more machines and (2) is only possible with non-oversubscribed networks

Traffic Reduction (1:4 Oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Baseline]

MLNet reduces congestion in the network core.

Reduces training time by >70%

Traffic Prioritization

• 20 jobs running in the same cluster


[Chart: CDF of training time (hours) across the 20 jobs; series: Baseline, Prioritization]

Everyone finishes at (almost) the same time

Traffic Prioritization


[Chart: CDF of training time (hours); series: Baseline, Prioritization]

Improves the median by 25%

Delays the tail by 2%


Traffic Prioritization + Traffic Reduction


[Chart: CDF of training time (hours); series: Baseline, Reduction, Prioritization + Reduction]

Improves the median by 60%

Improves the tail by 54%



More details in the paper:
1. Binary tree aggregation
2. More analysis

Summary

• MLNet can significantly improve the network performance of distributed machine learning
• Traffic reduction

• Flow prioritization

• Drop-in solution


Thanks!


Discussion

• Relaxed fault tolerance?
• When a worker fails, drop that portion of the data
• Adaptive communication
• Reduce synchronization when the network is busy?
• Hybrid network infrastructure?
• Some machines with 10 GbE, some with 40 GbE RoCE, etc.
• Degree of the aggregation tree?


Traffic Reduction: Design

Is the local aggregator a new bottleneck?


Example: 15 workers in a rack

Traffic Reduction: Design

Build a balanced aggregation structure such as a binary tree.


Example: 15 workers in a rack, aggregated through a binary tree
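A minimal sketch of the balanced aggregation described above, assuming gradients are plain dictionaries merged pairwise level by level so that no single node has to absorb all 15 push messages; this illustrates the idea and is not the paper's exact scheme.

```python
# Illustrative binary-tree aggregation: gradients are combined pairwise,
# so each node only ever aggregates two messages at a time.

def merge(a, b):
    """Add two gradient dictionaries."""
    out = dict(a)
    for name, g in b.items():
        out[name] = out.get(name, 0.0) + g
    return out

def tree_aggregate(gradients):
    """Reduce a list of per-worker gradient dicts with a balanced tree."""
    level = list(gradients)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(merge(level[i], level[i + 1]))
        if len(level) % 2:                 # an odd node passes through to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Example: 15 workers in a rack, each contributing a gradient for "W1".
worker_grads = [{"W1": 0.01 * i} for i in range(15)]
total = tree_aggregate(worker_grads)       # a single message is finally sent to the PS
```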


Traffic Reduction

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]

Binary Tree and Rack reduce completion time by 78% and 48%, respectively

Traffic Reduction (Non-oversubscribed Net.)


[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]

Deploying more parameter servers resolves edge network bottlenecks


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]

Deploying more parameter servers to reduce training time (1) needs more machines and (2) is only possible with non-oversubscribed networks


Traffic Reduction (1:4 Oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]

Traffic Reduction (1:4 Oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]

MLNet reduces congestion in the network core

Traffic Reduction (1:4 Oversubscribed Net.)

[Chart: training time (hours) vs. number of parameter servers; series: Rack, Binary, Baseline]

Binary consistently consumes more bandwidth than Rack

Example: Training a Neural Network


[Figure: training loop. Randomly initialize the weights W: {w1, w2, w3, w4}; compare predictions against the truth {cat, dog, cat, …} to calculate the error/gradient G: {g1, g2, g3, g4}; update the weights to W': {w1', w2', w3', w4'}]
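A toy version of this loop for a linear model with four weights; the data, loss, and learning rate are assumptions chosen only to illustrate the random-init / gradient / update cycle.

```python
import random

# Toy version of the training loop in the figure: random init -> gradient -> update.
# The linear model, the example, and the learning rate are made up for illustration.

random.seed(0)
W = [random.uniform(-0.1, 0.1) for _ in range(4)]    # W: {w1, w2, w3, w4}, random init

# One labelled example: four features, label 1.0 = "cat", 0.0 = "dog".
x, y = [0.5, -1.0, 2.0, 0.3], 1.0
lr = 0.1

for step in range(100):
    pred = sum(w * xi for w, xi in zip(W, x))         # apply the model
    error = pred - y                                  # compare with the truth
    G = [error * xi for xi in x]                      # G: {g1, g2, g3, g4}
    W = [w - lr * g for w, g in zip(W, G)]            # update the weights: W' = W - lr * G
```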

Example: Neural Network


[Figure: a model with weights W1–W4 is trained, then applied to classify an image as Dog: 99%, Cat: 1%]

Model Training


[Figure: model training refines the weights W1–W4 from a random initialization until they converge to the final model]
