Optimizing Network Performance in Distributed Machine Learning
Luo Mai, Chuntao Hong, Paolo Costa
Machine Learning
• Successful in many fields
  • Online advertisement
  • Spam filtering
  • Fraud detection
  • Image recognition
  • …
• One of the most important workloads in data centers
Industry-Scale Machine Learning
• More data, higher accuracy
• Scale of industry problems
  • 100 billion samples, 1 TB – 1 PB of data
  • 10 billion parameters, 1 GB – 1 TB of data
• Distributed execution
  • 100s – 1000s of machines
Distributed Machine Learning
[Figure: each worker holds a model replica (parameters W1–W4) and a data partition]
Distributed Machine Learning
[Figure: each worker computes a gradient for its model replica on its data partition, e.g. W1 + 0.1, W2 + 0.2, W3 − 0.3, W4 + 1.2 on one worker and W1 − 0.9, W2 + 0.5, W3 − 0.1, W4 − 0.5 on another]
Distributed Machine Learning
[Figure: workers with model replicas and data partitions, plus a parameter server]
1. Push gradients to the parameter server
2. Aggregate the gradients for each parameter
Distributed Machine Learning
[Figure: the parameter server applies the aggregated gradients: W1 + g1, W2 + g2, W3 + g3, W4 + g4]
3. Add gradients to parameters
4. Pull new parameters
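To make the four steps concrete, here is a minimal sketch in Python of one synchronous iteration with a single parameter server. All names (ParameterServer, push, pull, worker_gradient) are illustrative, not the API of any particular system, and the gradient computation is a placeholder.

```python
# Minimal sketch of one synchronous parameter-server iteration (illustrative).

class ParameterServer:
    def __init__(self, num_params):
        self.weights = [0.0] * num_params          # W1..Wn
        self.pending = [0.0] * num_params          # aggregated gradients

    def push(self, gradients):
        # Step 2: aggregate the gradient for each parameter.
        for i, g in enumerate(gradients):
            self.pending[i] += g

    def apply(self):
        # Step 3: add the aggregated gradients to the parameters
        # (a real system would scale by a learning rate here).
        self.weights = [w + g for w, g in zip(self.weights, self.pending)]
        self.pending = [0.0] * len(self.weights)

    def pull(self):
        # Step 4: workers pull the new parameters.
        return list(self.weights)


def worker_gradient(data_partition, weights):
    # Placeholder for the real gradient computation on one data partition.
    return [0.01 * sum(data_partition) for _ in weights]


ps = ParameterServer(num_params=4)
partitions = [[1, 2], [3, 4], [5, 6]]              # one data partition per worker

# Step 1: each worker computes its gradient and pushes it to the server.
for part in partitions:
    ps.push(worker_gradient(part, ps.pull()))      # step 2 happens inside push()

ps.apply()                                         # step 3
print(ps.pull())                                   # step 4: pull new parameters
```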
Distributed Machine Learning
[Figure: parameters W1–W4 sharded across multiple parameter servers]
Use multiple parameter servers to avoid a bottleneck.
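A small sketch of how parameters might be sharded across several servers so that no single server's link absorbs all the gradient traffic. The names and the round-robin assignment are assumptions for illustration; real systems often use key ranges or consistent hashing.

```python
# Sketch: shard the model's parameters across several parameter servers
# so that pushes and pulls are spread over multiple links (illustrative).

NUM_SERVERS = 2
PARAM_KEYS = ["w1", "w2", "w3", "w4"]

# Assign each parameter to a server round-robin (hypothetical scheme).
OWNER = {key: i % NUM_SERVERS for i, key in enumerate(PARAM_KEYS)}

def route_push(gradients):
    """Split one worker's gradients into one message per owning server."""
    messages = {server: {} for server in range(NUM_SERVERS)}
    for key, g in gradients.items():
        messages[OWNER[key]][key] = g
    return messages

grads = {"w1": 0.1, "w2": 0.2, "w3": -0.3, "w4": 1.2}
for server, msg in route_push(grads).items():
    print(f"push to parameter server {server}: {msg}")
```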
Distributed Machine Learning
[Figure: workers and parameter servers; even with multiple parameter servers, a bottleneck appears in the network]
Inbound Congestion
[Figure: inbound congestion at a node's link from the network core]
Outbound Congestion
[Figure: outbound congestion at a node's link into the network core]
Network Core Congestion
[Figure: over-subscribed network core]
Congestion occurs in the core when the network is over-subscribed.
Existing Approaches
• Over-provisioning the network (fast network H/W, e.g. InfiniBand and RoCE)
  • Expensive
  • Limited deployment scale
  • Not available in public clouds
Existing Approaches
• Over-provisioning the network
  • Expensive
  • Limited deployment scale
  • Not available in public clouds
• Asynchronous training algorithms
  • Lower training efficiency
  • Might not converge
Rethinking the Network Design
[Figure: MLNet sits between the training algorithm and the network H/W]
• MLNet is a communication layer designed for distributed machine learning systems
  • Improves communication efficiency
  • Orthogonal to existing approaches
Rethinking the Network Design
• Optimizations:
  • Traffic reduction
  • Flow prioritization
Traffic Reduction
Traffic Reduction: Key Insight
[Figure: 6 workers push gradients to one parameter server]
Aggregate the gradients from 6 workers:
g1 = g11 + g12 + g13 + g14 + g15 + g16   (g1j is worker j's gradient for parameter 1)
Traffic Reduction: Key Insight
Aggregation is commutative and associative:
g1 = (g11 + g12 + g13) + (g14 + g15 + g16)
Aggregating the gradients incrementally does not change the final result.
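A tiny check of this insight, using nothing beyond plain Python and made-up gradient values: summing the six workers' gradients in two partial groups and then combining the partials gives the same aggregate (up to floating-point rounding) as summing them all at the parameter server.

```python
# Gradients for one parameter from 6 workers (illustrative values).
worker_grads = [0.1, 0.2, -0.3, 1.2, -0.9, 0.5]

# Direct aggregation at the parameter server.
g1_direct = sum(worker_grads)

# Incremental aggregation: two local aggregators each combine 3 workers,
# then the parameter server combines the two partial results.
partial_a = sum(worker_grads[:3])     # g11 + g12 + g13
partial_b = sum(worker_grads[3:])     # g14 + g15 + g16
g1_incremental = partial_a + partial_b

# Same result, but only 2 messages reach the parameter server instead of 6.
assert abs(g1_direct - g1_incremental) < 1e-12
print(g1_direct, g1_incremental)
```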
Traffic Reduction: Design
Intercept the push message from the worker to the PS
Traffic Reduction: Design
Redirect the messages to a local worker for partial aggregation
Traffic Reduction: Design
Send the partial results to the PS for final aggregation
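A hypothetical sketch of these three design steps. The names (LocalAggregator, intercept_push, forward_to_ps) and the round-counting logic are assumptions for illustration; the actual MLNet layer intercepts push messages transparently below the training framework.

```python
# Sketch of rack-local gradient aggregation (illustrative only).

class LocalAggregator:
    def __init__(self, rack_size, forward_to_ps):
        self.rack_size = rack_size            # pushes expected per round
        self.forward_to_ps = forward_to_ps    # callback toward the real PS
        self.partial = {}                     # parameter key -> partial sum
        self.received = 0

    def intercept_push(self, gradients):
        # Steps 1-2: intercept the worker's push and aggregate it locally.
        for key, g in gradients.items():
            self.partial[key] = self.partial.get(key, 0.0) + g
        self.received += 1
        if self.received == self.rack_size:
            # Step 3: send one partial result to the PS for final aggregation.
            self.forward_to_ps(self.partial)
            self.partial, self.received = {}, 0


sent_to_ps = []
agg = LocalAggregator(rack_size=3, forward_to_ps=sent_to_ps.append)
for grads in [{"w1": 0.1}, {"w1": 0.2}, {"w1": -0.3}]:
    agg.intercept_push(grads)

print(sent_to_ps)   # one message to the PS instead of three
```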
More details in the paper:
1. Traffic reduction in pull requests
2. Asynchronous communication
Traffic Prioritization
Traffic Prioritization: Key Insight
[Figure: flows from Jobs 1–4 cross the same link]
These four TCP flows share a bottleneck link, and each gets 25% of the link's bandwidth.
Traffic Prioritization: Key Insight
[Figure: timeline of flow completion times (FCT), 0–4, for Jobs 1–4 (Models 1–4) under fair sharing; average completion time is 4]
All flows are delayed! TCP per-flow fairness is harmful in distributed machine learning.
Traffic Prioritization: Key Insight
MLNet prioritizes the competing flows to minimize the average training time.
[Figure: flows from Jobs 1–4 scheduled with different priorities]
Traffic Prioritization: Key Insight
[Figure: timeline of flow completion times (FCT), 0–4, for Jobs 1–4 (Models 1–4) with prioritization; jobs finish one after another and the average completion time is 2]
Shortening the average FCT can largely improve the average training time.
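A toy calculation of the effect, under the assumption of four identical flows that each need one time unit at the full link rate; the slide's figure reports the same qualitative behavior.

```python
# Toy model: four jobs' flows, each needing 1 unit of time at full link rate.
flow_demands = [1.0, 1.0, 1.0, 1.0]

# Fair sharing: each flow gets 1/4 of the link, so all four finish at time 4.
fair_fct = [sum(flow_demands)] * len(flow_demands)            # [4, 4, 4, 4]

# Prioritized (one flow at a time, run to completion): finish at 1, 2, 3, 4.
prio_fct, now = [], 0.0
for demand in flow_demands:
    now += demand
    prio_fct.append(now)

print("average FCT, fair sharing:", sum(fair_fct) / len(fair_fct))   # 4.0
print("average FCT, prioritized: ", sum(prio_fct) / len(prio_fct))   # 2.5
```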
Evaluation
• Simulate a common data center network topology
  • Classic 10 Gbps, 1024-node data center topology [Fat-Tree, SIGCOMM'08]
• Train a large-scale logistic regression model
  • 65B parameters, 141 TB dataset [Parameter Server, OSDI'14]
  • 800 workers [Parameter Server, OSDI'14]
• Driven by a production trace
  • Data processing rate: uniform(100, 200) MB/s
  • Synchronize every 30 seconds
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Baseline; lower is better; fewer servers is cost-effective, more is expensive]
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time vs. number of parameter servers; series: Rack, Baseline]
Rack reduces completion time by 48%.
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time vs. number of parameter servers; series: Rack, Baseline]
Deploying more parameter servers resolves edge network bottlenecks.
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time vs. number of parameter servers; series: Rack, Baseline]
Deploying more parameter servers to reduce training time (1) uses more machines and (2) is only possible with non-oversubscribed networks.
Traffic Reduction (1:4 Oversubscribed Net.)
[Chart: training time vs. number of parameter servers; series: Rack, Baseline]
MLNet reduces congestion in the network core and reduces training time by more than 70%.
Traffic Prioritization
• 20 jobs running in the same cluster
[Chart: CDF of training time (hours, 0–14); series: Baseline, Prioritization]
All jobs finish at (almost) the same time.
Traffic Prioritization
[Chart: CDF of training time (hours); series: Baseline, Prioritization; left is better]
Prioritization improves the median by 25% and delays the tail by 2%.
Traffic Prioritization + Traffic Reduction
[Chart: CDF of training time (hours); series: Baseline, Reduction, Prioritization + Reduction; left is better]
Prioritization + reduction improves the median by 60% and the tail by 54%.
More details in the paper:
1. Binary-tree aggregation
2. More analysis
Summary
• MLNet can significantly improve the network performance of distributed machine learning
  • Traffic reduction
  • Flow prioritization
• Drop-in solution
Thanks!
Discussion
• Relaxed fault tolerance?
  • When a worker fails, drop that portion of the data
• Adaptive communication?
  • Reduce synchronization when the network is busy?
• Hybrid network infrastructure?
  • Some machines with 10 GbE, some with 40 GbE RoCE, etc.
• Degree of the tree?
Traffic Reduction: Design
Is the local aggregator a new bottleneck?
Example: 15 workers in a rack
Traffic Reduction: Design
Build a balanced aggregation structure such as a binary tree.
Example: 15 workers in a rack, aggregated with a binary tree
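A hypothetical sketch of balanced, binary-tree aggregation for 15 workers in a rack: each internal node combines the partial results of at most two children, so no single node becomes a new aggregation bottleneck. The function name and pairwise-reduction scheme are assumptions for illustration.

```python
# Sketch: binary-tree aggregation over 15 workers' gradients (illustrative).

def tree_aggregate(values):
    """Pairwise-reduce a list of per-worker gradients, level by level."""
    level = list(values)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])   # each node sums 2 children
        if len(level) % 2 == 1:                   # an odd node is carried up
            nxt.append(level[-1])
        level = nxt
    return level[0]

worker_grads = [float(i) for i in range(1, 16)]          # 15 workers in a rack
assert tree_aggregate(worker_grads) == sum(worker_grads)  # same result: 120.0
print(tree_aggregate(worker_grads))
```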
Traffic Reduction
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline; lower is better; fewer servers is cost-effective, more is expensive]
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time vs. number of parameter servers; series: Rack, Binary, Baseline]
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time vs. number of parameter servers; series: Rack, Binary, Baseline]
Binary and Rack reduce completion time by 78% and 48%, respectively.
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time vs. number of parameter servers; series: Rack, Binary, Baseline]
Deploying more parameter servers resolves edge network bottlenecks.
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time vs. number of parameter servers; series: Rack, Binary, Baseline]
Deploying more parameter servers to reduce training time (1) needs more machines and (2) is only possible with non-oversubscribed networks.
Traffic Reduction (1:4 Oversubscribed Net.)
[Chart: training time vs. number of parameter servers; series: Rack, Binary, Baseline]
Traffic Reduction (1:4 Oversubscribed Net.)
[Chart: training time vs. number of parameter servers; series: Rack, Binary, Baseline]
MLNet reduces congestion in the network core.
Traffic Reduction (1:4 Oversubscribed Net.)
[Chart: training time vs. number of parameter servers; series: Rack, Binary, Baseline]
Binary consistently consumes more bandwidth than Rack.
Example: Training a Neural Network
[Figure: training loop. Randomly initialize the weights W: {w1, w2, w3, w4}; calculate the error/gradient G: {g1, g2, g3, g4} against the truth labels {cat, dog, cat, …}; update the weights to W': {w1', w2', w3', w4'}; repeat]
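To tie the diagram to code, here is a minimal, hypothetical training loop in the same spirit: start from randomly initialized weights, compute an error/gradient against target values, and update the weights. The loss and gradient below are placeholders (a simple squared error per weight), not a real neural network.

```python
import random

# W: randomly initialized weights {w1, w2, w3, w4}
weights = [random.uniform(-0.1, 0.1) for _ in range(4)]
targets = [0.5, -0.2, 0.3, 0.0]            # stand-in for the "truth" labels

for step in range(100):
    # Calculate error/gradient G: {g1, g2, g3, g4} (placeholder: squared error).
    grads = [2.0 * (w - t) for w, t in zip(weights, targets)]
    # Update weights: W' = W - lr * G
    weights = [w - 0.1 * g for w, g in zip(weights, grads)]

print(weights)   # converges toward the target values
```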
Example: Neural Network
[Figure: a model with weights W1–W4 is trained, then applied to an input, producing Dog: 99%, Cat: 1%]
Model Training
[Figure: model training. Starting from a randomly initialized model (weights W1–W4), training repeatedly refines the model until it converges to the final model]