ATP: In-network Aggregation for Multi-tenant Learning
Chonlam Lao*, Yanfang Le*, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, Michael Swift
Tsinghua University University of Wisconsin-Madison
* = co-primary authors
Distributed Training (PS Architecture)

• Workers (Worker1 … Worker4) each compute local gradients (a1, b1 … a4, b4)
• Workers push their gradients to the parameter servers (PS); the PS aggregates them, e.g., a’ = a1 + a2 + a3 + a4
• The PS broadcasts the aggregated gradient a’ back to every worker (see the sketch below)

The network can be a bottleneck for distributed training.
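As a minimal Python sketch of one such aggregation round (my illustration, not ATP or BytePS code): workers push gradient vectors, the PS sums them into a’, and the same a’ is returned to every worker.

import numpy as np

def ps_round(worker_gradients):
    """Aggregate one gradient shard across workers and broadcast the sum."""
    # a' = a1 + a2 + a3 + a4: elementwise sum over all workers' gradients.
    aggregated = np.sum(worker_gradients, axis=0)
    # Every worker receives the same aggregated gradient a'.
    return [aggregated.copy() for _ in worker_gradients]

# Four workers, each holding one gradient shard "a".
grads = [np.full(4, i + 1.0) for i in range(4)]            # a1 .. a4
updates = ps_round(grads)
assert np.allclose(updates[0], [10.0, 10.0, 10.0, 10.0])   # 1 + 2 + 3 + 4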
Trend of In-network Computation

• Programmable switches offer in-transit packet processing and in-network state
• Reduce training time by moving gradient aggregation into the network

[Figure: switch pipeline — Packet Ingress → register arrays → Traffic Manager → Packet Egress]
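A toy model of this in-network state (an assumption-laden Python sketch, not Tofino/P4 code): each gradient packet is added into a register slot as it transits the pipeline, so the running sum lives in the switch.

class SwitchRegisters:
    def __init__(self, num_slots, width):
        # One register array; each slot accumulates one packet-sized
        # gradient fragment in the data plane.
        self.slots = [[0] * width for _ in range(num_slots)]

    def on_packet(self, slot, values):
        # In-transit processing: add the packet's gradient values into
        # the slot's running sum, then let the result egress.
        acc = self.slots[slot]
        for i, v in enumerate(values):
            acc[i] += v
        return list(acc)

sw = SwitchRegisters(num_slots=2, width=3)
sw.on_packet(0, [1, 2, 3])
print(sw.on_packet(0, [4, 5, 6]))   # [5, 7, 9]: aggregated in the network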
State-of-the-art In-network Aggregation

• SwitchML (Sapio et al., NSDI’21)
  • Targets single-rack settings
  • Supports multiple jobs by static partitioning of switch resources (e.g., Job 1 gets registers R1–R3, Job 2 gets R5–R7)
• Shortcomings
  • Uses switch resources inefficiently
  • Does not consider multi-rack settings

[Figure: throughput (Gbps) over time (ms), illustrating the inefficiency of static partitioning]
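To make the contrast concrete, here is a sketch of the static scheme ATP argues against (my illustration, not SwitchML's code): registers are split up front per job, so an idle job's share cannot be reused by a busy one.

REGISTERS = ["R1", "R2", "R3", "R4", "R5", "R6", "R7", "R8"]

def static_partition(jobs):
    """Evenly split the register pool among jobs, fixed for their lifetime."""
    share = len(REGISTERS) // len(jobs)
    return {job: REGISTERS[i * share:(i + 1) * share]
            for i, job in enumerate(jobs)}

alloc = static_partition(["job1", "job2"])
print(alloc)   # job1 gets R1-R4, job2 gets R5-R8, even if job2 is idle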
Key Goal

Speed up multiple distributed training jobs in a cluster while maximizing the benefits of in-network, multi-switch aggregation.
Outline

• Multi-tenant
• Multi-rack
• Additional challenges
  • Reliability
  • Congestion control
  • Improved floating-point computation
• Evaluation
Multi-tenant: dynamic allocation

• Objective: maximize switch resource utilization
• Key idea: dynamic allocation at per-packet granularity
  • Randomly hash gradient packets across the whole aggregator memory (see the sketch below)

[Figure: Job 2’s workers (Worker 1 … Worker n) send gradient packets through the switch to the Job 2 PS; each packet is hashed to an aggregator anywhere in the shared pool]
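A minimal sketch of this key idea (the key layout and table size are my assumptions): each gradient packet is hashed over the entire shared aggregator memory rather than into a per-job partition, so any job can use any slot.

import hashlib

NUM_AGGREGATORS = 1024   # illustrative size of the shared aggregator pool

def aggregator_index(job_id, sequence_number):
    """Hash (job, packet sequence) to a slot anywhere in switch memory."""
    key = f"{job_id}:{sequence_number}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_AGGREGATORS

# Packets from any job can land on any aggregator slot.
print(aggregator_index(job_id=2, sequence_number=7))
print(aggregator_index(job_id=1, sequence_number=7))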
Challenge 1: Heavy Contention

With multiple jobs (Job 1, Job 2, Job 3) hashing packets into the same aggregator pool, packets from different jobs can contend for the same aggregator. ATP resolves contention best-effort: a packet that cannot claim its aggregator bypasses in-switch aggregation and is forwarded to the PS (see the sketch below).

[Figure: Job 1/2/3 workers (Worker 1 … Worker n) and their PSes sharing one switch; contending packets are forwarded best-effort to the PS]
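A sketch of this best-effort behavior as I understand it from the talk (the ownership details are assumptions): if the hashed aggregator is held by another job's packet, the gradient is not aggregated in-switch but forwarded to the PS unchanged.

class Aggregator:
    def __init__(self):
        self.owner = None    # (job_id, seq) currently using this slot
        self.sum = 0

def on_gradient_packet(agg, job_id, seq, value):
    """Returns ('aggregated', sum) or ('forward_to_ps', value)."""
    if agg.owner is None:
        agg.owner = (job_id, seq)          # claim the free slot
    if agg.owner == (job_id, seq):
        agg.sum += value                   # aggregate in the switch
        return ("aggregated", agg.sum)
    # Contention: another job holds the slot; fall back to best-effort
    # and let the PS aggregate this packet in software.
    return ("forward_to_ps", value)

a = Aggregator()
print(on_gradient_packet(a, job_id=1, seq=0, value=5))   # aggregated
print(on_gradient_packet(a, job_id=3, seq=0, value=9))   # forward_to_ps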
Challenge 2: Incomplete Aggregation

Contention can leave an aggregation incomplete. Example: Job 1’s gradient a1 reaches the switch but loses its aggregator to packets b, c, d from other jobs, so a1 travels to the PS unaggregated, while a2 … an are aggregated in the switch into a2 + a3 + … + an. The PS completes the aggregation in software: (a2 + a3 + … + an) + a1 (see the sketch below).

[Figure: Job 1/2/3 workers and PSes sharing one switch; a1 bypasses the occupied aggregator while a2 … an are aggregated in-switch, and the PS adds a1 to the partial sum]
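An illustrative sketch of the PS-side completion (this is not ATP's wire format): the PS adds gradients that bypassed the switch (a1) to the partial sum produced in the switch (a2 + … + an), releasing the result only when every worker is accounted for.

def ps_complete(partial_sum, partial_count, bypassed, num_workers):
    """Combine the in-switch partial sum with bypassed gradients."""
    total = partial_sum + sum(bypassed)
    # Only release the result once every worker's gradient is accounted for.
    assert partial_count + len(bypassed) == num_workers
    return total

# Switch aggregated a2..a5 (n = 5); a1 lost the aggregator and bypassed.
print(ps_complete(partial_sum=2 + 3 + 4 + 5, partial_count=4,
                  bypassed=[1], num_workers=5))   # 15 = a1 + ... + a5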
Inter-Rack Aggregation

• Aggregating at every layer of the network topology is impractical: routing is nondeterministic (e.g., ECMP)
• ATP instead supports two-level aggregation at ToR switches (see the sketch below)
  • Workers and PS(es) may be located in different racks
  • Scales up to 1024 workers

[Figure: workers W1–W6 and a PS spread across racks under ToR switches SW1, SW2, SW3 in a datacenter network; SW1 and SW2 produce partial aggregates a1’ and a2’, and SW3 (the PS’s ToR) combines a1’ + a2’ + a5 + a6 before delivering the sum to the PS]
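A sketch of this two-level aggregation (the rack layout is my reading of the figure): each worker-side ToR produces a partial sum, and the PS-side ToR combines the partials with gradients from workers in its own rack.

def tor_aggregate(gradients):
    """First level: a ToR switch sums the gradients of its local workers."""
    return sum(gradients)

# Rack 1 (SW1): W1, W2; Rack 2 (SW2): W3, W4; Rack 3 (SW3): W5, W6 + PS.
a1p = tor_aggregate([1, 2])        # a1' from SW1
a2p = tor_aggregate([3, 4])        # a2' from SW2
# Second level at SW3: partials plus the PS-rack workers' gradients.
total = tor_aggregate([a1p, a2p, 5, 6])   # a1' + a2' + a5 + a6
print(total)   # 21 = sum of all six workers' gradients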
Additional Challenges

• Rethink reliability
  • Recover from packet loss
  • Ensure exactly-once aggregation
  • Avoid memory leaks: aggregators reserved forever but never used
• Rethink congestion control
  • N flows are merged into one flow of communication
  • Aggregation drops congestion signals (e.g., ECN marks)
• Improve floating-point computation (see the sketch below)
  • Workers convert gradients to 32-bit integers using a scaling factor
  • Handle aggregation overflow at the switch
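A sketch of the worker-side conversion (the scaling factor and overflow handling here are assumptions, not ATP's exact scheme): floats are scaled to 32-bit integers so the switch can add them, and the running sum is checked against int32 overflow.

SCALE = 1 << 16          # illustrative fixed-point scaling factor
INT32_MAX = 2**31 - 1

def to_int32(gradient):
    """Worker side: scale a float gradient into a 32-bit integer."""
    return int(round(gradient * SCALE))

def switch_add(acc, value):
    """Switch side: integer add, flagging overflow so hosts can rescale."""
    s = acc + value
    if abs(s) > INT32_MAX:
        raise OverflowError("aggregation overflow: shrink the scaling factor")
    return s

acc = 0
for g in [0.5, -0.25, 1.125]:         # gradients from three workers
    acc = switch_add(acc, to_int32(g))
print(acc / SCALE)                     # 1.375, recovered at the PS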
ATP Implementation and Evaluation

• Implementation
  • Replace the networking stack of BytePS at the end host
  • Implement the in-network aggregation service in P4 on a Barefoot Tofino switch
• Evaluation
  • Setup: 9 servers, each with one GPU and one 100G NIC
  • Baselines: (BytePS + TCP, BytePS + RDMA) x (N-to-1, N-to-N), SwitchML, Horovod + RDMA, Horovod + TCP
  • Metrics: training throughput, time-to-accuracy
  • Workloads: AlexNet, VGG11, VGG16, VGG19, ResNet50, ResNet101, and ResNet152
Single Job Performance

ATP is comparable to, and often outperforms, state-of-the-art approaches. ATP sees larger performance gains on network-intensive workloads (VGG) than on computation-intensive workloads (ResNet).
Multiple Jobs: dynamic (ATP) vs static

• 3 VGG16 jobs
• The static approach evenly distributes aggregators across jobs
• PTA: the number of aggregators each job needs to reach its peak aggregation throughput

[Figure: training throughput (images/sec), Dynamic vs Static, as the available aggregators shrink from 100% to 33% of PTA]

When switch memory is sufficient, ATP’s dynamic allocation ≈ static.
When switch memory is insufficient, ATP’s dynamic allocation > static.

The paper includes more evaluations: packet loss recovery overhead, time-to-accuracy, and congestion control in various scenarios.
ATP Summary

• A network service that supports best-effort, dynamic in-network aggregation, aimed at multi-rack, multi-tenant settings
• Co-designs end-host and switch logic for:
  • Reliability
  • Congestion control
  • Floating-point handling

Open source: https://github.com/in-ATP/ATP
ATP: In-network Aggregation for Multi-tenant Learning
Chonlam Lao*, Yanfang Le*, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, Michael Swift
Tsinghua University University of Wisconsin-Madison
* = co-primary authors
Thank You!

Open source: https://github.com/in-ATP/ATP