
ATP: In-network Aggregation for Multi-tenant Learning


Page 1: ATP: In-network Aggregation for Multi-tenant Learning

ATP: In-network Aggregation for Multi-tenant Learning

Chonlam Lao*, Yanfang Le*, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, Michael Swift

Tsinghua University and University of Wisconsin-Madison

* = co-primary authors

Pages 2–6: ATP: In-network Aggregation for Multi-tenant Learning

Distributed Training (PS Architecture)

[Figure: Worker1 – Worker4 each push local gradients (a1, b1) … (a4, b4) to Parameter Server PS 1; the PS computes a' = a1+a2+a3+a4 and broadcasts a' back to all four workers]

The network can be a bottleneck for distributed training.
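To make the aggregation step concrete, here is a minimal Python sketch of what the PS does in this architecture; the function name and tensor shapes are illustrative, not taken from ATP or any particular PS framework.

```python
# Minimal sketch of PS-style gradient aggregation (illustrative names).
import numpy as np

def ps_aggregate(worker_gradients):
    """Sum per-worker gradients: a' = a1 + a2 + ... + an."""
    return np.sum(worker_gradients, axis=0)

# Four workers, each holding a gradient vector for the same parameter slice.
grads = [np.random.randn(1024).astype(np.float32) for _ in range(4)]
a_prime = ps_aggregate(grads)                          # PS computes a'
updates = {f"worker{i+1}": a_prime for i in range(4)}  # PS sends a' back
```

Every gradient crosses the network twice (workers to PS, PS back to workers), which is why the network becomes the bottleneck as models grow.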

Pages 7–9: ATP: In-network Aggregation for Multi-tenant Learning

Trend of In-network Computation

• Programmable switches offer in-transit packet processing and in-network state (per-stage registers)
• Reduce training time by moving gradient aggregation into the network

[Figure: programmable switch pipeline: Packet Ingress → match-action stages with Registers → Traffic Manager → Packet Egress]
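The idea can be modeled in plain Python; this is not P4 and ignores Tofino's pipeline constraints, and `SwitchRegister` and its fields are hypothetical names for a single aggregator slot.

```python
# Illustrative model of in-transit aggregation: each arriving gradient
# packet is folded into a register; once every worker has contributed,
# a single aggregated packet is emitted downstream.

class SwitchRegister:
    def __init__(self, num_workers):
        self.value = 0
        self.seen = 0
        self.num_workers = num_workers

    def on_packet(self, gradient_value):
        """Aggregate in transit; return the sum once all workers arrived."""
        self.value += gradient_value
        self.seen += 1
        if self.seen == self.num_workers:
            result, self.value, self.seen = self.value, 0, 0
            return result   # forward one aggregated packet
        return None         # consume the packet, wait for the rest

reg = SwitchRegister(num_workers=4)
for g in [1, 2, 3, 4]:
    out = reg.on_packet(g)
print(out)  # 10
```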

Pages 10–16: ATP: In-network Aggregation for Multi-tenant Learning

State-of-the-art In-network Aggregation

• SwitchML (Sapio et al., NSDI ’21)
  • Targets single-rack settings
  • Supports multiple jobs by statically partitioning switch resources
• Shortcomings
  • Uses switch resources inefficiently
  • Does not consider multi-rack settings

[Figure: a single switch statically partitioned between two jobs: Job 1's workers aggregate in registers R1–R3, Job 2's workers in registers R5–R7]

[Figure: throughput (Gbps, 0–40) over time (0–1250 ms), illustrating switch resource underutilization under static partitioning]

Page 17: ATP: In-network Aggregation for Multi-tenant Learning

Key Goal

Speed up multiple distributed training (DT) jobs in a cluster while maximizing the benefits of in-network, multi-switch aggregation.

Page 18: ATP: In-network Aggregation for Multi-tenant Learning

Outline

• Multi-tenant
• Multi-rack
• Additional challenges
  • Reliability
  • Congestion control
  • Improved floating-point computation
• Evaluation

Pages 19–31: ATP: In-network Aggregation for Multi-tenant Learning

Multi-tenant: dynamic allocation

• Objective: maximize switch resource utilization
• Key idea: dynamic allocation at per-packet granularity
  • Randomly hash gradient packets across the whole aggregator memory (see the sketch below)

[Figure: Job 2's workers (Worker 1 … Worker n) send gradient packets through the switch; each packet is hashed to an aggregator slot anywhere in switch memory, and the aggregated result is forwarded to the Job 2 PS]
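A sketch of the per-packet hashing idea, under the assumption that the hash key combines a job identifier and a packet sequence number; the pool size and the CRC32 hash are illustrative choices, not ATP's actual scheme.

```python
# Sketch of dynamic per-packet allocation: rather than statically
# partitioning registers per job, every gradient packet is hashed over
# the whole aggregator pool, so any job can land in any slot.
import zlib

NUM_AGGREGATORS = 8  # hypothetical pool size

def aggregator_index(job_id: int, seq: int) -> int:
    key = f"{job_id}:{seq}".encode()
    return zlib.crc32(key) % NUM_AGGREGATORS

# Packets from two jobs spread across the shared pool:
for job in (1, 2):
    slots = [aggregator_index(job, seq) for seq in range(4)]
    print(f"job {job} uses slots {slots}")
```

Because slots are drawn from one shared pool, a single job can spread over all of switch memory when it runs alone, and multiple jobs share the memory statistically rather than by fixed quota.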

Pages 32–38: ATP: In-network Aggregation for Multi-tenant Learning

Challenge 1: Heavy Contention

[Figure: Jobs 1, 2, and 3 (each with Worker 1 … Worker n and a PS) hash gradient packets into the same shared switch memory and contend for aggregator slots]

ATP's aggregation is best-effort: a packet that finds its slot occupied by another job is forwarded on to its PS and aggregated there, rather than being dropped or blocking the switch.
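One plausible way to model the best-effort fallback (simplified; the slot-ownership scheme and return values here are hypothetical, not ATP's wire protocol): a packet that loses the race for its hashed slot is passed through to the PS unaggregated.

```python
# Sketch of best-effort aggregation under contention. Each aggregator
# slot remembers which (job, seq) chunk owns it; packets for a foreign
# chunk fall back to plain forwarding and get summed at the PS.

aggregators = {}  # slot index -> {"owner": (job, seq), "value": partial sum}

def on_gradient_packet(job, seq, value, slot):
    entry = aggregators.get(slot)
    if entry is None:
        # Free slot: claim it and start aggregating this chunk in-switch.
        aggregators[slot] = {"owner": (job, seq), "value": value}
        return ("held_in_switch", None)
    if entry["owner"] == (job, seq):
        # Same chunk from another worker: accumulate in-switch.
        entry["value"] += value
        return ("held_in_switch", None)
    # Contention: slot busy with another job's chunk -> best-effort fallback.
    return ("forward_to_ps", value)

print(on_gradient_packet(job=1, seq=0, value=5, slot=3))  # claims slot 3
print(on_gradient_packet(job=2, seq=0, value=7, slot=3))  # forwarded to PS
```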

Pages 39–48: ATP: In-network Aggregation for Multi-tenant Learning

Challenge 2: Incomplete Aggregation

[Figure: Worker 1's gradient a1 misses the in-switch aggregator (the slot is held by packets b, c, d from other traffic) while a2 … an are aggregated in the switch into a2+a3+…+an; the partial sum and the stray a1 both travel to the Job 2 PS, which completes the aggregation: a2+a3+…+an + a1]
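A minimal sketch of the PS-side fix: the PS treats the switch's partial sum and any stray un-aggregated gradients uniformly, adding them together before applying the update. The function name is illustrative.

```python
# Sketch of how a PS completes an incomplete in-switch aggregation:
# the switch emits the partial sum a2 + a3 + ... + an, the stray a1
# that missed the aggregator arrives separately, and the PS adds both.

def ps_complete(partial_sum, stray_gradients):
    total = partial_sum
    for g in stray_gradients:
        total += g          # e.g., (a2 + ... + an) + a1
    return total

assert ps_complete(2 + 3 + 4, [1]) == 10
```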

Pages 49–54: ATP: In-network Aggregation for Multi-tenant Learning

Inter-Rack Aggregation

• Aggregating at every layer of the network topology is impractical: routing is nondeterministic (e.g., ECMP)
• ATP instead supports two-level aggregation at ToR switches
  • Workers and PS(es) may be located in different racks
  • Scales up to 1024 workers

[Figure: workers W1–W6 and the PS sit in three racks under ToR switches SW1, SW2, and SW3 (the PS's rack); SW1 aggregates a1 and a2 into a1', SW2 aggregates a3 and a4 into a2', and SW3 combines the partials with the local gradients a5 and a6, delivering a1'+a2'+a5+a6 to the PS]
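The two-level scheme from the figure, written out as arithmetic; the rack grouping (W1, W2 under SW1; W3, W4 under SW2; W5, W6 with the PS under SW3) is read off the slide, and the helper function is illustrative.

```python
# Sketch of two-level ToR aggregation: worker-side ToRs produce
# first-level partials; the PS-side ToR combines them with gradients
# from workers in its own rack.

def tor_aggregate(gradients):
    return sum(gradients)

a1, a2, a3, a4, a5, a6 = 1, 2, 3, 4, 5, 6
a1p = tor_aggregate([a1, a2])              # SW1: a1' = a1 + a2
a2p = tor_aggregate([a3, a4])              # SW2: a2' = a3 + a4
final = tor_aggregate([a1p, a2p, a5, a6])  # SW3: a1' + a2' + a5 + a6
assert final == a1 + a2 + a3 + a4 + a5 + a6
```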

Pages 55–58: ATP: In-network Aggregation for Multi-tenant Learning

Additional Challenges

• Rethink reliability
  • Recover from packet loss
  • Ensure exactly-once aggregation
  • Avoid memory leaks: aggregators reserved forever but never used
• Rethink congestion control
  • N worker flows are merged into one flow by in-switch aggregation
  • Aggregation drops congestion signals (e.g., ECN marks) carried by consumed packets
• Improve floating-point computation (see the sketch below)
  • Workers convert gradients to 32-bit integers using a scaling factor
  • Handle aggregation overflow at the switch
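A sketch of the fixed-point workaround, assuming a simple power-of-two scaling factor; ATP's actual factor selection and overflow handling are more involved, so treat this as an illustration of the idea, not the implementation.

```python
# Sketch of the floating-point workaround: Tofino-class switches add
# integers, not floats, so workers scale each float gradient by a factor
# and round to int32; the PS divides the aggregated integer back. Too
# large a factor risks int32 overflow once many workers are summed.
import numpy as np

SCALE = 1 << 16  # hypothetical scaling factor

def quantize(grad: np.ndarray) -> np.ndarray:
    q = np.rint(grad * SCALE)
    assert np.all(np.abs(q) < np.iinfo(np.int32).max), "scale too large"
    return q.astype(np.int32)

def dequantize(agg: np.ndarray) -> np.ndarray:
    return agg.astype(np.float32) / SCALE

grads = [np.random.uniform(-1, 1, 8).astype(np.float32) for _ in range(4)]
# The switch sums int32 values; int64 here lets us detect would-be overflow.
agg = np.sum([quantize(g).astype(np.int64) for g in grads], axis=0)
assert np.all(np.abs(agg) < np.iinfo(np.int32).max), "aggregate overflow"
result = dequantize(agg)  # approximates the float sum of the gradients
```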

Page 59: ATP: In-network Aggregation for Multi-tenant Learning

ATP Implementation and Evaluation

• Implementation
  • Replace the networking stack of BytePS at the end host
  • Use P4 to implement the in-network aggregation service on a Barefoot Tofino switch
• Evaluation
  • Setup: 9 servers, each with one GPU and one 100G NIC
  • Baselines: (BytePS + TCP, BytePS + RDMA) × (N-to-1, N-to-N), SwitchML, Horovod + RDMA, Horovod + TCP
  • Metrics: training throughput, time-to-accuracy
  • Workloads: AlexNet, VGG11, VGG16, VGG19, ResNet50, ResNet101, ResNet152

Pages 60–61: ATP: In-network Aggregation for Multi-tenant Learning

Single Job Performance

ATP is comparable to, and outperforms, the state-of-the-art approaches. ATP achieves larger performance gains on network-intensive workloads (VGG) than on computation-intensive workloads (ResNet).

Pages 62–67: ATP: In-network Aggregation for Multi-tenant Learning

Multiple Jobs: dynamic (ATP) vs. static

• 3 VGG16 jobs
• The static approach evenly distributes aggregators across jobs
• PTA: the number of aggregators each job needs to achieve its peak aggregation throughput

[Figure: training throughput (images/sec, 0–200) of Dynamic vs. Static as the available aggregators shrink from 100% to 33% of PTA]

When switch memory is sufficient, ATP's dynamic allocation ≈ static; when switch memory is insufficient, ATP's dynamic allocation > static.

The paper includes more evaluations: packet-loss recovery overhead, time-to-accuracy, and congestion control in various scenarios.

Page 68: ATP: In-network Aggregation for Multi-tenant Learning

ATP Summary

• A network service that supports best-effort, dynamic in-network aggregation, aimed at multi-rack, multi-tenant clusters
• Co-designs end-host and switch logic for:
  • Reliability
  • Congestion control
  • Floating-point handling

Open source: https://github.com/in-ATP/ATP

Page 69: ATP: In-network Aggregation for Multi-tenant Learning

ATP: In-network Aggregation for Multi-tenant Learning

Chonlam Lao*, Yanfang Le*, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, Michael Swift

Tsinghua University and University of Wisconsin-Madison

* = co-primary authors

Thank You!

Open source: https://github.com/in-ATP/ATP