Page 1: iSwitch: Accelerating Distributed Reinforcement Learning

iSwitch: Accelerating Distributed Reinforcement Learning with In-Switch Computing

Jian Huang, Youjie Li, Iou-Jen Liu, Yifan Yuan, Deming Chen, Alexander Schwing

University of Illinois at Urbana-Champaign

Page 2: iSwitch: Accelerating Distributed Reinforcement Learning

2

AI Applications are Increasingly Operating in Dynamic Environments

Examples: Autonomous Driving, Games, Robotics

Reinforcement Learning Empowers AI Applications to Take Real-Time Intelligent Actions

Page 4: iSwitch: Accelerating Distributed Reinforcement Learning

3

What is Reinforcement Learning?

[Diagram: the Agent observes a State and takes an Action in the Environment, which returns a Reward and the Next State. Training produces a Gradient that updates the Model.]

Train a Typical RL Agent on a Single GPU = 8 Days*

RL Requires Distributed Training for Improved Performance

*Mnih, ICML’16
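
The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the training code behind the slides: the `LineWorld` environment, the tabular Q-learning update, and every hyperparameter are hypothetical stand-ins for the Model and Gradient steps.

```python
import random


class LineWorld:
    """Hypothetical toy environment: walk from cell 0 to cell 4 on a line."""

    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(self.size - 1, max(0, self.state + move))
        done = self.state == self.size - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    env = LineWorld()
    q = [[0.0, 0.0] for _ in range(env.size)]  # the "model": a Q-table
    for _ in range(episodes):
        state = env.reset()
        for _ in range(100):  # cap the episode length
            # Agent: epsilon-greedy action from the current model.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # Environment: returns the reward and the next state.
            next_state, reward, done = env.step(action)
            # Training: the temporal-difference error plays the role of the gradient.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q
```

Calling `train()` returns the learned table; in this toy setting the greedy action in every non-terminal cell should end up being "move right".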

Page 15: iSwitch: Accelerating Distributed Reinforcement Learning

4

Centralized Distributed RL Training: Parameter-Server Based

[Diagram: Workers and a Parameter Server connected through a Switch. Each worker sends its Gradient to the Parameter Server, which performs Sum and Update Weight and sends the Model back to the workers.]

Drawbacks: Multiple Network Hops; Central Bottleneck at the Parameter Server
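
The "Sum" and "Update Weight" steps on the parameter server can be sketched as follows. The slide names only those two steps; the plain SGD rule and the learning rate `lr` are assumptions made for illustration.

```python
def parameter_server_step(weights, worker_gradients, lr=0.1):
    """One synchronous round: sum the workers' gradients, then update the weights."""
    # Sum: element-wise sum across all workers.
    summed = [sum(column) for column in zip(*worker_gradients)]
    # Update Weight: assumed plain SGD step on the summed gradient.
    return [w - lr * g for w, g in zip(weights, summed)]


weights = [1.0, 2.0]
worker_gradients = [[0.5, 1.0], [1.5, -1.0], [2.0, 0.0]]  # one gradient per worker
new_weights = parameter_server_step(weights, worker_gradients)
```

Every gradient has to reach the server before the sum can finish, and the updated model has to travel back, which is where the extra hops and the central bottleneck come from.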

Page 22: iSwitch: Accelerating Distributed Reinforcement Learning

5

Decentralized Distributed RL Training: AllReduce Based

[Diagram: Ring-AllReduce among Workers connected through a Switch. Each worker's Model produces a Gradient; partial Sums are passed around the ring until every worker holds the Full Aggregated Gradient ("Aggregation Complete!").]

Drawback: Excessive Network Hops
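
A small simulation makes the ring exchange concrete. It follows the textbook two-phase Ring-AllReduce (scatter-reduce, then all-gather); the slides do not show the exact schedule, so the chunking and send order here are assumptions.

```python
def ring_allreduce(vectors):
    """Simulate Ring-AllReduce; vectors[i] is worker i's gradient (equal lengths)."""
    n = len(vectors)
    length = len(vectors[0])
    # Chunk c covers indices bounds[c]:bounds[c + 1].
    bounds = [length * c // n for c in range(n + 1)]
    data = [list(v) for v in vectors]
    steps = 0

    def send(chunk_of, reduce):
        # Every worker i sends one chunk to its right neighbour at the same time.
        outgoing = []
        for i in range(n):
            c = chunk_of(i)
            outgoing.append((c, data[i][bounds[c]:bounds[c + 1]]))
        for i in range(n):
            c, payload = outgoing[i]
            dst = (i + 1) % n
            for k, value in enumerate(payload):
                idx = bounds[c] + k
                data[dst][idx] = data[dst][idx] + value if reduce else value

    for t in range(n - 1):  # phase 1: scatter-reduce (partial sums)
        send(lambda i: (i - t) % n, reduce=True)
        steps += 1
    for t in range(n - 1):  # phase 2: all-gather (full sums)
        send(lambda i: (i + 1 - t) % n, reduce=False)
        steps += 1
    return data, steps


result, steps = ring_allreduce([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]])
```

The simulation takes 2(N - 1) communication steps for N workers, so the number of sequential exchanges grows with the size of the ring.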

Page 28: iSwitch: Accelerating Distributed Reinforcement Learning

6

Network Communication is the Bottleneck in Distributed RL Training

Centralized Design (Parameter Server, Workers, Switch): Network Hops = 4

Decentralized Design (Ring-AllReduce, Workers, Switch): Network Hops = 4N - 4
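
One way to arrive at the two hop counts is sketched below. It assumes that N is the number of workers, that every node hangs off the same switch, and that Ring-AllReduce uses the usual 2(N - 1) steps; none of this is spelled out on the slide.

```python
def parameter_server_hops():
    # Gradient: worker -> switch -> server; model: server -> switch -> worker.
    return 2 + 2


def ring_allreduce_hops(n_workers):
    # 2 * (N - 1) ring steps, each crossing worker -> switch -> worker (2 hops).
    return 2 * (n_workers - 1) * 2
```

Under these assumptions the centralized count stays constant while the decentralized count grows linearly with N, matching the 4 and 4N - 4 shown above.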

Page 31: iSwitch: Accelerating Distributed Reinforcement Learning

7

The Unique Characteristic of Distributed RL Training: Latency Critical

RL Benchmark          DQN-Atari   A2C-Atari   PPO-MuJoCo   DDPG-MuJoCo
Gradient Size         6 MB        3 MB        40 KB        158 KB
Training Iterations   200 M       2 M         0.2 M        3 M

DNN Benchmark         AlexNet-ImageNet   ResNet50-ImageNet   VGG16-ImageNet   MLP-MNIST
Gradient Size         250 MB             100 MB              525 MB           4 MB
Training Iterations   320 K              600 K               370 K            10 K

88x Smaller Gradient Size, 158x More Iterations (RL compared with DNN)

Distributed RL Training is Latency Critical
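
The two headline ratios can be checked roughly against the table. The sketch below assumes they are ratios of arithmetic means over the four benchmarks in each row, which the slide does not state. With the rounded values shown, the iteration ratio comes out at about 158x, while the gradient-size ratio comes out nearer 96x than 88x, so the slide's figure presumably rests on unrounded sizes or a different averaging.

```python
MB, KB = 1e6, 1e3  # decimal units; the slide does not say which convention it uses

rl_gradient_bytes = [6 * MB, 3 * MB, 40 * KB, 158 * KB]     # DQN, A2C, PPO, DDPG
rl_iterations = [200e6, 2e6, 0.2e6, 3e6]
dnn_gradient_bytes = [250 * MB, 100 * MB, 525 * MB, 4 * MB]  # AlexNet, ResNet50, VGG16, MLP
dnn_iterations = [320e3, 600e3, 370e3, 10e3]


def mean(values):
    return sum(values) / len(values)


# How many times more iterations RL needs, and how many times smaller its gradients are.
iteration_ratio = mean(rl_iterations) / mean(dnn_iterations)
gradient_ratio = mean(dnn_gradient_bytes) / mean(rl_gradient_bytes)
```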

Page 35: iSwitch: Accelerating Distributed Reinforcement Learning

8

Quantifying the Network Overhead in Distributed RL Training

[Figure: two stacked-bar charts, one for Parameter Server and one for AllReduce, showing the share of training time (0% to 100%) spent on Local Computation (Compute) versus Grad Aggregation (Network) for DQN, A2C, PPO, and DDPG.]

Gradient Aggregation over the Network Dominates the Training Time (50~83%)

Page 38: iSwitch: Accelerating Distributed Reinforcement Learning

9

In-Switch Acceleration: A New Distributed Computing Paradigm

Programmable Switch + Aggregation Accelerator

Performance: Reduce End-to-End Network Latency

Programmability: Hardware-Algorithm Co-Design

Scalability: Scale Training at Rack Scale

Page 42: iSwitch: Accelerating Distributed Reinforcement Learning

10

Challenges of In-Switch Acceleration

Limited On-Chip Resources

No Impact on Regular Switch Functions

Scale with More Switches and Nodes

Page 45: iSwitch: Accelerating Distributed Reinforcement Learning

11

Basics of Programmable Switch

Control Plane: Forwarding Control, System Configuration

Data Plane: Packet Forwarding from the Input Port to the Output Ports

[Diagram: a packet (Header + Data) passes through the Data Plane, whose blocks are four Receivers with RxQs, the Input Arbiter, Output Port Lookup, Packet Process, and four TxQs with Transmitters.]

Page 62: iSwitch: Accelerating Distributed Reinforcement Learning

12

Integrating Aggregation Accelerator into the Programmable Switch

[Diagram: the Data Plane blocks (Receivers with RxQs, Input Arbiter, Output Port Lookup, Packet Process, TxQs with Transmitters) are labeled the Core of Regular Functions. An Accelerator is added next to them. At the Input Arbiter, packets with a Regular Header follow the regular path (Regular Traffic), while packets with a Gradient Header go to the Accelerator (Gradient Traffic).]

Hardware Acceleration Isolated From Regular Switch Function
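
The split between regular and gradient traffic can be pictured as a single check on the packet header. The sketch below is illustrative only: it assumes gradient packets are tagged through the IP Type-of-Service field (which the protocol slide later in the deck highlights), and `GRADIENT_TOS`, the dict-based packet, and the two lists standing in for hardware paths are all hypothetical.

```python
GRADIENT_TOS = 0x10  # hypothetical tag value; the slides do not give the actual one


def input_arbiter(packet, regular_path, accelerator):
    """Steer a parsed packet: gradient traffic to the accelerator, the rest unchanged."""
    if packet.get("tos") == GRADIENT_TOS:
        accelerator.append(packet)   # Gradient Traffic
    else:
        regular_path.append(packet)  # Regular Traffic


regular_path, accelerator = [], []
input_arbiter({"tos": 0x00, "payload": b"http"}, regular_path, accelerator)
input_arbiter({"tos": GRADIENT_TOS, "payload": b"grad"}, regular_path, accelerator)
```

Because untagged packets never touch the accelerator in this arrangement, the regular forwarding path behaves as it would without the extension.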

Page 83: iSwitch: Accelerating Distributed Reinforcement Learning

13

Developing Light-Weight Accelerator for Aggregation

Gradient Vector: Seg 0, Seg 1, …, Seg i, …, Seg N. Packet Pkt i carries segment Seg i.

[Diagram: In-Switch Accelerator. Components shown: Separator (splits Pkt i into Header and Payload), Parser (reads the Seg Idx from the Header), Slicer (cuts the Payload into Elements), Buffer Module, Counter Module (compared against a Threshold), and Output Module (emits the aggregated Pkt i).]

Accelerator Resource Consumption: extra 18.6% of LUT, 17.3% of FF, and 17 DSP
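
In software terms, the Buffer Module, Counter Module, Threshold, and Output Module could behave roughly like the class below. This is a behavioural sketch under assumptions, not the hardware design: the dictionary-based buffer, the method names, and the rule "emit once exactly `threshold` packets of a segment have arrived" are illustrative choices.

```python
class AggregationAccelerator:
    """Behavioural model of per-segment, packet-level gradient aggregation."""

    def __init__(self, threshold):
        self.threshold = threshold  # Threshold: packets to collect per segment
        self.buffer = {}            # Buffer Module: seg_idx -> running sums
        self.counter = {}           # Counter Module: seg_idx -> packets seen

    def on_packet(self, seg_idx, elements):
        """Accumulate one packet; return the aggregated segment once it is complete."""
        if seg_idx not in self.buffer:
            self.buffer[seg_idx] = list(elements)
            self.counter[seg_idx] = 1
        else:
            acc = self.buffer[seg_idx]
            for k, value in enumerate(elements):
                acc[k] += value
            self.counter[seg_idx] += 1
        if self.counter[seg_idx] == self.threshold:
            # Output Module: release the sum and free the buffer entry.
            del self.counter[seg_idx]
            return seg_idx, self.buffer.pop(seg_idx)
        return None


acc = AggregationAccelerator(threshold=3)
outputs = [
    acc.on_packet(0, [1, 2]),
    acc.on_packet(1, [5, 5]),
    acc.on_packet(0, [10, 20]),
    acc.on_packet(0, [100, 200]),  # third packet of segment 0 completes it
]
```

Segments complete independently of one another, so a finished segment can leave the switch while packets of other segments are still arriving.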

Page 106: iSwitch: Accelerating Distributed Reinforcement Learning

14

Aggregating Gradient at Packet-Level for Improved Parallelism

[Diagram: timelines comparing Conventional Vector-Level Aggregation with Packet-Level Aggregation in Our iSwitch, each ending in Sum and Result.]

Further Reduce Aggregation Time

15

Extending Network Protocol for In-Switch Computing

Regular Packet:

ETH IP UDP Application Data

Data Packet of iSwitch:

ETH IP UDP Application Data

Type-of-Service Field

Seg Gradient

Control Packet of iSwitch:

ETH IP UDP Application Data

Action Value (optional)

Action    Description
Join      Join the training job
Leave     Leave the training job
Reset     Clear the accelerator on the switch
SetH      Set aggregation threshold H on switch
FBcast    Force broadcast a segment on switch
Help      Request a lost data packet for a worker
Ack       Confirm the success of some actions

iSwitch extension will NOT affect regular network functions
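The payload layouts named on this slide (Seg plus Gradient for data packets, Action plus an optional Value for control packets) can be sketched as byte encodings. Everything numeric below is an assumption for illustration: the slide does not give field widths, byte order, or action codes, so the 32-bit segment index, float32 gradient values, 1-byte action code, and 32-bit value are hypothetical choices. Only the application payload is modeled, not the ETH/IP/UDP headers or the Type-of-Service tagging.

```python
import struct
from enum import IntEnum


class Action(IntEnum):
    """Control actions from the slide; the numeric codes are hypothetical."""
    JOIN = 0
    LEAVE = 1
    RESET = 2
    SETH = 3
    FBCAST = 4
    HELP = 5
    ACK = 6


def encode_data_payload(seg, gradient):
    """Pack a segment index (assumed 32-bit) followed by float32 gradient values."""
    return struct.pack(f"!I{len(gradient)}f", seg, *gradient)


def decode_data_payload(payload):
    """Inverse of encode_data_payload: returns (segment index, list of floats)."""
    count = (len(payload) - 4) // 4
    seg, *values = struct.unpack(f"!I{count}f", payload)
    return seg, values


def encode_control_payload(action, value=None):
    """Pack an action code (assumed 1 byte) and an optional 32-bit value."""
    if value is None:
        return struct.pack("!B", action)
    return struct.pack("!BI", action, value)


def decode_control_payload(payload):
    """Inverse of encode_control_payload: returns (Action, value or None)."""
    if len(payload) == 1:
        (code,) = struct.unpack("!B", payload)
        return Action(code), None
    code, value = struct.unpack("!BI", payload)
    return Action(code), value


# Example: a SetH control message carrying a threshold of 4 workers,
# and a data packet for segment 7 with three gradient values.
seth = encode_control_payload(Action.SETH, 4)
data = encode_data_payload(7, [0.5, -1.25, 2.0])
```

Under these assumed widths a control payload with a value is 5 bytes, and a data payload is 4 bytes of segment index plus 4 bytes per gradient value.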

16

Supporting Different (Sync & Async) Training Execution Modes

Synchronous Distributed Training

In-Switch Acceleration Directly Applies

Programmable Switch

Aggregation Accelerator

Asynchronous Distributed Training

Keep Computing

Keep Aggregating

Programmable Switch

Aggregation Accelerator

HW/Algo Co-Design For Improved Parallelism
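One way to read "Keep Computing" and "Keep Aggregating" is that worker computation and in-switch aggregation overlap instead of alternating. The arithmetic below is a toy two-stage pipeline model under that assumption. The stage times are made-up numbers, not measurements from the talk, and the model ignores effects such as gradient staleness.

```python
def alternating_time(iterations, compute, aggregate):
    """Each iteration waits for aggregation before the next computation starts."""
    return iterations * (compute + aggregate)


def overlapped_time(iterations, compute, aggregate):
    """Two-stage pipeline: aggregation of iteration i overlaps computation of i + 1.

    The first iteration pays both stages; every later iteration costs only
    the slower of the two stages.
    """
    return compute + aggregate + (iterations - 1) * max(compute, aggregate)


# Hypothetical stage times in arbitrary units.
serial = alternating_time(iterations=10, compute=3, aggregate=2)
pipelined = overlapped_time(iterations=10, compute=3, aggregate=2)
```

With these numbers the alternating schedule takes 50 units and the overlapped one 32, so the benefit of overlap grows as the two stage times become more balanced.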

17

Scaling In-Switch Computing in Rack-Scale Data Centers

The Typical Network Architecture at Data Center

Racks of Servers

Top-of-Rack Switches

“Aggregate” Switches

Core Switches

The Hierarchical Aggregation of iSwitch

[Animation: Grad Pkt labels move from the servers up through the switch levels, with fewer packets at each level, then reappear at all servers]

No Additional Cost or Topology Change for Scaling In-Switch Computing
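The hierarchical aggregation on this slide can be sketched as a recursive sum over a switch tree: each switch adds up what arrives from below and forwards a single result upward. The topology below (8 servers, 4 top-of-rack switches, 2 "aggregate" switches, 1 core switch) and the data structures are hypothetical, chosen only to keep the example small.

```python
def aggregate(vectors):
    """Element-wise sum of equally sized gradient vectors."""
    return [sum(values) for values in zip(*vectors)]


def hierarchical_aggregate(node):
    """Aggregate a tree bottom-up.

    Leaves are gradient lists (servers); inner nodes are dicts with a
    'children' list (switches). Each switch sums its children's results.
    """
    if isinstance(node, list):
        return node
    return aggregate([hierarchical_aggregate(child) for child in node["children"]])


# Hypothetical topology: 8 servers -> 4 ToR -> 2 "aggregate" -> 1 core switch.
servers = [[float(i), float(i) * 2] for i in range(1, 9)]
tors = [{"children": servers[i:i + 2]} for i in range(0, 8, 2)]
aggs = [{"children": tors[i:i + 2]} for i in range(0, 4, 2)]
core = {"children": aggs}

global_sum = hierarchical_aggregate(core)
```

Each level forwards one aggregated vector per switch, so the core switch receives two inputs in this example instead of eight, and the result matches a flat sum over all servers.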

18

In-Switch Computing Implementation

RL Training Benchmarks

NetFPGA-SUME Board

GPU Cluster

DQN A2C PPO DDPG

19

Reducing the End-to-End Training Time with iSwitch

[Figure: Average Episode Reward (axis -25 to 25) vs. Training Time (min) of DQN (axis 0 to 2000), for Parameter Server (PS), AllReduce (AR), and iSwitch (iSW)]

3.7x Speedup

1.9x

20

Performance Breakdown for Each Training Iteration

[Figure: Training Time (Norm), axis 0 to 1.2, for PS, AR, and iSW on DQN, A2C, PPO, and DDPG. Legend: Agent Action, Environment, Buffer Sampling, Memory Alloc, Forward Pass, Backward Pass, GPU Copy, Grad Aggregation, Weight Update, Others]

21

Improved Training Scalability with In-Switch Computing

[Figure, two panels: Speedup (axis 1 to 3) vs. Number of Worker Nodes (4, 6, 9, 12). Synchronous Training of PPO: PS, AR, iSW, Ideal. Asynchronous Training of PPO: PS, iSW, Ideal]

Close-to-Linear Speedup for Both Training Modes

22

In-Switch Computing Summary

Programmable Switch

Aggregation Accelerator

+ + + =

3.7x Speedup for Both Sync/Async Training

Scales at Rack-Scale Clusters

Thanks!

Jian Huang

Youjie Li

Iou-Jen Liu Yifan Yuan

Deming Chen Alexander Schwing

University of Illinois at Urbana-Champaign