
Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization

Farzin Haddadpour

Joint work with Viveck Cadambe, Mehrdad Mahdavi, and Mohammad Mahdi Kamani

Goal: Solving

$$\min_x f(x) \triangleq \sum_i f_i(x)$$

SGD:

$$x^{(t+1)} = x^{(t)} - \eta\,\frac{1}{|\xi^{(t)}|}\,\nabla f\big(x^{(t)}; \xi^{(t)}\big)$$

Parallelization due to computational cost. Distributed SGD:

$$x^{(t+1)} = x^{(t)} - \frac{\eta}{p}\sum_{j=1}^{p}\frac{1}{|\xi_j^{(t)}|}\,\nabla f\big(x^{(t)}; \xi_j^{(t)}\big)$$

Communication is the bottleneck.
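A minimal NumPy sketch of the synchronous distributed-SGD update above, on a toy least-squares problem. The problem setup and all names (`worker_grad`, `A_chunks`, ...) are illustrative assumptions, not from the talk.

```python
# Toy synchronous distributed SGD: p workers compute mini-batch gradients at the
# shared iterate, the server averages them. Setup and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
p, d, n_per_worker = 3, 10, 200          # workers, dimension, samples per worker
A_chunks = [rng.normal(size=(n_per_worker, d)) for _ in range(p)]
b_chunks = [A @ rng.normal(size=d) + 0.1 * rng.normal(size=n_per_worker)
            for A in A_chunks]           # f_j(x) = (1/2n) * ||A_j x - b_j||^2

def worker_grad(j, x, batch_size=32):
    """Stochastic gradient of f_j at x over a sampled mini-batch xi_j."""
    idx = rng.choice(n_per_worker, size=batch_size, replace=False)
    A, b = A_chunks[j][idx], b_chunks[j][idx]
    return A.T @ (A @ x - b) / batch_size

x, eta = np.zeros(d), 0.05
for t in range(500):
    grads = [worker_grad(j, x) for j in range(p)]   # one communication round per step
    x = x - eta * np.mean(grads, axis=0)            # x <- x - (eta/p) * sum_j g_j
```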

Communication cost has two components:

- Number of bits per iteration → gradient compression based techniques
- Number of rounds → local SGD with periodic averaging

Local SGD with periodic averaging

$$x_j^{(t+1)} =
\begin{cases}
\dfrac{1}{p}\displaystyle\sum_{j=1}^{p}\Big[x_j^{(t)} - \eta\,\tilde{g}_j^{(t)}\Big] & \text{if } \tau \mid t \quad \text{(a) averaging step}\\[6pt]
x_j^{(t)} - \eta\,\tilde{g}_j^{(t)} & \text{otherwise} \quad \text{(b) local update}
\end{cases}$$

[Figure: workers W1, W2, W3 alternating local updates (b) and averaging steps (a); shown for p = 3 with τ = 1 (averaging after every local step) and p = 3 with τ = 3 (averaging after every third local step).]
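A minimal sketch of the update rule (a)/(b) above, assuming a stochastic-gradient oracle like the toy `worker_grad` from the previous sketch; `tau` controls how many local steps separate two averaging (communication) rounds.

```python
# Local SGD with periodic averaging: each worker keeps its own iterate, takes
# local SGD steps (b), and every tau-th iteration all iterates are averaged (a).
import numpy as np

def local_sgd(worker_grad, p, d, T=600, tau=3, eta=0.05):
    x_local = [np.zeros(d) for _ in range(p)]
    for t in range(1, T + 1):
        # (b) local update on every worker
        x_local = [x_local[j] - eta * worker_grad(j, x_local[j]) for j in range(p)]
        # (a) averaging step whenever tau divides t (one communication round)
        if t % tau == 0:
            x_bar = np.mean(x_local, axis=0)
            x_local = [x_bar.copy() for _ in range(p)]
    return np.mean(x_local, axis=0)
```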

Convergence Analysis of Local SGD with periodic averaging

Table 1: Comparison of different SGD-based algorithms.

| Strategy | Convergence error | Assumptions | Comm. rounds ($T/\tau$) |
|---|---|---|---|
| SGD | $O(1/\sqrt{pT})$ | i.i.d. & b.g. | $T$ |
| [Yu et al.] | $O(1/\sqrt{pT})$ | i.i.d. & b.g. | $O(p^{3/4}T^{3/4})$ |
| [Wang & Joshi] | $O(1/\sqrt{pT})$ | i.i.d. | $O(p^{3/2}T^{1/2})$ |
| RI-SGD $(\tau, q)$ | $O(1/\sqrt{pT}) + O((1 - q/p)\Delta)$ | non-i.i.d. & b.d. | $O(p^{3/2}T^{1/2})$ |

b.g.: bounded gradient, $\|g_i\|_2^2 \le G$. These analyses also assume unbiased gradient estimation, $\mathbb{E}[\tilde{g}_j] = g_j$.

Insufficiency of the convergence analysis:

A. Residual error is observed in practice, but a theoretical understanding is missing; in particular, unbiased gradient estimation does not hold (see the note below).
B. How can we capture this in the convergence analysis? Our work: an analysis based on biased gradients.
C. Is there any solution to improve it? Our work: redundancy.
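Why the unbiasedness assumption breaks (a hedged reading; the slide states the fact without the derivation): with non-i.i.d. data, worker $j$ samples mini-batches only from its own shard, so its stochastic gradient is unbiased for the local objective $f_j$ but not for the global objective $f$:

$$\mathbb{E}\big[\tilde{g}_j^{(t)}\big] = \nabla f_j\big(x_j^{(t)}\big) \neq \nabla f\big(x_j^{(t)}\big) = \sum_i \nabla f_i\big(x_j^{(t)}\big) \quad \text{in general.}$$

Between averaging steps the local iterates therefore drift under these biased directions, which is the residual error the analysis has to capture.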

Our work

Redundancy-infused local SGD (RI-SGD)

$$D = D_1 \cup D_2 \cup D_3$$

[Figure: data placement for p = 3, τ = 3. Local SGD: each worker Wj trains only on its own chunk Dj. RI-SGD with q = 2: each worker holds two chunks (explicit redundancy), e.g. W1: D1, D2; W2: D2, D3; W3: D3, D1.]
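A minimal sketch of RI-SGD's data placement and update, assuming the circular chunk assignment suggested by the figure (worker j holds chunks j, ..., j+q-1 mod p); `chunk_grad`, `assigned_chunks`, and the other names are illustrative, not the paper's API.

```python
# RI-SGD sketch: each worker holds q consecutive data chunks (explicit redundancy),
# runs tau local SGD steps on the union of its chunks, then all workers average.
import numpy as np

def assigned_chunks(j, p, q):
    """Chunk indices held by worker j under a circular assignment."""
    return [(j + r) % p for r in range(q)]

def ri_sgd(chunk_grad, p, q, d, T=600, tau=3, eta=0.05):
    """chunk_grad(i, x) returns a stochastic gradient of f_i at x."""
    x_local = [np.zeros(d) for _ in range(p)]
    for t in range(1, T + 1):
        # local update: stochastic gradient averaged over the q assigned chunks
        x_local = [
            x_local[j] - eta * np.mean([chunk_grad(i, x_local[j])
                                        for i in assigned_chunks(j, p, q)], axis=0)
            for j in range(p)
        ]
        if t % tau == 0:                  # periodic averaging, as in local SGD
            x_bar = np.mean(x_local, axis=0)
            x_local = [x_bar.copy() for _ in range(p)]
    return np.mean(x_local, axis=0)

# Per Table 1, the residual term O((1 - q/p) * Delta) vanishes at q = p;
# q = 1 recovers plain local SGD on disjoint shards.
```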

Comparing RI-SGD with other schemes (see Table 1 above)

Assumption b.d.: bounded inner product of gradients, $\langle g_i, g_j \rangle \le \Delta$.
q: number of data chunks at each worker node.

RI-SGD matches the $O(p^{3/2}T^{1/2})$ communication rounds of [Wang & Joshi] while handling non-i.i.d. data and biased gradients; the residual error term $O((1 - q/p)\Delta)$ shrinks as the redundancy $q/p$ grows.

Advantages of RI-SGD:

1. Speedup not only due to the larger effective mini-batch size, but also due to increased intra-gradient diversity.
2. Fault tolerance.
3. Extension to heterogeneous mini-batch sizes and possible application to federated optimization.

Faster convergence: experiments on ImageNet (top figures) and CIFAR-100 (bottom figures).

Increasing intra-gradient diversity: experiments on CIFAR-10.

Fault tolerance: experiments on CIFAR-10.

For more details, please come to my poster session: Wed, Jun 12th, 6:30-9:00 PM @ Pacific Ballroom #185.

Thanks for your attention!
