
Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization

Farzin Haddadpour

Joint work with Viveck Cadambe, Mehrdad Mahdavi, and Mohammad Mahdi Kamani

Goal: Solving

$$\min_x f(x) \triangleq \sum_i f_i(x)$$

SGD:

$$x^{(t+1)} = x^{(t)} - \eta\,\frac{1}{|\xi^{(t)}|}\,\nabla f\big(x^{(t)}; \xi^{(t)}\big)$$

Parallelization due to computational cost. Distributed SGD:

$$x^{(t+1)} = x^{(t)} - \frac{\eta}{p}\sum_{j=1}^{p}\frac{1}{|\xi_j^{(t)}|}\,\nabla f\big(x^{(t)}; \xi_j^{(t)}\big)$$

Communication is the bottleneck.
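A minimal NumPy sketch of the synchronous distributed-SGD update above, on a toy least-squares problem. The problem setup and all names (`worker_grad`, `A_chunks`, ...) are illustrative assumptions, not from the talk.

```python
# Toy synchronous distributed SGD: p workers compute mini-batch gradients at the
# shared iterate, the server averages them. Setup and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
p, d, n_per_worker = 3, 10, 200          # workers, dimension, samples per worker
A_chunks = [rng.normal(size=(n_per_worker, d)) for _ in range(p)]
b_chunks = [A @ rng.normal(size=d) + 0.1 * rng.normal(size=n_per_worker)
            for A in A_chunks]           # f_j(x) = (1/2n) * ||A_j x - b_j||^2

def worker_grad(j, x, batch_size=32):
    """Stochastic gradient of f_j at x over a sampled mini-batch xi_j."""
    idx = rng.choice(n_per_worker, size=batch_size, replace=False)
    A, b = A_chunks[j][idx], b_chunks[j][idx]
    return A.T @ (A @ x - b) / batch_size

x, eta = np.zeros(d), 0.05
for t in range(500):
    grads = [worker_grad(j, x) for j in range(p)]   # one communication round per step
    x = x - eta * np.mean(grads, axis=0)            # x <- x - (eta/p) * sum_j g_j
```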

Communication cost has two components:

- Number of bits per iteration → gradient compression based techniques
- Number of rounds → local SGD with periodic averaging

Local SGD with periodic averaging

$$x_j^{(t+1)} =
\begin{cases}
\dfrac{1}{p}\displaystyle\sum_{j=1}^{p}\Big[x_j^{(t)} - \eta\,\tilde{g}_j^{(t)}\Big] & \text{if } \tau \mid t \quad \text{(a) averaging step}\\[6pt]
x_j^{(t)} - \eta\,\tilde{g}_j^{(t)} & \text{otherwise} \quad \text{(b) local update}
\end{cases}$$

[Figure: workers W1, W2, W3 alternating local updates (b) and averaging steps (a); shown for p = 3 with τ = 1 (averaging after every local step) and p = 3 with τ = 3 (averaging after every third local step).]
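A minimal sketch of the update rule (a)/(b) above, assuming a stochastic-gradient oracle like the toy `worker_grad` from the previous sketch; `tau` controls how many local steps separate two averaging (communication) rounds.

```python
# Local SGD with periodic averaging: each worker keeps its own iterate, takes
# local SGD steps (b), and every tau-th iteration all iterates are averaged (a).
import numpy as np

def local_sgd(worker_grad, p, d, T=600, tau=3, eta=0.05):
    x_local = [np.zeros(d) for _ in range(p)]
    for t in range(1, T + 1):
        # (b) local update on every worker
        x_local = [x_local[j] - eta * worker_grad(j, x_local[j]) for j in range(p)]
        # (a) averaging step whenever tau divides t (one communication round)
        if t % tau == 0:
            x_bar = np.mean(x_local, axis=0)
            x_local = [x_bar.copy() for _ in range(p)]
    return np.mean(x_local, axis=0)
```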

Convergence Analysis of Local SGD with periodic averaging

Table 1: Comparison of different SGD-based algorithms.

| Strategy | Convergence error | Assumptions | Comm. rounds ($T/\tau$) |
|---|---|---|---|
| SGD | $O(1/\sqrt{pT})$ | i.i.d. & b.g. | $T$ |
| [Yu et al.] | $O(1/\sqrt{pT})$ | i.i.d. & b.g. | $O(p^{3/4}T^{3/4})$ |
| [Wang & Joshi] | $O(1/\sqrt{pT})$ | i.i.d. | $O(p^{3/2}T^{1/2})$ |
| RI-SGD $(\tau, q)$ | $O(1/\sqrt{pT}) + O((1 - q/p)\Delta)$ | non-i.i.d. & b.d. | $O(p^{3/2}T^{1/2})$ |

b.g.: bounded gradient, $\|g_i\|_2^2 \le G$. These analyses also assume unbiased gradient estimation, $\mathbb{E}[\tilde{g}_j] = g_j$.

Insufficiency of the convergence analysis:

A. Residual error is observed in practice, but a theoretical understanding is missing; in particular, unbiased gradient estimation does not hold (see the note below).
B. How can we capture this in the convergence analysis? Our work: an analysis based on biased gradients.
C. Is there any solution to improve it? Our work: redundancy.
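Why the unbiasedness assumption breaks (a hedged reading; the slide states the fact without the derivation): with non-i.i.d. data, worker $j$ samples mini-batches only from its own shard, so its stochastic gradient is unbiased for the local objective $f_j$ but not for the global objective $f$:

$$\mathbb{E}\big[\tilde{g}_j^{(t)}\big] = \nabla f_j\big(x_j^{(t)}\big) \neq \nabla f\big(x_j^{(t)}\big) = \sum_i \nabla f_i\big(x_j^{(t)}\big) \quad \text{in general.}$$

Between averaging steps the local iterates therefore drift under these biased directions, which is the residual error the analysis has to capture.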

Our work

Redundancy-infused local SGD (RI-SGD)

$$D = D_1 \cup D_2 \cup D_3$$

[Figure: data placement for p = 3, τ = 3. Local SGD: each worker Wj trains only on its own chunk Dj. RI-SGD with q = 2: each worker holds two chunks (explicit redundancy), e.g. W1: D1, D2; W2: D2, D3; W3: D3, D1.]
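A minimal sketch of RI-SGD's data placement and update, assuming the circular chunk assignment suggested by the figure (worker j holds chunks j, ..., j+q-1 mod p); `chunk_grad`, `assigned_chunks`, and the other names are illustrative, not the paper's API.

```python
# RI-SGD sketch: each worker holds q consecutive data chunks (explicit redundancy),
# runs tau local SGD steps on the union of its chunks, then all workers average.
import numpy as np

def assigned_chunks(j, p, q):
    """Chunk indices held by worker j under a circular assignment."""
    return [(j + r) % p for r in range(q)]

def ri_sgd(chunk_grad, p, q, d, T=600, tau=3, eta=0.05):
    """chunk_grad(i, x) returns a stochastic gradient of f_i at x."""
    x_local = [np.zeros(d) for _ in range(p)]
    for t in range(1, T + 1):
        # local update: stochastic gradient averaged over the q assigned chunks
        x_local = [
            x_local[j] - eta * np.mean([chunk_grad(i, x_local[j])
                                        for i in assigned_chunks(j, p, q)], axis=0)
            for j in range(p)
        ]
        if t % tau == 0:                  # periodic averaging, as in local SGD
            x_bar = np.mean(x_local, axis=0)
            x_local = [x_bar.copy() for _ in range(p)]
    return np.mean(x_local, axis=0)

# Per Table 1, the residual term O((1 - q/p) * Delta) vanishes at q = p;
# q = 1 recovers plain local SGD on disjoint shards.
```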

Comparing RI-SGD with other schemes (see Table 1 above)

Assumption b.d.: bounded inner product of gradients, $\langle g_i, g_j \rangle \le \Delta$.
q: number of data chunks at each worker node.

RI-SGD matches the $O(p^{3/2}T^{1/2})$ communication rounds of [Wang & Joshi] while handling non-i.i.d. data and biased gradients; the residual error term $O((1 - q/p)\Delta)$ shrinks as the redundancy $q/p$ grows.

Advantages of RI-SGD:

1. Speedup not only due to the larger effective mini-batch size, but also due to increased intra-gradient diversity.
2. Fault tolerance.
3. Extension to heterogeneous mini-batch sizes and possible application to federated optimization.

Faster convergence: experiments on ImageNet (top figures) and CIFAR-100 (bottom figures).

Increasing intra-gradient diversity: experiments on CIFAR-10.

Fault tolerance: experiments on CIFAR-10.

For more details, please come to my poster session: Wed, Jun 12th, 6:30-9:00 PM @ Pacific Ballroom #185.

Thanks for your attention!
