Dynamic Parameter Allocation in Parameter Servers
Alexander Renz-Wieland (1), Rainer Gemulla (2), Steffen Zeuch (1,3), Volker Markl (1,3)
(1) TU Berlin, (2) University of Mannheim, (3) DFKI
VLDB 2020



  • Takeaways
    - Key challenge in distributed Machine Learning (ML): communication overhead
    - Parameter Servers (PSs)
      - Intuitive
      - Limited support for common techniques to reduce overhead
    - How to improve support? Dynamic parameter allocation
    - Is this support beneficial? Up to two orders of magnitude faster

    2 / 11

  • Background: Distributed Machine Learning
    - Distributed training is a necessity for large-scale ML tasks
    - Parameter management is a key concern
    - Parameter servers (PSs) are widely used (a sketch of the push/pull interface follows this slide)

    [Figure: logical view, workers access a single logical parameter server via push() and pull(); physical view, the parameters are partitioned across nodes and stored alongside the workers]

    3 / 11
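    A minimal sketch of the classic push/pull interface described above (illustrative Python; the class and key names are made up for illustration and are not the PS-Lite API). Keys are assigned to server shards by a static hash, so an access is remote whenever the owning shard lives on another node:

      # Illustrative sketch of a classic parameter server (not the actual PS-Lite API).
      # Keys are partitioned statically across server shards by a hash function.

      class ServerShard:
          def __init__(self):
              self.store = {}                       # key -> parameter value

          def pull(self, key):
              return self.store.get(key, 0.0)

          def push(self, key, update):              # updates are applied additively
              self.store[key] = self.store.get(key, 0.0) + update


      class ClassicPS:
          def __init__(self, num_shards):
              self.shards = [ServerShard() for _ in range(num_shards)]

          def _owner(self, key):                    # static allocation: the owner never changes
              return self.shards[hash(key) % len(self.shards)]

          def pull(self, key):
              return self._owner(key).pull(key)     # remote call if the owner is another node

          def push(self, key, update):
              self._owner(key).push(key, update)


      # Usage: a worker reads a parameter, computes an update, and pushes it back.
      ps = ClassicPS(num_shards=4)
      w = ps.pull("embedding[42]")
      ps.push("embedding[42]", -0.01 * 2.0)         # e.g., a gradient step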

  • Problem: Communication Overhead
    - Communication overhead can limit scalability
    - Performance can fall behind a single node

    Training knowledge graph embeddings (RESCAL, dimension 100):

    [Plot: epoch run time in minutes vs. parallelism (nodes x threads: 1x4, 2x4, 4x4, 8x4). Classic PS (PS-Lite): 4.5h, 4h, 2.4h, 1.5h per epoch. Classic PS with fast local access: 1.2h. Dynamic Allocation PS (Lapse), incl. fast local access: 0.6h, 0.4h, 0.2h.]

    4 / 11

  • How to reduce communication overhead?
    - Common techniques to reduce overhead:
      - Data clustering
      - Parameter blocking
      - Latency hiding (see the sketch after this slide)

    [Figure: for each technique, how worker 1 and worker 2 access data and parameters]

    - Key is to avoid remote accesses
    - Do PSs support these techniques?
      - Techniques require local access at different nodes over time
      - But PSs allocate parameters statically

    5 / 11
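    A minimal sketch of the latency-hiding idea (illustrative Python; it assumes the hypothetical ClassicPS object from the earlier sketch and a keys_for() helper, neither of which is from the original slides). The worker pulls the parameters for the next batch in the background while it computes on the current one, so remote-access latency overlaps with computation:

      # Illustrative sketch of latency hiding via asynchronous prefetching.
      # Assumes a `ps` object with pull()/push() as in the earlier sketch.

      from concurrent.futures import ThreadPoolExecutor

      def keys_for(batch):
          # Hypothetical helper: which parameter keys a batch touches.
          return [f"embedding[{x}]" for x in batch]

      def train(ps, batches, learning_rate=0.01):
          pool = ThreadPoolExecutor(max_workers=1)

          def prefetch(batch):
              # Pull all parameters the batch needs; runs in the background.
              return {k: ps.pull(k) for k in keys_for(batch)}

          pending = pool.submit(prefetch, batches[0])
          for i, batch in enumerate(batches):
              params = pending.result()                           # wait for prefetched parameters
              if i + 1 < len(batches):
                  pending = pool.submit(prefetch, batches[i + 1])  # overlap with computation
              for key in keys_for(batch):
                  gradient = 2.0 * params[key]                    # placeholder gradient computation
                  ps.push(key, -learning_rate * gradient)
          pool.shutdown()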

  • Dynamic Parameter Allocation
    - What if the PS could allocate parameters dynamically?

      Localize(parameters)

    - Would provide support for
      - Data clustering ✓
      - Parameter blocking ✓
      - Latency hiding ✓
    - We call this dynamic parameter allocation (a sketch follows this slide)

    6 / 11
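    A hedged sketch of what dynamic parameter allocation adds on top of the classic interface: a localize() primitive that relocates the named parameters to the calling node so that subsequent accesses from that node are local (illustrative Python; the names and the single-process data structures are simplifications for illustration, not Lapse's implementation):

      # Illustrative sketch of dynamic parameter allocation (simplified, not Lapse's protocol).
      # An ownership table maps each key to the node that currently holds it; localize()
      # moves the parameter value and updates the ownership table.

      class DynamicPS:
          def __init__(self, num_nodes):
              self.stores = [{} for _ in range(num_nodes)]   # per-node parameter storage
              self.owner = {}                                # key -> node id that holds the key

          def _owner(self, key):
              return self.owner.setdefault(key, hash(key) % len(self.stores))

          def pull(self, key, node):
              o = self._owner(key)
              return self.stores[o].get(key, 0.0)            # local if o == node, remote otherwise

          def push(self, key, update, node):
              o = self._owner(key)
              self.stores[o][key] = self.stores[o].get(key, 0.0) + update

          def localize(self, keys, node):
              # Relocate parameters to `node` so later pulls/pushes from `node` are local.
              for key in keys:
                  o = self._owner(key)
                  if o != node:
                      self.stores[node][key] = self.stores[o].pop(key, 0.0)
                      self.owner[key] = node

    A worker would call localize() for the parameters of its upcoming data partition (data clustering), its next parameter block (parameter blocking), or its next batch (latency hiding), and then operate on them at local latency.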

  • The Lapse Parameter Server
    - Features
      - Dynamic allocation
      - Location transparency
      - Retains sequential consistency

    PS per-key consistency guarantees (for synchronous operations):

                        Classic   Lapse   Stale
      Eventual            ✓         ✓       ✓
      PRAM                ✓         ✓       ✓
      Causal              ✓         ✓       ✗
      Sequential          ✓         ✓       ✗
      Serializability     ✗         ✗       ✗

    - Many system challenges (see paper)
      - Manage parameter locations
      - Route parameter accesses to current location
      - Relocate parameters
      - Handle reads and writes during relocations
      - All while maintaining sequential consistency
      (a simplified sketch of location management follows this slide)

    7 / 11
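    One simplified way to route accesses to relocatable parameters and keep reads and writes well ordered during a relocation is to guard each key's location and value with a per-key lock, so that an access completes either entirely at the old owner before the move or entirely at the new owner after it (a sketch under that assumption; the paper describes the mechanisms Lapse actually uses):

      # Simplified sketch of routing and relocation with a per-key lock.
      # This is an illustration, not Lapse's actual protocol.

      import threading

      class LocationTable:
          def __init__(self, num_nodes):
              self.num_nodes = num_nodes
              self.owner = {}                        # key -> node currently holding it
              self.locks = {}                        # key -> lock guarding location and value

          def lock_for(self, key):
              return self.locks.setdefault(key, threading.Lock())

          def lookup(self, key):
              return self.owner.setdefault(key, hash(key) % self.num_nodes)


      def pull(stores, table, key):
          with table.lock_for(key):                  # the access sees a single, current owner
              return stores[table.lookup(key)].get(key, 0.0)

      def push(stores, table, key, update):
          with table.lock_for(key):
              store = stores[table.lookup(key)]
              store[key] = store.get(key, 0.0) + update

      def relocate(stores, table, key, new_owner):
          with table.lock_for(key):                  # no pull/push interleaves with the move
              old_owner = table.lookup(key)
              if old_owner != new_owner:
                  stores[new_owner][key] = stores[old_owner].pop(key, 0.0)
                  table.owner[key] = new_owner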

  • Experimental study
    Tasks: matrix factorization, knowledge graph embeddings, word vectors
    Cluster: 1–8 nodes, each with 4 worker threads, 10 GBit Ethernet

    1. Performance of Classic PSs: 2–8 nodes barely outperformed 1 node in all tested tasks
    2. Effect of dynamic parameter allocation: 4–203x faster than a Classic PS, up to linear speed-ups
    3. Comparison to bounded staleness PSs: 2–28x faster and more scalable
    4. Comparison to manual management: competitive with a specialized low-level implementation
    5. Ablation study: combining fast local access and dynamic allocation is key

    8 / 11

  • Comparison to Bounded Staleness PS
    - Matrix factorization (matrix with 1b entries, rank 100)
    - Parameter blocking (see the sketch after this slide)

    [Plot: epoch run time in minutes vs. parallelism (nodes x threads: 1x4, 2x4, 4x4, 8x4), annotated with "single-node overhead" and "non-linear scaling". Speed-up markers: 0.6x for the bounded staleness PS (Petuum) with client synchronization, 2.9x with server synchronization, 8.4x for the Dynamic Allocation PS (Lapse).]

    9 / 11
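    For the matrix factorization experiment, parameter blocking follows the idea of splitting one factor matrix into one block per worker and rotating the blocks across sub-epochs, so each worker only ever needs its current block locally. A minimal sketch of such a schedule (illustrative Python; it assumes the hypothetical DynamicPS.localize() from the earlier sketch and a user-supplied process_block function, not the exact setup from the paper):

      # Illustrative sketch of parameter blocking for matrix factorization:
      # the item factors are split into one block per worker, and blocks rotate
      # among workers across sub-epochs, so each worker only needs its current
      # block locally. Assumes the hypothetical DynamicPS from the earlier sketch.

      def run_epoch(ps, node, num_workers, num_items, process_block):
          blocks = [list(range(b, num_items, num_workers)) for b in range(num_workers)]
          for subepoch in range(num_workers):
              block = blocks[(node + subepoch) % num_workers]  # rotation schedule
              keys = [f"item_factor[{j}]" for j in block]
              ps.localize(keys, node)       # relocate the block before working on it
              process_block(block)          # SGD updates touch only local parameters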

  • Dynamic Parameter Allocation in Parameter Servers
    - Key challenge in distributed Machine Learning (ML): communication overhead
    - Parameter Servers (PSs)
      - Intuitive
      - Limited support for common techniques to reduce overhead
    - How to improve support? Dynamic parameter allocation
    - Is this support beneficial? Up to two orders of magnitude faster
    - Lapse is open source: https://github.com/alexrenz/lapse-ps

    11 / 11
