Dynamic Parameter Allocation in Parameter Servers
Alexander Renz-Wieland (1), Rainer Gemulla (2), Steffen Zeuch (1,3), Volker Markl (1,3)
(1) TU Berlin, (2) University of Mannheim, (3) DFKI
VLDB 2020



  • Takeaways
    - Key challenge in distributed Machine Learning (ML): communication overhead
    - Parameter Servers (PSs)
      - Intuitive
      - Limited support for common techniques to reduce overhead
    - How to improve support? Dynamic parameter allocation
    - Is this support beneficial? Up to two orders of magnitude faster

    2 / 11

  • Background: Distributed Machine Learning
    - Distributed training is a necessity for large-scale ML tasks
    - Parameter management is a key concern
    - Parameter servers (PSs) are widely used (a sketch of the push/pull interface follows this slide)

    [Figure: logical view, workers access a single logical parameter server via push() and pull(); physical view, the parameters are partitioned across nodes and stored alongside the workers]

    3 / 11
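    A minimal sketch of the classic push/pull interface described above (illustrative Python; the class and key names are made up for illustration and are not the PS-Lite API). Keys are assigned to server shards by a static hash, so an access is remote whenever the owning shard lives on another node:

      # Illustrative sketch of a classic parameter server (not the actual PS-Lite API).
      # Keys are partitioned statically across server shards by a hash function.

      class ServerShard:
          def __init__(self):
              self.store = {}                       # key -> parameter value

          def pull(self, key):
              return self.store.get(key, 0.0)

          def push(self, key, update):              # updates are applied additively
              self.store[key] = self.store.get(key, 0.0) + update


      class ClassicPS:
          def __init__(self, num_shards):
              self.shards = [ServerShard() for _ in range(num_shards)]

          def _owner(self, key):                    # static allocation: the owner never changes
              return self.shards[hash(key) % len(self.shards)]

          def pull(self, key):
              return self._owner(key).pull(key)     # remote call if the owner is another node

          def push(self, key, update):
              self._owner(key).push(key, update)


      # Usage: a worker reads a parameter, computes an update, and pushes it back.
      ps = ClassicPS(num_shards=4)
      w = ps.pull("embedding[42]")
      ps.push("embedding[42]", -0.01 * 2.0)         # e.g., a gradient step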

  • Problem: Communication Overhead
    - Communication overhead can limit scalability
    - Performance can fall behind a single node

    Training knowledge graph embeddings (RESCAL, dimension 100):

    [Plot: epoch run time in minutes vs. parallelism (nodes x threads: 1x4, 2x4, 4x4, 8x4). Classic PS (PS-Lite): 4.5h, 4h, 2.4h, 1.5h per epoch. Classic PS with fast local access: 1.2h. Dynamic Allocation PS (Lapse), incl. fast local access: 0.6h, 0.4h, 0.2h.]

    4 / 11

  • How to reduce communication overhead?
    - Common techniques to reduce overhead:
      - Data clustering
      - Parameter blocking
      - Latency hiding (see the sketch after this slide)

    [Figure: for each technique, how worker 1 and worker 2 access data and parameters]

    - Key is to avoid remote accesses
    - Do PSs support these techniques?
      - Techniques require local access at different nodes over time
      - But PSs allocate parameters statically

    5 / 11
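    A minimal sketch of the latency-hiding idea (illustrative Python; it assumes the hypothetical ClassicPS object from the earlier sketch and a keys_for() helper, neither of which is from the original slides). The worker pulls the parameters for the next batch in the background while it computes on the current one, so remote-access latency overlaps with computation:

      # Illustrative sketch of latency hiding via asynchronous prefetching.
      # Assumes a `ps` object with pull()/push() as in the earlier sketch.

      from concurrent.futures import ThreadPoolExecutor

      def keys_for(batch):
          # Hypothetical helper: which parameter keys a batch touches.
          return [f"embedding[{x}]" for x in batch]

      def train(ps, batches, learning_rate=0.01):
          pool = ThreadPoolExecutor(max_workers=1)

          def prefetch(batch):
              # Pull all parameters the batch needs; runs in the background.
              return {k: ps.pull(k) for k in keys_for(batch)}

          pending = pool.submit(prefetch, batches[0])
          for i, batch in enumerate(batches):
              params = pending.result()                           # wait for prefetched parameters
              if i + 1 < len(batches):
                  pending = pool.submit(prefetch, batches[i + 1])  # overlap with computation
              for key in keys_for(batch):
                  gradient = 2.0 * params[key]                    # placeholder gradient computation
                  ps.push(key, -learning_rate * gradient)
          pool.shutdown()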

  • Dynamic Parameter Allocation
    - What if the PS could allocate parameters dynamically?

      Localize(parameters)

    - Would provide support for
      - Data clustering ✓
      - Parameter blocking ✓
      - Latency hiding ✓
    - We call this dynamic parameter allocation (a sketch follows this slide)

    6 / 11
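    A hedged sketch of what dynamic parameter allocation adds on top of the classic interface: a localize() primitive that relocates the named parameters to the calling node so that subsequent accesses from that node are local (illustrative Python; the names and the single-process data structures are simplifications for illustration, not Lapse's implementation):

      # Illustrative sketch of dynamic parameter allocation (simplified, not Lapse's protocol).
      # An ownership table maps each key to the node that currently holds it; localize()
      # moves the parameter value and updates the ownership table.

      class DynamicPS:
          def __init__(self, num_nodes):
              self.stores = [{} for _ in range(num_nodes)]   # per-node parameter storage
              self.owner = {}                                # key -> node id that holds the key

          def _owner(self, key):
              return self.owner.setdefault(key, hash(key) % len(self.stores))

          def pull(self, key, node):
              o = self._owner(key)
              return self.stores[o].get(key, 0.0)            # local if o == node, remote otherwise

          def push(self, key, update, node):
              o = self._owner(key)
              self.stores[o][key] = self.stores[o].get(key, 0.0) + update

          def localize(self, keys, node):
              # Relocate parameters to `node` so later pulls/pushes from `node` are local.
              for key in keys:
                  o = self._owner(key)
                  if o != node:
                      self.stores[node][key] = self.stores[o].pop(key, 0.0)
                      self.owner[key] = node

    A worker would call localize() for the parameters of its upcoming data partition (data clustering), its next parameter block (parameter blocking), or its next batch (latency hiding), and then operate on them at local latency.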

  • The Lapse Parameter Server
    - Features
      - Dynamic allocation
      - Location transparency
      - Retains sequential consistency

    PS per-key consistency guarantees (for synchronous operations):

                        Classic   Lapse   Stale
      Eventual            ✓         ✓       ✓
      PRAM                ✓         ✓       ✓
      Causal              ✓         ✓       ✗
      Sequential          ✓         ✓       ✗
      Serializability     ✗         ✗       ✗

    - Many system challenges (see paper)
      - Manage parameter locations
      - Route parameter accesses to current location
      - Relocate parameters
      - Handle reads and writes during relocations
      - All while maintaining sequential consistency
      (a simplified sketch of location management follows this slide)

    7 / 11
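    One simplified way to route accesses to relocatable parameters and keep reads and writes well ordered during a relocation is to guard each key's location and value with a per-key lock, so that an access completes either entirely at the old owner before the move or entirely at the new owner after it (a sketch under that assumption; the paper describes the mechanisms Lapse actually uses):

      # Simplified sketch of routing and relocation with a per-key lock.
      # This is an illustration, not Lapse's actual protocol.

      import threading

      class LocationTable:
          def __init__(self, num_nodes):
              self.num_nodes = num_nodes
              self.owner = {}                        # key -> node currently holding it
              self.locks = {}                        # key -> lock guarding location and value

          def lock_for(self, key):
              return self.locks.setdefault(key, threading.Lock())

          def lookup(self, key):
              return self.owner.setdefault(key, hash(key) % self.num_nodes)


      def pull(stores, table, key):
          with table.lock_for(key):                  # the access sees a single, current owner
              return stores[table.lookup(key)].get(key, 0.0)

      def push(stores, table, key, update):
          with table.lock_for(key):
              store = stores[table.lookup(key)]
              store[key] = store.get(key, 0.0) + update

      def relocate(stores, table, key, new_owner):
          with table.lock_for(key):                  # no pull/push interleaves with the move
              old_owner = table.lookup(key)
              if old_owner != new_owner:
                  stores[new_owner][key] = stores[old_owner].pop(key, 0.0)
                  table.owner[key] = new_owner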

  • Experimental study
    Tasks: matrix factorization, knowledge graph embeddings, word vectors
    Cluster: 1–8 nodes, each with 4 worker threads, 10 GBit Ethernet

    1. Performance of Classic PSs: 2–8 nodes barely outperformed 1 node in all tested tasks
    2. Effect of dynamic parameter allocation: 4–203x faster than a Classic PS, up to linear speed-ups
    3. Comparison to bounded staleness PSs: 2–28x faster and more scalable
    4. Comparison to manual management: competitive with a specialized low-level implementation
    5. Ablation study: combining fast local access and dynamic allocation is key

    8 / 11

  • Comparison to Bounded Staleness PS
    - Matrix factorization (matrix with 1b entries, rank 100)
    - Parameter blocking (see the sketch after this slide)

    [Plot: epoch run time in minutes vs. parallelism (nodes x threads: 1x4, 2x4, 4x4, 8x4), annotated with "single-node overhead" and "non-linear scaling". Speed-up markers: 0.6x for the bounded staleness PS (Petuum) with client synchronization, 2.9x with server synchronization, 8.4x for the Dynamic Allocation PS (Lapse).]

    9 / 11
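    For the matrix factorization experiment, parameter blocking follows the idea of splitting one factor matrix into one block per worker and rotating the blocks across sub-epochs, so each worker only ever needs its current block locally. A minimal sketch of such a schedule (illustrative Python; it assumes the hypothetical DynamicPS.localize() from the earlier sketch and a user-supplied process_block function, not the exact setup from the paper):

      # Illustrative sketch of parameter blocking for matrix factorization:
      # the item factors are split into one block per worker, and blocks rotate
      # among workers across sub-epochs, so each worker only needs its current
      # block locally. Assumes the hypothetical DynamicPS from the earlier sketch.

      def run_epoch(ps, node, num_workers, num_items, process_block):
          blocks = [list(range(b, num_items, num_workers)) for b in range(num_workers)]
          for subepoch in range(num_workers):
              block = blocks[(node + subepoch) % num_workers]  # rotation schedule
              keys = [f"item_factor[{j}]" for j in block]
              ps.localize(keys, node)       # relocate the block before working on it
              process_block(block)          # SGD updates touch only local parameters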

  • Dynamic Parameter Allocation in Parameter Servers
    - Key challenge in distributed Machine Learning (ML): communication overhead
    - Parameter Servers (PSs)
      - Intuitive
      - Limited support for common techniques to reduce overhead
    - How to improve support? Dynamic parameter allocation
    - Is this support beneficial? Up to two orders of magnitude faster
    - Lapse is open source: https://github.com/alexrenz/lapse-ps

    11 / 11
