
Distributed Machine Learning

Distributed Systems Spring 2019

Lecture 22

Obligatory ML is saving/destroying the world slide

Supervised Machine Learning In One Slide

• Goal: Fit a function over some data (x_i, y_i)
• Functions are parametrized: f(w, x)
• Example: Linear function f = ax + b, with w = (a, b)
• How to find the parameters/weights that fit best?
• Loss function: L = ∑_i error(y_i, f(w, x_i)) = ∑_i |y_i − f(w, x_i)|² (evaluated in the sketch below)
• Computational goal: Find w to minimize the loss function
• Deep neural networks: millions of weights
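
As a concrete illustration of the loss above, here is a minimal sketch in Python, assuming the linear model f = ax + b and a small made-up dataset (the data values and function names are illustrative, not from the lecture):

```python
import numpy as np

def f(w, x):
    """Linear model f(w, x) = a*x + b with weights w = (a, b)."""
    a, b = w
    return a * x + b

def loss(w, xs, ys):
    """Sum of squared errors: L = sum_i |y_i - f(w, x_i)|^2."""
    return np.sum((ys - f(w, xs)) ** 2)

# Hypothetical toy data lying roughly on the line y = 2x + 1.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.1, 2.9, 5.2, 6.8])

print(loss((2.0, 1.0), xs, ys))  # good weights -> small loss
print(loss((0.0, 0.0), xs, ys))  # bad weights  -> large loss
```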

Distributed Machine Learning In One Slide

• What parallelism can we exploit?
• Basic idea: distribute the loss function computation
• Process 0 computes min_w ∑_{i=0}^{N/2} |y_i − f(w, x_i)|² and finds w1
• Process 1 computes min_w ∑_{i=N/2}^{N} |y_i − f(w, x_i)|² and finds w2
• This is data parallelism
• Each process computes over a subset of the data (i.e., a map operation)
• How to reduce, i.e., combine w1 and w2? (one simple option is sketched below)
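
A minimal data-parallel sketch of this map/reduce idea, assuming the same toy linear model: each "worker" fits its own half of the data, and averaging the two parameter vectors is used here as one simple, illustrative choice of reduce step (not the only one).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset lying roughly on y = 2x + 1.
xs = np.linspace(0.0, 10.0, 100)
ys = 2 * xs + 1 + rng.normal(scale=0.1, size=xs.shape)

# "Map": each worker fits f = ax + b on its own half of the data.
half = len(xs) // 2
w1 = np.polyfit(xs[:half], ys[:half], deg=1)  # worker 0 finds w1 = (a1, b1)
w2 = np.polyfit(xs[half:], ys[half:], deg=1)  # worker 1 finds w2 = (a2, b2)

# "Reduce": one simple choice is to average the per-worker parameters.
w = (w1 + w2) / 2
print(w1, w2, w)  # all close to [2, 1]
```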

Model Fitting With SGD

• Loss function: L = ∑_i error(y_i, f(w, x_i)) = ∑_i |y_i − f(w, x_i)|²
• Central question: How to find w?
• Most common technique: Gradient Descent
• Update rule: w_{t+1} = w_t − η ∇L(w_t)
• ∇L(w_t) is the gradient, found using partial derivatives
• Stochastic gradient descent (SGD): evaluate the gradient on only a subset of the data points
• This subset is also called a "mini-batch" (see the sketch below)
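
A minimal mini-batch SGD sketch under the same toy linear-model assumptions; the step size, batch size, and step count are arbitrary illustrative choices:

```python
import numpy as np

def sgd(xs, ys, eta=0.1, batch_size=8, steps=2000):
    """Mini-batch SGD for f(w, x) = a*x + b with squared-error loss."""
    rng = np.random.default_rng(0)
    w = np.zeros(2)  # w = (a, b)
    for _ in range(steps):
        # Stochastic part: estimate the gradient on a random mini-batch.
        idx = rng.integers(0, len(xs), size=batch_size)
        xb, yb = xs[idx], ys[idx]
        err = (w[0] * xb + w[1]) - yb
        grad = np.array([np.mean(2 * err * xb),   # dL/da
                         np.mean(2 * err)])       # dL/db
        w = w - eta * grad   # w_{t+1} = w_t - eta * grad L(w_t)
    return w

xs = np.linspace(0.0, 1.0, 200)
ys = 2 * xs + 1
print(sgd(xs, ys))  # approaches (2, 1)
```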

Parameter Server Architecture

• Workers compute gradients using SGD on small data batches
• New gradients are shared with other workers via the parameter server (a toy pull/push loop is sketched below)
• Other ways of communication: all-to-all reduce, ...
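
A toy, single-process sketch of the parameter-server pattern (the class and method names are hypothetical): workers pull the current weights, compute a mini-batch gradient, and push it back; the server applies the update.

```python
import numpy as np

class ParameterServer:
    """Toy in-process stand-in for a parameter server holding global weights."""
    def __init__(self, dim, eta=0.1):
        self.w = np.zeros(dim)
        self.eta = eta

    def pull(self):
        return self.w.copy()       # worker fetches the current weights

    def push(self, grad):
        self.w -= self.eta * grad  # server applies a worker's gradient

def worker_step(ps, xb, yb):
    """One worker iteration: pull, compute a mini-batch gradient, push."""
    w = ps.pull()
    err = (w[0] * xb + w[1]) - yb
    grad = np.array([np.mean(2 * err * xb), np.mean(2 * err)])
    ps.push(grad)

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 200)
ys = 2 * xs + 1
ps = ParameterServer(dim=2)
for _ in range(2000):
    idx = rng.integers(0, len(xs), size=8)
    worker_step(ps, xs[idx], ys[idx])
print(ps.w)  # approaches (2, 1)
```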

Data Parallel SGD Timeline

Synchronous SGD

• Synchronous SGD uses a barrier: no update is applied until every worker's gradient for the current iteration has arrived (see the sketch below).
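
A sequential sketch of one synchronous step, assuming the toy linear model from earlier; the averaging line plays the role of the barrier, since no update happens until every worker's gradient is in.

```python
import numpy as np

def sync_sgd_step(w, batches, eta=0.1):
    """One synchronous SGD step over per-worker mini-batches."""
    grads = []
    for xb, yb in batches:                 # conceptually runs in parallel
        err = (w[0] * xb + w[1]) - yb
        grads.append(np.array([np.mean(2 * err * xb), np.mean(2 * err)]))
    # Barrier: all workers' gradients must be present before this point.
    g = np.mean(grads, axis=0)
    return w - eta * g

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 200)
ys = 2 * xs + 1
w = np.zeros(2)
for _ in range(2000):
    # Each of 4 "workers" draws its own mini-batch for this iteration.
    batches = []
    for _k in range(4):
        idx = rng.integers(0, len(xs), size=8)
        batches.append((xs[idx], ys[idx]))
    w = sync_sgd_step(w, batches)
print(w)  # approaches (2, 1)
```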

HogWild!

• Asynchronous parallel SGD
• Typical way: processes compute gradients and update a shared gradient array
• The gradient array is protected using a lock
• HogWild! is a lock-free technique
• Key idea: let processes update their gradients without locking (a thread-based sketch follows)
• Works well if updates are sparse
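
A thread-based HogWild!-style sketch (helper names are made up): several threads read and update a shared weight vector with no lock at all. The updates here are dense, so this only illustrates the mechanics; the HogWild! guarantees rely on sparse updates where threads rarely touch the same coordinates.

```python
import threading
import numpy as np

xs = np.linspace(0.0, 1.0, 200)
ys = 2 * xs + 1
w = np.zeros(2)                       # shared weights, deliberately unlocked

def hogwild_worker(seed, steps=1000, eta=0.02):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = rng.integers(0, len(xs), size=8)
        err = (w[0] * xs[idx] + w[1]) - ys[idx]
        grad = np.array([np.mean(2 * err * xs[idx]), np.mean(2 * err)])
        w[:] = w - eta * grad         # racy, lock-free update (the HogWild! idea)

threads = [threading.Thread(target=hogwild_worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(w)  # typically still close to (2, 1) despite the races
```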

Consistency vs. Performance

• Synchronous SGD has synchronization overheads
• But async SGD suffers from poor statistical efficiency
• Because of updating models with stale weights
• Synchronous: per-iteration time is longer
• Async: per-iteration time is shorter, but more iterations are required

Model Parallelism

• An alternative is to replicate the data on all machines but split the model across them (see the sketch below)
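
A minimal model-parallel sketch (the network and its sizes are made up): the two layers of a tiny network live on different workers, every worker sees the same input batch, and only the intermediate activations cross the worker boundary.

```python
import numpy as np

# Model parallelism: the model is split, the data is not.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # layer 1 weights, held only on worker A
W2 = rng.normal(size=(8, 2))   # layer 2 weights, held only on worker B

def forward(x):
    h = np.maximum(0, x @ W1)  # worker A computes its layer...
    # ...then sends the activations h over the network to worker B.
    return h @ W2              # worker B finishes the forward pass

x = rng.normal(size=(16, 4))   # every worker sees the same batch
print(forward(x).shape)        # (16, 2)
```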

TensorFlow

• Computation is structured as a dataflow graph
• TF programs create graphs (à la Spark)
• The TF engine evaluates the graph
• Many inbuilt operation kernels
• Dataflow graphs can be partitioned in many ways
• Supports model and data parallelism, and hybrid schemes (a minimal graph-mode example follows)
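
A minimal dataflow-graph sketch, assuming the TensorFlow 1.x graph-mode API that was current when this lecture was given (TF 2.x defaults to eager execution instead): the program first builds the graph, then the TF engine evaluates it inside a session.

```python
import tensorflow as tf  # assumes TensorFlow 1.x (graph mode)

# Build the dataflow graph; nothing is computed yet.
x = tf.placeholder(tf.float32, shape=[None, 1], name="x")
y = tf.placeholder(tf.float32, shape=[None, 1], name="y")
w = tf.Variable(tf.zeros([1, 1]), name="w")
b = tf.Variable(tf.zeros([1]), name="b")
loss = tf.reduce_mean(tf.square(y - (tf.matmul(x, w) + b)))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# The TF engine evaluates (and could partition) the graph.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    _, l = sess.run([train_op, loss],
                    feed_dict={x: [[1.0], [2.0]], y: [[3.0], [5.0]]})
    print(l)
```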

Some Open Questions

• Sync vs. async?
• How many parameter servers vs. workers?
• Role of GPUs
• Geo-distributed training: how to minimize updates?
• Accelerating inference (evaluating f(w, x))

References

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. Tal Ben-Nun, Torsten Hoefler.

END
