Asynchronous Parallel Computing in Signal Processing and Machine Learning
Wotao Yin (UCLA Math)
joint with Zhimin Peng (UCLA), Yangyang Xu (IMA), Ming Yan (MSU)
Optimization and Parsimonious Modeling – IMA, Jan 25, 2016
1 / 49
Do we need parallel computing?
2 / 49
Back in 1993
3 / 49
2006
4 / 49
2015
5 / 49
35 Years of CPU Trend

[Figure: trends from roughly 1995 to 2015 in the number of CPUs, performance per core, and cores per CPU]

D. Henty. Emerging Architectures and Programming Models for Parallel Computing, 2012.
• In May 2004, Intel cancelled its single-core Tejas project and announced a new multi-core project.
6 / 49
Today: 4x AMD 16-core 3.5GHz CPUs (64 cores total)
7 / 49
Today: Tesla K80 GPU (2496 cores)
8 / 49
Today: Octa-Core Handsets
9 / 49
The free lunch is over

• before 2005: a single-threaded algorithm automatically got faster on each new CPU
• now, new algorithms must be developed for faster speeds by
  • exploiting problem structures
  • taking advantage of dataset properties
  • using all the available cores
10 / 49
How to use all the cores available?
11 / 49
Parallel computing
[Diagram: N agents work in parallel on pieces t_1, t_2, . . . , t_N of one problem]
12 / 49
Parallel speedup
• definition: speedup = (serial time) / (parallel time), where time is in the wall-clock sense
• Amdahl's Law: N agents, no overhead, ρ = fraction of the computation that is parallel:

  ideal speedup = 1 / (ρ/N + (1 − ρ))

[Plot: ideal speedup versus number of processors (10^0 to 10^8, log scale) for ρ = 25%, 50%, 90%, 95%; speedup saturates at 1/(1 − ρ)]
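Amdahl's formula above is a one-liner; a minimal sketch (the sample values below are illustrative, not from the slides):

```cpp
#include <cassert>
#include <cmath>

// Ideal speedup from Amdahl's law: 1 / (rho/N + (1 - rho)),
// where rho is the parallel fraction and N is the number of agents.
double ideal_speedup(double rho, double N) {
    return 1.0 / (rho / N + (1.0 - rho));
}
```

Note the saturation: even with a billion processors, ρ = 95% caps the speedup near 1/(1 − ρ) = 20.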
13 / 49
Parallel speedup
• ε := parallel overhead (startup, synchronization, collection)
• in the real world:

  actual speedup = 1 / (ρ/N + (1 − ρ) + ε)

[Plots: actual speedup versus number of processors (10^0 to 10^8, log scale) for ρ = 25%, 50%, 90%, 95%; left: when ε = N, right: when ε = log(N)]
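The overhead term ε can be added to the same one-liner; this sketch (with an assumed overhead that grows with N, as in the plots) shows how overhead eventually erases the gains:

```cpp
#include <cassert>
#include <cmath>

// Actual speedup with a parallel-overhead term eps added to the denominator:
//   1 / (rho/N + (1 - rho) + eps).
// Overhead is never free, so this is always below the ideal speedup.
double actual_speedup(double rho, double N, double eps) {
    return 1.0 / (rho / N + (1.0 - rho) + eps);
}
```

For example, with an overhead proportional to N (scale factor assumed here), using 100x more processors can *decrease* the actual speedup.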
14 / 49
Sync-parallel versus async-parallel
[Diagram: three agents. Synchronous (wait for the slowest): agents idle between rounds. Asynchronous (non-stop, no wait): agents update continuously.]
15 / 49
Async-parallel coordinate updates
16 / 49
Fixed point iteration and its parallel version
• H = H_1 × · · · × H_m
• original iteration: x^{k+1} = Tx^k =: (I − ηS)x^k
• all agents do in parallel:

  agent 1: x_1^{k+1} ← T_1(x^k) = x_1^k − ηS_1(x^k)
  agent 2: x_2^{k+1} ← T_2(x^k) = x_2^k − ηS_2(x^k)
  ...
  agent m: x_m^{k+1} ← T_m(x^k) = x_m^k − ηS_m(x^k)

• assumptions:
  1. coordinate friendliness: cost of S_i x ∼ (1/m) × cost of Sx
  2. synchronization after each iteration
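The synchronous scheme above can be sketched in a few lines. The operator here is a toy choice (T(x) = 0.5x + 1 coordinate-wise, fixed point x* = (2, . . . , 2)), not one from the slides; the point is that every agent reads the same snapshot x^k and all writes land together:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Synchronous sketch of x^{k+1} = (I - eta*S)x^k, coordinate by coordinate:
// every agent i evaluates S_i at the SAME snapshot x^k, then all updates
// are applied at once (the synchronization step).
// Toy operator: T(x) = 0.5*x + 1 per coordinate, so S(x) = x - T(x).
std::vector<double> sync_iteration(std::vector<double> x, double eta, int iters) {
    for (int k = 0; k < iters; ++k) {
        std::vector<double> xnew(x.size());
        for (size_t i = 0; i < x.size(); ++i) {      // agents work in parallel
            double Si = x[i] - (0.5 * x[i] + 1.0);   // S_i(x^k)
            xnew[i] = x[i] - eta * Si;
        }
        x = xnew;  // barrier: wait until all agents finish, then swap
    }
    return x;
}
```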
17 / 49
Comparison
Synchronous: new iteration = all agents finish

[Timeline: agents 1–3 over t_0, t_1, . . . , t_10]

Asynchronous: new iteration = any agent finishes
18 / 49
ARock1: Async-parallel coordinate update
• H = H_1 × · · · × H_m
• p agents, possibly p ≠ m
• each agent randomly picks i ∈ {1, . . . , m} and updates just x_i:

  x_1^{k+1} ← x_1^k
  ...
  x_i^{k+1} ← x_i^k − η_k S_i x^{k−d_k}
  ...
  x_m^{k+1} ← x_m^k

• 0 ≤ d_k ≤ τ, the maximum delay
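The stale read x^{k−d_k} can be simulated serially by keeping a short history of past states. This is a sketch, not the authors' code: the operator is the same toy T(x) = 0.5x + 1, and the delay d_k is drawn at random up to τ:

```cpp
#include <cassert>
#include <cmath>
#include <deque>
#include <random>
#include <vector>

// Serial SIMULATION of the ARock update: at step k, pick a random
// coordinate i and apply  x_i <- x_i - eta * S_i(x^{k - d_k}),
// where x^{k-d_k} is a stale snapshot with random delay d_k <= tau.
// Toy operator: T(x) = 0.5*x + 1, so S = I - T and Fix T = (2, ..., 2).
std::vector<double> arock_simulate(int m, double eta, int tau, int iters) {
    std::mt19937 rng(42);  // fixed seed for reproducibility
    std::uniform_int_distribution<int> coord(0, m - 1);
    std::uniform_int_distribution<int> delay(0, tau);
    std::vector<double> x(m, 0.0);
    std::deque<std::vector<double>> history{x};  // x^k, x^{k-1}, ...
    for (int k = 0; k < iters; ++k) {
        int d = std::min<int>(delay(rng), (int)history.size() - 1);
        const std::vector<double>& stale = history[d];     // read x^{k - d_k}
        int i = coord(rng);
        double Si = stale[i] - (0.5 * stale[i] + 1.0);     // S_i at stale state
        x[i] -= eta * Si;
        history.push_front(x);                             // record x^{k+1}
        if ((int)history.size() > tau + 1) history.pop_back();
    }
    return x;
}
```

Despite the stale reads, the iterates still reach the fixed point, which is the content of the convergence theory that follows.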
¹Peng-Xu-Yan-Y.'15

19 / 49
Two ways to model x^{k−d_k}

definitions: let x^0, . . . , x^k, . . . be the states of x in the memory

1. x^{k−d_k} is consistent if d_k is a scalar
2. x^{k−d_k} is possibly inconsistent if d_k is a vector, so that different components are delayed by different amounts
ARock allows both consistent and inconsistent read.
20 / 49
Memory lock illustration
Agent 1 reads [0, 0, 0, 0]^T = x^0: a consistent read

Agent 1 reads [0, 0, 0, 2]^T ∉ {x^0, x^1, x^2}: an inconsistent read
21 / 49
History and recent literature
22 / 49
Brief history of async-parallel algorithms (mostly worst-case analysis)

• 1969 – a linear equation solver by Chazan and Miranker;
• 1978 – extended to the fixed-point problem by Baudet under the absolute-contraction² type of assumption;
• for 20–30 years, mainly solving linear, nonlinear, and differential equations, by many people;
• 1989 – Parallel and Distributed Computation: Numerical Methods by Bertsekas and Tsitsiklis; 2000 – review by Frommer and Szyld;
• 1991 – gradient-projection iteration assuming a local linear-error bound by Tseng;
• 2001 – domain decomposition assuming strong convexity by Tai & Tseng.

²An operator T : R^n → R^n is absolute-contractive if |T(x) − T(y)| ≤ P|x − y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, . . . , n, and P ∈ R^{n×n}_+ with ρ(P) < 1.

23 / 49
Absolute-contraction
• Absolute-contractive operator T : R^n → R^n: |T(x) − T(y)| ≤ P|x − y| component-wise, where |x| denotes the vector with components |x_i|, i = 1, . . . , n, and P ∈ R^{n×n}_+ with ρ(P) < 1.
• Interpretation: a series of nested rectangular boxes for x^{k+1} = Tx^k
• Applications:
  • diagonally dominant A for Ax = b
  • diagonally dominant ∇²f for min_x f(x) (strong convexity alone is not enough)
  • some network flow problems
24 / 49
Recent work (stochastic analysis)

• AsySCD for convex smooth and composite minimization by Liu et al.'14 and Liu-Wright'14; async dual CD (regression problems) by Hsieh et al.'15
• async randomized (splitting/distributed/incremental) methods: Wei-Ozdaglar'13, Iutzeler et al.'13, Zhang-Kwok'14, Hong'14, Chang et al.'15
• async SGD: Hogwild!, Lian'15, etc.
• async operator sampling and CD: SMART by Davis'15
25 / 49
Random coordinate selection
• select x_i to update with probability p_i, where min_i p_i > 0
• drawbacks:
  • agents cannot cache data
  • either global memory or communication is required
  • pseudo-random number generation takes time
• benefits:
  • often faster than a fixed cyclic order
  • automatic load balance
  • simplifies certain analysis
26 / 49
Convergence summary
27 / 49
Convergence guarantees
m is the # of coordinates, τ is the maximum delay, uniform selection p_i ≡ 1/m

Theorem (almost sure convergence)
Assume that T is nonexpansive and has a fixed point. Use step sizes η_k ∈ [ε, 1/(2m^{−1/2}τ + 1)), ∀k. Then, with probability one, x^k ⇀ x* ∈ Fix T.

In addition, rates can be derived.
Consequence: the step size is O(1) if τ ∼ √m. Under equal agents and updates, linear speedup is attained using p = O(√m) agents; p can be bigger if T is sparse.
28 / 49
Sketch of proof
• typical inequality:

  ‖x^{k+1} − x*‖² ≤ ‖x^k − x*‖² − c‖Tx^k − x^k‖² + harmful terms(x^{k−1}, . . . , x^{k−τ})

• descent inequality under a new metric:

  E(‖x^{k+1} − x*‖²_M | X^k) ≤ ‖x^k − x*‖²_M − c‖Tx^k − x^k‖²

where
  • X^k is the history up to iteration k
  • x^k = (x^k, x^{k−1}, . . . , x^{k−τ}) ∈ H^{τ+1}, k ≥ 0, is the augmented state
  • x* = (x*, x*, . . . , x*) ∈ X* ⊆ H^{τ+1}
  • M is a positive definite matrix
  • c = c(η_k, m, τ)
29 / 49
• apply the Robbins-Siegmund theorem: if

  E(α_{k+1} | F_k) + v_k ≤ (1 + ξ_k)α_k + η_k,

  where all quantities are nonnegative, α_k is random, and ξ_k, η_k are summable, then α_k converges a.s.;
• prove that weak cluster points are fixed points;
• assume H is separable and apply the results of [Combettes, Pesquet 2014].
30 / 49
Applications and numerical results
31 / 49
Linear equations (asynchronous Jacobi)
• require: an invertible square matrix A with nonzero diagonal entries
• let D be the diagonal part of A; then

  Ax = b ⇐⇒ Tx := (I − D^{−1}A)x + D^{−1}b = x

• T is nonexpansive if ‖I − D^{−1}A‖₂ ≤ 1, i.e., A is diagonally dominant
• x^{k+1} = Tx^k recovers the Jacobi algorithm
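The fixed-point form of Jacobi is easy to sketch; the 3×3 diagonally dominant system below is an assumed toy example (exact solution (1, 1, 1)), not data from the slides:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Jacobi as the fixed-point iteration x^{k+1} = Tx^k = (I - D^{-1}A)x^k + D^{-1}b.
// Written coordinate-wise: x_i <- (b_i - sum_{j != i} a_ij x_j) / a_ii.
std::vector<double> jacobi(const std::vector<std::vector<double>>& A,
                           const std::vector<double>& b, int iters) {
    size_t n = b.size();
    std::vector<double> x(n, 0.0);
    for (int k = 0; k < iters; ++k) {
        std::vector<double> xnew(n);
        for (size_t i = 0; i < n; ++i) {
            double s = 0.0;
            for (size_t j = 0; j < n; ++j)
                if (j != i) s += A[i][j] * x[j];
            xnew[i] = (b[i] - s) / A[i][i];
        }
        x = xnew;  // all coordinates swap together: this is synchronous Jacobi
    }
    return x;
}
```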
32 / 49
Algorithm 1: ARock for linear equations

Input: shared variable x ∈ R^n, K > 0;
set the global iteration counter k = 0;
while k < K, every agent asynchronously and continuously do
  sample i ∈ {1, . . . , m} uniformly at random;
  add −(η_k/a_ii)(Σ_j a_ij x_j^k − b_i) to the shared variable x_i;
  update the global counter k ← k + 1;
33 / 49
Sample code
loadData(A, data_file_name);
loadData(b, label_file_name);
#pragma omp parallel num_threads(p) shared(A, b, x, para)
{   // A, b, x, and para are passed by reference
    call Jacobi(A, b, x, para) or ARock(A, b, x, para);
}

• p: the number of threads
• A, b, x: shared variables
• para: other parameters
34 / 49
Jacobi worker function
for (int itr = 0; itr < max_itr; itr++) {
    // compute the update for the assigned x[i]
    // ...
    #pragma omp barrier
    {
        // write x[i] to global memory
    }
    #pragma omp barrier
}

Jacobi needs the barrier directives for synchronization
35 / 49
ARock worker function
for (int itr = 0; itr < max_itr; itr++) {
    // pick i at random
    // compute the update for x[i]
    // ...
    // write x[i] to global memory
}

ARock has no synchronization barrier directive
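A self-contained barrier-free sketch, using std::thread and std::atomic instead of the slides' OpenMP (that substitution, and the toy 3×3 system in the test, are assumptions of this sketch, not the authors' code). Workers pick random coordinates and write to shared x with no lock; reads may be inconsistent, which ARock's analysis permits:

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <random>
#include <thread>
#include <vector>

// Async-parallel Jacobi: each worker repeatedly picks a random coordinate i
// and relaxes it, x_i <- (b_i - sum_{j != i} a_ij x_j) / a_ii, reading the
// other coordinates without any barrier (possibly stale/inconsistent reads).
void async_jacobi(const std::vector<std::vector<double>>& A,
                  const std::vector<double>& b,
                  std::vector<std::atomic<double>>& x,
                  int updates_per_worker, int num_workers) {
    auto worker = [&](unsigned seed) {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<size_t> pick(0, b.size() - 1);
        for (int t = 0; t < updates_per_worker; ++t) {
            size_t i = pick(rng);
            double s = 0.0;
            for (size_t j = 0; j < b.size(); ++j)
                if (j != i) s += A[i][j] * x[j].load();  // possibly stale reads
            x[i].store((b[i] - s) / A[i][i]);            // lock-free write
        }
    };
    std::vector<std::thread> pool;
    for (int w = 0; w < num_workers; ++w) pool.emplace_back(worker, 1234u + w);
    for (auto& th : pool) th.join();
}
```

For a diagonally dominant A the update is an absolute contraction, so the iterates converge despite the races.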
36 / 49
Minimizing smooth functions
• require: a convex function f with Lipschitz continuous gradient
• if ∇f is L-Lipschitz, then

  minimize_x f(x) ⇐⇒ x = (I − (2/L)∇f)x =: Tx,

  where T is nonexpansive
• ARock will be efficient when ∇_{x_i} f(x) is easy to compute
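A damped step of the fixed-point map above, x^{k+1} = x^k − ηSx^k with S = I − T, is ordinary gradient descent with step 2η/L (η = 1/2 gives the classical 1/L step). The quadratic below is an assumed toy problem, f(x) = 0.5(3x_0² + x_1²) with L = 3:

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Gradient descent as the damped fixed-point iteration
//   x^{k+1} = x^k - eta * S(x^k),  S = I - T,  T = I - (2/L) grad f,
// i.e. a gradient step of length 2*eta/L.
std::array<double, 2> minimize_smooth(std::array<double, 2> x, int iters) {
    const double L = 3.0, eta = 0.5;
    for (int k = 0; k < iters; ++k) {
        std::array<double, 2> grad = {3.0 * x[0], x[1]};  // grad f for the toy f
        for (int i = 0; i < 2; ++i)
            x[i] -= eta * (2.0 / L) * grad[i];            // x - (1/L) grad f(x)
    }
    return x;
}
```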
37 / 49
Minimizing composite functions
• require: convex smooth g(·) and convex (possibly nonsmooth) f(·)
• proximal map: prox_{γf}(y) = argmin_x f(x) + (1/2γ)‖x − y‖²

  minimize_x f(x) + g(x) ⇐⇒ x = prox_{γf} ∘ (I − γ∇g) x =: Tx

• ARock will be fast if
  • ∇_{x_i} g(x) is easy to compute
  • f(·) is separable (e.g., ℓ₁ and ℓ₁,₂)
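A 1-D forward-backward sketch of the operator above, with f = λ|·| (whose prox is soft-thresholding) and g(x) = 0.5(x − a)²; the data (a, λ, γ) are assumptions of this sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Soft-thresholding: prox of t*|.| is sign(y) * max(|y| - t, 0).
double soft_threshold(double y, double t) {
    return std::copysign(std::max(std::fabs(y) - t, 0.0), y);
}

// Forward-backward iteration x <- prox_{gamma*f}( x - gamma * grad g(x) )
// for f = lambda*|.| and g(x) = 0.5*(x - a)^2 (toy 1-D problem).
double prox_grad(double x, double a, double lambda, double gamma, int iters) {
    for (int k = 0; k < iters; ++k)
        x = soft_threshold(x - gamma * (x - a), gamma * lambda);
    return x;
}
```

For a = 3, λ = 1 the minimizer of λ|x| + 0.5(x − a)² is the shrunken value a − λ = 2, which the iteration reaches.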
38 / 49
Example: sparse logistic regression
• n features, N labeled samples
• each sample a_i ∈ R^n has a label b_i ∈ {+1, −1}
• ℓ₁-regularized logistic regression:

  minimize_{x ∈ R^n}  λ‖x‖₁ + (1/N) Σ_{i=1}^N log(1 + exp(−b_i · a_i^T x)),  (1)

• compare sync-parallel and ARock (async-parallel) on two datasets:

  Name   | N (# samples) | n (# features) | # nonzeros in {a_1, . . . , a_N}
  rcv1   | 20,242        | 47,236         | 1,498,952
  news20 | 19,996        | 1,355,191      | 9,097,916
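Objective (1) evaluates directly; this sketch uses tiny synthetic data as a placeholder (not rcv1/news20):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// The l1-regularized logistic objective of (1):
//   lambda*||x||_1 + (1/N) * sum_i log(1 + exp(-b_i * a_i^T x)).
double logistic_obj(const std::vector<std::vector<double>>& a,
                    const std::vector<double>& b,
                    const std::vector<double>& x, double lambda) {
    double loss = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        double dot = 0.0;
        for (size_t j = 0; j < x.size(); ++j) dot += a[i][j] * x[j];
        loss += std::log1p(std::exp(-b[i] * dot));  // log(1 + exp(-b_i a_i^T x))
    }
    double l1 = 0.0;
    for (double xj : x) l1 += std::fabs(xj);
    return lambda * l1 + loss / a.size();
}
```

At x = 0 every sample contributes log 2, a handy sanity check.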
39 / 49
Speedup tests
• implemented in C++ and OpenMP
• 32-core shared-memory machine

          |            rcv1            |            news20
  #cores  |  Time (s)    |  Speedup    |  Time (s)    |  Speedup
          | async  sync  | async sync  | async  sync  | async sync
  1       | 122.0  122.0 |  1.0   1.0  | 591.1  591.3 |  1.0   1.0
  2       |  63.4  104.1 |  1.9   1.2  | 304.2  590.1 |  1.9   1.0
  4       |  32.7   83.7 |  3.7   1.5  | 150.4  557.0 |  3.9   1.1
  8       |  16.8   63.4 |  7.3   1.9  |  78.3  525.1 |  7.5   1.1
  16      |   9.1   45.4 | 13.5   2.7  |  41.6  493.2 | 14.2   1.2
  32      |   4.9   30.3 | 24.6   4.0  |  22.6  455.2 | 26.1   1.3
40 / 49
More applications
41 / 49
Minimizing composite functions
• require: both f and g convex (possibly nonsmooth)

  minimize f(x) + g(x) ⇐⇒ z = refl_{γf} ∘ refl_{γg}(z) =: T_PRS(z),

  then recover x = prox_{γg}(z)

• T_PRS is known as the Peaceman-Rachford splitting operator³
• also, the Douglas-Rachford splitting operator: (1/2)I + (1/2)T_PRS
• ARock runs fast when
  • refl_{γf} is separable
  • (refl_{γg})_i is easy to compute/maintain

³reflective proximal map: refl_{γf} := 2 prox_{γf} − I. The maps refl_{γf}, refl_{γg}, and thus refl_{γf} ∘ refl_{γg} are nonexpansive.
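A 1-D Douglas-Rachford sketch of the averaged operator (1/2)I + (1/2)T_PRS, for the assumed toy pair f(x) = |x| and g(x) = 0.5(x − a)² with γ = 1 (neither from the slides):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// refl_h := 2*prox_h - I, as in the slide's footnote.
double prox_f(double y, double gamma) {            // prox of |.| : soft-thresholding
    return std::copysign(std::max(std::fabs(y) - gamma, 0.0), y);
}
double prox_g(double y, double gamma, double a) {  // prox of 0.5*(x - a)^2
    return (y + gamma * a) / (1.0 + gamma);
}

// Douglas-Rachford: z <- z/2 + (refl_f . refl_g)(z)/2, then x = prox_g(z).
double drs(double z, double a, int iters) {
    const double gamma = 1.0;
    for (int k = 0; k < iters; ++k) {
        double rg = 2.0 * prox_g(z, gamma, a) - z;  // refl_g(z)
        double rf = 2.0 * prox_f(rg, gamma) - rg;   // refl_f(refl_g(z))
        z = 0.5 * z + 0.5 * rf;                     // averaged (DRS) step
    }
    return prox_g(z, gamma, a);                     // recover x
}
```

For a = 3 the minimizer of |x| + 0.5(x − 3)² is 2, and the iteration converges there.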
42 / 49
Parallel/distributed ADMM
• require: m convex functions f_i
• consensus problem:

  minimize_x Σ_{i=1}^m f_i(x) + g(x)
  ⇐⇒ minimize_{x_i, y} Σ_{i=1}^m f_i(x_i) + g(y)
      subject to x_i − y = 0, i = 1, . . . , m,

  i.e., diag(I, . . . , I)(x_1; . . . ; x_m) − (I; . . . ; I)y = 0

• Douglas-Rachford-ARock applied to the dual problem ⇒ async-parallel ADMM:
  • the m subproblems are solved in an async-parallel fashion
  • y and z_i (dual variables) are updated in global memory (no lock)
43 / 49
Decentralized computing
• n agents in a connected network G = (V, E) with bi-directional links E
• each agent i has a private function f_i
• problem: find a consensus solution x* to

  minimize_{x ∈ R^p} f(x) := Σ_{i=1}^n f_i(A_i x_i) subject to x_i = x_j, ∀i, j

• challenges: no center, only between-neighbor communication
• benefits: fault tolerance, no long-distance communication, privacy
44 / 49
Async-parallel decentralized ADMM
• a graph of connected agents: G = (V, E)
• decentralized consensus optimization problem:

  minimize_{x_i ∈ R^d, i ∈ V} f(x) := Σ_{i∈V} f_i(x_i)
  subject to x_i = x_j, ∀(i, j) ∈ E

• ADMM reformulation: constraints x_i = y_ij, x_j = y_ij, ∀(i, j) ∈ E
• apply
  • ARock version 1: nodes asynchronously activate
  • ARock version 2: edges asynchronously activate

no global clock, no central controller; each agent keeps f_i private and talks only to its neighbors
45 / 49
notation:

• E(i): the edges of agent i, corresponding to the neighbor sets L(i) ∪ R(i)
• L(i): neighbors j of agent i with j < i
• R(i): neighbors j of agent i with j > i

Algorithm 2: ARock for the decentralized consensus problem

Input: each agent i sets x_i^0 ∈ R^d, dual variables z_{e,i}^0 for e ∈ E(i), K > 0.
while k < K, any activated agent i do
  receive z_{li,l}^k from neighbors l ∈ L(i) and z_{ir,r}^k from neighbors r ∈ R(i);
  update local x_i^k, z_{li,i}^{k+1}, and z_{ir,i}^{k+1} according to (2a)–(2c), respectively;
  send z_{li,i}^{k+1} to neighbors l ∈ L(i) and z_{ir,i}^{k+1} to neighbors r ∈ R(i).

  x_i^k ∈ argmin_{x_i} f_i(x_i) + (Σ_{l∈L(i)} z_{li,l}^k + Σ_{r∈R(i)} z_{ir,r}^k)^T x_i + (γ/2)|E(i)| ‖x_i‖²,  (2a)
  z_{ir,i}^{k+1} = z_{ir,i}^k − η_k((z_{ir,i}^k + z_{ir,r}^k)/2 + γx_i^k), ∀r ∈ R(i),  (2b)
  z_{li,i}^{k+1} = z_{li,i}^k − η_k((z_{li,i}^k + z_{li,l}^k)/2 + γx_i^k), ∀l ∈ L(i).  (2c)
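For a quadratic private function, subproblem (2a) has a closed form. This sketch assumes f_i(x) = 0.5(x − a_i)² in one dimension (a toy choice; the slides keep f_i general) and checks the minimizer against its objective:

```cpp
#include <cassert>
#include <cmath>

// Closed-form solve of (2a) for the toy quadratic f_i(x) = 0.5*(x - a_i)^2:
//   x_i = argmin_x f_i(x) + s*x + (gamma/2)*|E(i)|*x^2,
// where s is the sum of the incoming dual variables z. Setting the derivative
// (x - a_i) + s + gamma*|E(i)|*x to zero gives x = (a_i - s)/(1 + gamma*|E(i)|).
double local_x_update(double a_i, double z_sum, double gamma, int degree) {
    return (a_i - z_sum) / (1.0 + gamma * degree);
}

// The (2a) objective for this toy f_i, used to sanity-check the minimizer.
double local_obj(double x, double a_i, double z_sum, double gamma, int degree) {
    return 0.5 * (x - a_i) * (x - a_i) + z_sum * x + 0.5 * gamma * degree * x * x;
}
```

In the full algorithm this local solve is followed by the dual updates (2b)–(2c) and the exchange of z's with neighbors.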
46 / 49
Summary
47 / 49
Summary of async-parallel coordinate descent
benefits:
• eliminate idle time
• reduce communication / memory-access congestion
• random job selection: load balance
mathematics: analysis of disorderly (partial) updates
• asynchronous delay
• inconsistent read and write
48 / 49
Thank you!
Acknowledgements: NSF DMS
Reference: Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin. UCLA CAM 15-37.
Website: http://www.math.ucla.edu/~wotaoyin/arock
49 / 49