
Securing Distributed Machine Learning in High Dimensions

Jiaming Xu

The Fuqua School of Business, Duke University

Joint work with Yudong Chen (Cornell) and Lili Su (MIT)

Workshop on Information, Learning and Decision, ShanghaiTech, July 1, 2018


Distributed Machine Learning: Robustness

• An attractive solution to large-scale problems

– Algorithms: [Boyd et al. 11], [Jordan, Lee and Yang 16], etc.

– Systems: [MapReduce, Dean and Ghemawat 08], etc.

• The necessity of robustness: Corrupted data

– Statistical noise: [Candes et al., JACM 11], [Loh and Wainwright, NIPS 11]

– Adversarial corruption: no structural assumptions [Chen, Caramanis and Mannor, ICML 13], [Diakonikolas et al., FOCS 16], [Charikar et al., STOC 17]

Our Goal

• Implicit assumption of previous work: a reliable learning system

– Each computing device follows its designed specification

• Our focus: an unreliable learning system

– Adversarial attacks: some unknown subset of the computing devices is compromised and behaves adversarially, e.g., by sending out malicious messages

Goal: Secure model training in an unreliable learning system


Why consider an unreliable learning system?

Privacy Risk in Conventional Learning Paradigm

• Data is collected from providers and stored in the cloud

• Serious privacy risks:

– Facebook data scandal

– PRISM: Facebook, Google, Yahoo!, Apple, Microsoft, Dropbox, etc.



New Learning Paradigm: Federated Learning

Key idea: Leave training data on mobile devices

• Learning with external workers (data providers)

• Proposed by a Google researcher [McMahan 16]

• Tested with Gboard on Android (the Google Keyboard)


Security Risk in Federated Learning

• Less secure implementation environment

• External workers are prone to adversarial attacks: they can be reprogrammed by system hackers and behave maliciously


Goal: Secure model training in an unreliable learning system

Challenges of Securing Unreliable Learning Systems

• Low local data volume versus high model complexity

– The local estimator is statistically inaccurate

– Hard to distinguish statistical errors from adversarial errors

– Calls for close interaction between the learner (the cloud) and the workers

• Communication constraints: data transmission suffers from high latency and low throughput

Objectives

• Tolerate adversarial failures of the external workers

• Accurately learn highly complex models with low local data volume

• Use only a few communication rounds


Outline of the Remainder

1. Problem formulation

2. Algorithm 1: Geometric median of means

3. Algorithm 2 (optimal algorithm): iterative reweighting + projecting + filtering

4. Summary and concluding remarks


Problem Formulation: Learning Model

• $N$ i.i.d. data points $X_i \overset{\text{i.i.d.}}{\sim} \mu$

• Collectively kept by $m$ workers – each worker keeps $N/m$ data points

• The learner wants to pick a model in Θ ⊆ Rd

• Loss function $f(x, \theta)$: the loss induced by $x \in \mathcal{X}$ under the model choice $\theta \in \Theta$

Target: $\theta^* \in \arg\min_{\theta \in \Theta} F(\theta) := \mathbb{E}[f(X, \theta)]$

NOTE: the population risk F (θ) is unknown

Example: Linear Regression

• $N$ i.i.d. data points $X_i = (w_i, y_i) \overset{\text{i.i.d.}}{\sim} \mu$

– $w_i$ can be the features of a house/apartment, and $y_i$ its sale price

• $\Theta \subseteq \mathbb{R}^d$: the set of possible linear predictors

• Loss function $f(x, \theta) = \tfrac{1}{2}(y - \langle w, \theta \rangle)^2$

Target: $\theta^* \in \arg\min_{\theta \in \Theta} \mathbb{E}\big[\tfrac{1}{2}(y - \langle w, \theta \rangle)^2\big]$
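To make the example concrete, here is a minimal NumPy sketch (my own illustration, not from the slides; the function names are made up) of the per-sample squared loss and its gradient with respect to θ – the quantity each worker differentiates locally:

    import numpy as np

    def squared_loss(theta, w, y):
        # f(x, theta) = 0.5 * (y - <w, theta>)^2 for a single data point x = (w, y)
        residual = y - w @ theta
        return 0.5 * residual ** 2

    def squared_loss_grad(theta, w, y):
        # gradient of f with respect to theta: -(y - <w, theta>) * w
        residual = y - w @ theta
        return -residual * w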

Problem Formulation: Byzantine Fault Model

• In any iteration, up to q out of the m workers are compromised and behave arbitrarily;

• the set of faulty workers may differ across iterations;

• faulty workers have complete knowledge of the system;

• faulty workers can collude



Algorithm: Byzantine Gradient Descent

The learner:

1. Broadcast the current model parameter estimate $\theta_{t-1}$;

2. Wait to receive the gradients $g_t^{(j)}$ from all workers $j$;

3. Robustly aggregate the gradients to obtain $\nabla \hat F(\theta_{t-1})$;

4. Update: $\theta_t \leftarrow \theta_{t-1} - \eta_t \nabla \hat F(\theta_{t-1})$;

Non-faulty worker $j$:

1. Compute the local sample gradient $g_t^{(j)} = \sum_{X_i \text{ local}} \nabla f(X_i, \theta_{t-1})$;

2. Send $g_t^{(j)}$ back to the learner;

Simple averaging, i.e., taking $\nabla \hat F(\theta_{t-1}) = \frac{1}{m} \sum_{j=1}^m g_t^{(j)}$, is not robust to even a single Byzantine failure!
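The learner's loop with a pluggable robust aggregation rule can be sketched as follows (a minimal illustration, not the authors' implementation; worker_grads, aggregate, and the other names are placeholders I introduce here):

    import numpy as np

    def byzantine_gradient_descent(worker_grads, theta0, aggregate, step_size, num_rounds):
        # worker_grads: one callable per worker, returning that worker's
        #   (possibly malicious) gradient at the broadcast parameter.
        # aggregate: a robust aggregation rule, e.g. geometric median of means
        #   or iterative filtering, mapping an (m, d) array to a d-vector.
        theta = theta0
        for _ in range(num_rounds):
            grads = np.array([g(theta) for g in worker_grads])  # collect m reported gradients
            agg_grad = aggregate(grads)                         # robust aggregation step
            theta = theta - step_size * agg_grad                # gradient update
        return theta

Plain averaging corresponds to aggregate = lambda g: g.mean(axis=0), which a single Byzantine worker can drive arbitrarily far off.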

Generic Key Technical Challenges

Target: $\theta^* \in \arg\min_{\theta \in \Theta} F(\theta) := \mathbb{E}[f(X, \theta)]$

• Suppose $F(\theta)$ is known: perfect gradient descent – $\theta_t = \theta_{t-1} - \eta \nabla F(\theta_{t-1})$

• But $F(\theta)$ is unknown: approximate gradient descent –

$\theta'_t = \theta'_{t-1} - \eta_t \nabla \hat F(\theta'_{t-1}) = \theta'_{t-1} - \eta_t \nabla F(\theta'_{t-1}) + \varepsilon(\theta'_{t-1})$

– The elements of $\{\varepsilon(\theta'_{t-1})\}_{t=1}^{\infty}$ are dependent on each other;

– Complicated interplay between the randomness and the arbitrary behaviors of Byzantine workers.

Our analysis plan: show uniform convergence, i.e., show $\varepsilon(\theta) \approx 0$ uniformly for all $\theta \in \Theta$

Standard concentration results might not suffice


Algorithm I: Median of Means

Robust Gradient Aggregation: Median of Means

Median of Means

Given $nk$ points $X_1, \dots, X_{nk}$, partition them into $k$ groups of $n$ points each and set

$$\hat\phi_{\mathrm{MM}} := \mathrm{median}\left(\frac{1}{n}\sum_{i=1}^{n} X_i,\ \cdots,\ \frac{1}{n}\sum_{i=(k-1)n+1}^{kn} X_i\right)$$
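A minimal NumPy sketch of this estimator (my own illustration; the coordinatewise median stands in for the median step here, while in higher dimensions the slides use the geometric median sketched below):

    import numpy as np

    def median_of_means(points, k):
        # Split the nk points into k groups, average each group,
        # then take a median of the k group means.
        groups = np.array_split(points, k)
        group_means = np.array([g.mean(axis=0) for g in groups])
        return np.median(group_means, axis=0)  # coordinatewise median as a stand-in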

Definition (Geometric median)

$y^* := \mathrm{med}\{y_1, \cdots, y_m\} = \arg\min_{y \in \mathbb{R}^d} \sum_{i=1}^m \|y - y_i\|_2$

Efficient computation of the geometric median: nearly linear time [Cohen et al., STOC 2016]
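For illustration, a simple Weiszfeld-style fixed-point iteration approximates the geometric median (a sketch only, not the nearly-linear-time algorithm of [Cohen et al. '16]):

    import numpy as np

    def geometric_median(points, num_iters=100, eps=1e-8):
        # Fixed-point iteration for y* = argmin_y sum_i ||y - y_i||_2.
        y = points.mean(axis=0)                      # initialize at the sample mean
        for _ in range(num_iters):
            dists = np.linalg.norm(points - y, axis=1)
            weights = 1.0 / np.maximum(dists, eps)   # guard against zero distances
            y_new = (weights[:, None] * points).sum(axis=0) / weights.sum()
            if np.linalg.norm(y_new - y) < eps:
                break
            y = y_new
        return y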

Robustness of Geometric Median


• One-dimensional case: the geometric median equals the standard median. If strictly more than $\lfloor n/2 \rfloor$ points lie in $[-r, r]$ for some $r \in \mathbb{R}$, then the median also lies in $[-r, r]$

• Multi-dimensional case:

Lemma (Minsker et al. 2015)

For any $\alpha \in (0, 1/2)$ and given $r \in \mathbb{R}$, if $\sum_{i=1}^n \mathbf{1}\{\|y_i\|_2 \le r\} \ge (1-\alpha)n$, then $\|y^*\|_2 \le C_\alpha r$, where $C_\alpha = \frac{1-\alpha}{\sqrt{1-2\alpha}}$.

Intuition: Majority voting in the noisy setting


Performance with Median of Means

(1) $q \ge 1$: the maximum number of Byzantine workers; (2) $d$: the model dimension, i.e., $\Theta \subseteq \mathbb{R}^d$

Theorem (Informal)

Suppose some mild technical assumptions hold and $2(1+\epsilon)q \le k \le m$. Assume $F(\theta)$ is $M$-strongly convex with $L$-Lipschitz gradient. Then with high probability,

$$\|\theta_t - \theta^*\| \le \rho^t \|\theta_0 - \theta^*\| + C\sqrt{\frac{dk}{N}}, \quad \forall t \ge 1,$$

where $\rho = \frac{1}{2} + \frac{1}{2}\sqrt{1 - \frac{M^2}{4L^2}} \in (0, 1)$.

• After $\log N$ rounds, the $\sqrt{dq/N}$ term becomes the dominant part

• When $q = 0$, we choose $k = 1$

• When $q$ is large, we choose $k = 2(1+\epsilon)q$, resulting in an error of $O(\sqrt{dq/N})$

Drawbacks of Geometric Median in High Dimensions

$$y^* = \arg\min_{y} \sum_{i=1}^m \|y - y_i\| \iff \sum_{i=1}^m \frac{y_i - y^*}{\|y_i - y^*\|} = 0$$

[Figure: a cloud of good data points with typical spread $\sqrt d$ around the true mean, plus an $\epsilon$ fraction of adversarial points off to one side; the geometric median is pulled a distance of order $\epsilon\sqrt d$ away from the true mean.]

• Good data $y_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, I_d)$

• An $\epsilon$ fraction is adversarially corrupted

• The geometric median suffers an $\epsilon\sqrt d$ error
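A quick numerical sanity check of this scaling (my own sketch under the stated assumptions, reusing the geometric_median function from the earlier sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    d, m, eps = 200, 1000, 0.1
    n_bad = int(eps * m)
    good = rng.normal(size=(m - n_bad, d))                   # N(0, I_d), true mean 0
    direction = np.ones(d) / np.sqrt(d)
    bad = np.tile(10 * np.sqrt(d) * direction, (n_bad, 1))   # outliers far along one direction
    gm = geometric_median(np.vstack([good, bad]))
    print(np.linalg.norm(gm), eps * np.sqrt(d))              # both of order eps * sqrt(d)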



Algorithm II: Optimal Algorithm in High Dimensions

[Su and Xu, 2018] improves the estimation error from $O\big(\sqrt{qd/N}\big)$ to $O\big(\sqrt{d/N} + \sqrt{q/N}\big)$ – matching the minimax error rate of the ideal failure-free setting as long as $q = O(d)$.


Ideas of Iterative Filtering [SCV ’18]

[Figure: good data points clustered around the center $\mu$; $u$ marks an extreme direction along which the corrupted points stand out.]

• If the center $\mu$ were known: from

$$uu^\top \in \arg\max_{U}\ \sum_i (y_i - \mu)^\top U (y_i - \mu) \quad \text{s.t.}\ U \succeq 0,\ \mathrm{Tr}(U) \le 1,$$

filter out outliers based on $\langle y_i - \mu, u \rangle^2$

• However, µ is unknown!

• Idea: represent $y_i$ through $\sum_j W_{ji} y_j$; each $W_{ji}$ is constrained to be around $\frac{1}{(1-\epsilon)m}$
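To illustrate the known-center idealization, here is a small NumPy sketch (my own, not the algorithm of [SCV '18]) that scores points by their squared projection onto the extreme direction u:

    import numpy as np

    def outlier_scores_known_center(points, mu):
        # u is the top eigenvector of sum_i (y_i - mu)(y_i - mu)^T,
        # i.e. the maximizer U = u u^T of the trace-constrained problem above.
        centered = points - mu
        cov = centered.T @ centered
        eigvals, eigvecs = np.linalg.eigh(cov)
        u = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
        return (centered @ u) ** 2          # scores <y_i - mu, u>^2; large = likely outlier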


Iterative Filtering Algorithm [SCV ’18]

Define the cost function

$$\phi(W, U) = \sum_{i \in S} c_i \Big(y_i - \sum_{j \in S} W_{ji} y_j\Big)^\top U \Big(y_i - \sum_{j \in S} W_{ji} y_j\Big)$$

1. Compute a saddle point:

(center approximation) $W^* \in \arg\min_{W} \max_{U} \phi(W, U)$

(extreme direction) $U^* \in \arg\max_{U} \min_{W} \phi(W, U)$

2. If $\phi(W^*, U^*)$ is small enough, stop; otherwise, down-weight each $c_i$ in proportion to $\big(y_i - \sum_{j \in S} W^*_{ji} y_j\big)^\top U^* \big(y_i - \sum_{j \in S} W^*_{ji} y_j\big)$, throw away data points for which $c_i \le 1/2$, and repeat.
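A heavily simplified sketch of this loop (my own illustration: the weighted mean stands in for the center approximation W*, and the top eigenvector of the weighted covariance stands in for the extreme direction U*, rather than the exact saddle-point computation of [SCV '18]):

    import numpy as np

    def iterative_filtering(points, sigma2, max_iters=50):
        m = len(points)
        c = np.ones(m)                                  # per-point weights c_i
        center = points.mean(axis=0)
        for _ in range(max_iters):
            keep = c > 0.5                              # drop points with c_i <= 1/2
            y, w = points[keep], c[keep]
            center = (w[:, None] * y).sum(axis=0) / w.sum()
            centered = y - center
            cov = (w[:, None] * centered).T @ centered / w.sum()
            eigvals, eigvecs = np.linalg.eigh(cov)
            if eigvals[-1] <= sigma2:                   # weighted covariance small enough: stop
                break
            scores = (centered @ eigvecs[:, -1]) ** 2   # projection onto the extreme direction
            c[keep] = w * (1 - scores / scores.max())   # down-weight in proportion to the scores
        return center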


Guarantees of Iterative Filtering Algorithm [SCV ’18]

Lemma (SCV ’18)

Define $\mu_S = \frac{1}{m}\sum_{i=1}^m y_i$. Suppose that

$$\Big\|\frac{1}{m}\sum_i (y_i - \mu_S)(y_i - \mu_S)^\top\Big\|_2 \le \sigma^2.$$

Then for $\epsilon \le \frac{1}{4}$, the Iterative Filtering Algorithm outputs $\hat\mu$ such that $\|\hat\mu - \mu_S\| = O(\sigma\sqrt{\epsilon})$.

• Gradient vectors $\{g_j(\theta_{t-1})\}_{j=1}^m$ are not i.i.d.

• Apply with $y_j$ = gradient functions: $g_j(\theta) = \frac{1}{|S_j|}\sum_{i \in S_j} \nabla f(X_i, \theta)$

• Need concentration of the matrix $[g_1(\theta), \dots, g_m(\theta)]$ uniformly over $\theta$


Uniform Concentration of Sample Covariance Matrix

• If the gradient functions $g_j(\theta)$ are sub-Gaussian, use an $\epsilon$-net argument

• However, in many cases such as linear regression, $g_j(\theta)$ is sub-exponential

• Existing tail bounds for matrices with sub-exponential columns are not tight. State of the art, via standard concentration bounds [ALPTJ '10]: $\sqrt{md} + d$

Theorem (SX ’18)

Let $A$ be a $d \times m$ matrix whose columns $A_j$ are i.i.d., sub-exponential, and zero-mean. Then with probability at least $1 - e^{-d}$,

$$\|A\|_2 \lesssim \sqrt{m} + d \log^3 d$$

Remark: tight up to poly-log factors


Guarantees of Aggregated Gradient by Iterative Filtering

Theorem (SX ’18)

Suppose some mild technical assumptions hold and $N \gtrsim d^2$. Let $\nabla \hat F(\theta)$ be the aggregated gradient function produced by the Iterative Filtering Algorithm. Then with probability at least $1 - 2e^{-\sqrt d}$,

$$\big\|\nabla \hat F(\theta) - \nabla F(\theta)\big\| \lesssim \left(\sqrt{\frac{q}{N}} + \sqrt{\frac{d}{N}}\right)\|\theta - \theta^*\| + \left(\sqrt{\frac{q}{N}} + \sqrt{\frac{d}{N}}\right)$$

• $N \gtrsim d^2$ is due to our sub-exponential assumption and is inevitable

• If we instead assume sub-Gaussian gradients, only $N \gtrsim d$ is needed


Main Convergence Result

Theorem (SX ’18)

Suppose some mild technical assumptions hold and $N \gtrsim d^2$. Assume $F(\theta)$ is $M$-strongly convex with $L$-Lipschitz gradient. Then with high probability,

$$\|\theta_t - \theta^*\| \lesssim \left(1 - \frac{M^2}{16L^2}\right)^t \|\theta_0 - \theta^*\| + \left(\sqrt{\frac{q}{N}} + \sqrt{\frac{d}{N}}\right).$$

• Improves over the geometric median-of-means rate $\sqrt{dq/N}$

• If $q = O(d)$, the error rate is optimal

• Tolerates up to a $q/m = \Theta(1)$ fraction of Byzantine workers

• Exponential convergence → only logarithmically many communication rounds


References

• Lili Su and Jiaming Xu, Securing Distributed Machine Learning in High Dimensions, arXiv:1804.10140, April 2018

• Yudong Chen, Lili Su, and Jiaming Xu, Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent

– Conference version: SIGMETRICS 2018;

– Journal version: Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Dec. 2017.
