SPARK SUMMIT EUROPE 2016
SCALING FACTORIZATION MACHINES ON APACHE SPARK WITH PARAMETER SERVERS
Nick Pentreath, Principal Engineer, IBM




Page 1

SPARK SUMMIT EUROPE 2016

SCALING FACTORIZATION MACHINES ON APACHE SPARK WITH PARAMETER SERVERS

Nick Pentreath
Principal Engineer, IBM

Page 2

About
• About me
  – @MLnick
  – Principal Engineer at IBM working on machine learning & Spark
  – Apache Spark PMC
  – Author of Machine Learning with Spark

Page 3

Agenda
• Brief Intro to Factorization Machines
• Distributed FMs with Spark and Glint
• Results
• Challenges
• Future Work

Page 4

FACTORIZATION MACHINES

Page 5

Factorization Machines
Feature interactions

Page 6

Factorization Machines
Linear Models

$w_0 + \sum_{i=1}^{d} w_i x_i$

Bias term ($w_0$) + linear terms

Not expressive enough!

Page 7

Factorization Machines
Polynomial Regression

$w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} w_{ij} x_i x_j$

Bias term ($w_0$) + linear terms + interaction terms

O(d²) interaction parameters, and n ≪ d

Page 8

Factorization Machines
Factorization Machine

$w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \langle v_i, v_j \rangle x_i x_j$

Bias term ($w_0$) + linear terms + factorized interaction term

O(d²k)

Page 9

Factorization Machines
Factorization Machine

“Math hack” (aka reformulation): O(d²k) → O(dk)

Not convex, but efficient to train using SGD, coordinate descent or MCMC
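The “math hack” is the standard FM reformulation of the pairwise term, written out here in the notation above:

```latex
\sum_{i=1}^{d}\sum_{j=i+1}^{d} \langle v_i, v_j \rangle x_i x_j
  = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{d} v_{i,f}\, x_i\right)^{2}
    - \sum_{i=1}^{d} v_{i,f}^{2}\, x_i^{2}\right]
```

Each factor dimension f needs one O(d) pass (O(nnz) on sparse data), giving O(dk) overall.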

Page 10

Factorization Machines
Factorization Machine

Standard matrix factorization (with biases)
Contextual features, metadata features

Model size can still be very large! e.g. video sharing, online ads, social networks
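As a concrete illustration, here is a minimal pure-Python sketch of FM prediction using the O(dk) reformulation above, with a sparse input given as an {index: value} dict. `fm_predict` is a hypothetical helper for this deck, not the Glint or libFM API.

```python
# Minimal FM prediction sketch using the O(dk) reformulation.
# x: sparse input as {feature_index: value}
# w0: bias, w: {feature_index: linear weight}, V: {feature_index: k-dim factor list}

def fm_predict(w0, w, V, x):
    k = len(next(iter(V.values())))  # factor dimension (assumes V is non-empty)
    linear = sum(w.get(i, 0.0) * xi for i, xi in x.items())
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * xi for i, xi in x.items())          # sum_i v_{i,f} * x_i
        s2 = sum((V[i][f] * xi) ** 2 for i, xi in x.items())  # sum_i (v_{i,f} * x_i)^2
        pairwise += s * s - s2
    return w0 + linear + 0.5 * pairwise
```

Cost is O(nnz(x) · k) per prediction, which is what makes FMs practical on high-dimensional sparse data.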

Page 11

DISTRIBUTED FM MODELS ON SPARK

Page 12

Linear Models on Spark
• Data parallel
• Master sends (broadcasts) the latest full model to the workers
• Each worker computes a partial gradient over its partition using the latest full model
• Partial gradients are combined via tree aggregate and the master updates the model
• Repeat; after the last iteration the master holds the final model
• The broadcast + tree aggregate of the full model is the bottleneck!

Page 13

Parameter Servers
• Model & data parallel
• Master schedules work across the workers
• Each worker pulls a partial model from the parameter servers and computes a partial gradient using it
• Workers push partial gradients to the parameter servers, which update the model
• Only the required features need to be pushed/pulled

Page 14

Distributed FMs
• spark-libFM
  – Uses old MLlib GradientDescent and LBFGS interfaces
• DiFacto
  – Async SGD implementation using a parameter server (ps-lite)
  – Adagrad, L1 regularization, frequency-adaptive model size
• Key point: most real-world datasets are highly sparse (especially high-cardinality categorical data), e.g. online ads, social networks, recommender systems

• Workers only need access to a small piece of the model
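The “small piece of the model” idea can be illustrated with a toy, single-process parameter server. This is a hypothetical sketch of the sparse pull/push pattern only, not the actual Glint or ps-lite API.

```python
# Toy single-process parameter server illustrating sparse pull/push.

class ToyParamServer:
    def __init__(self):
        self.weights = {}  # feature index -> weight (missing keys are implicitly 0)

    def pull(self, indices):
        # A worker fetches only the weights for the features in its partition
        return {i: self.weights.get(i, 0.0) for i in indices}

    def push(self, sparse_grad, lr=0.1):
        # A worker sends back only the non-zero gradient entries;
        # the server applies the SGD update to just those coordinates
        for i, g in sparse_grad.items():
            self.weights[i] = self.weights.get(i, 0.0) - lr * g
```

With 34m features but ~48 non-zeros per example, a worker touches only a tiny slice of the model per step, so pull/push traffic stays small.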

Page 15

GlintFM
• Procedure:
  1. Construct Glint client
  2. Create distributed parameters
  3. Pre-compute required feature indices (per partition)
  4. Iterate:
     • Pull partial model (blocking)
     • Compute partial gradient & update
     • Push partial update to parameter servers (can be async)
  5. Done!

Glint is a parameter server built using Akka
For more info, see Rolf’s talk at 17:15!
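The procedure above, condensed into a toy single-process sketch for a plain linear model. This is hypothetical illustration code: real GlintFM distributes the server over Akka and runs the inner loop per Spark partition.

```python
# Toy sketch of the GlintFM-style loop. `server` is a dict standing in for the
# distributed parameters; each "partition" only pulls/pushes its own features.

def train(partitions, num_iters=10, lr=0.1):
    server = {}  # 1-2. construct client / create distributed parameters (stand-in)

    # 3. Pre-compute required feature indices (per partition)
    indices = [sorted({i for x, _ in part for i in x}) for part in partitions]

    # 4. Iterate
    for _ in range(num_iters):
        for part, idx in zip(partitions, indices):
            w = {i: server.get(i, 0.0) for i in idx}  # pull partial model (blocking)
            grad = {i: 0.0 for i in idx}
            for x, y in part:  # x: {feature_index: value}, y: label
                err = sum(w[i] * v for i, v in x.items()) - y  # squared-loss residual
                for i, v in x.items():
                    grad[i] += err * v
            for i, g in grad.items():  # push partial update (could be async)
                server[i] = server.get(i, 0.0) - lr * g / len(part)
    return server  # 5. Done!
```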

Page 16

RESULTS

Page 17

Data
Criteo Display Advertising Challenge Dataset
• 45m examples, 34m unique features, 48 nnz / example

[Charts: unique values per feature (millions); feature occurrence (%)]

Page 18

Feature Extraction

Raw Data → StringIndexer → OneHotEncoder → VectorAssembler → OOM!

Page 19

Feature Extraction
Solution? “Stringify” + CountVectorizer

Row(i1=u'1', i2=u'1', i3=u'5', i4=u'0', i5=u'1382', ...)
↓
Row(raw=[u'i1=1', u'i2=1', u'i3=5', u'i4=0', u'i5=1382', ...])

Convert set of String features into Seq[String]
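The “stringify” step is just per-column string concatenation; a plain-Python sketch of the transform (in the real pipeline this would be a Spark UDF over the DataFrame; `stringify` is a hypothetical helper):

```python
# Turn one row of raw column values into "col=value" tokens, i.e. a
# Seq[String] that CountVectorizer or HashingTF can consume.

def stringify(row):
    # e.g. stringify({"i1": "1", "i3": "5"}) returns ["i1=1", "i3=5"]
    return ["%s=%s" % (col, val) for col, val in row.items() if val is not None]
```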

Page 20

Feature Extraction

Raw Data → Stringify → CountVectorizer

Page 21

Feature Extraction

Raw Data → Stringify → HashingTF
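HashingTF replaces the learned vocabulary with the hashing trick: each token maps directly to hash(token) mod numFeatures, with no fitting pass. A simplified sketch — Spark's HashingTF uses MurmurHash3; md5 is used here only as a stable, dependency-free stand-in:

```python
import hashlib

# Hashing-trick vectorizer sketch: no vocabulary is learned; each token is
# mapped straight to a bucket index, and term frequencies are counted there.

def hashing_tf(tokens, num_features=1 << 10):
    vec = {}
    for t in tokens:
        idx = int(hashlib.md5(t.encode("utf-8")).hexdigest(), 16) % num_features
        vec[idx] = vec.get(idx, 0) + 1  # term frequency in the hashed space
    return vec
```

No fitting pass means no giant sorted vocabulary (and no hot-spot index ranges), at the cost of possible hash collisions.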

Page 22

Performance

[Chart: total run time (s)* for k = 0, 6, 16, 32 — MLlib FM vs Glint FM. MLlib FM is N/A at the larger k values: model sizes of 1.9GB and 4.6GB run up against the 2GB broadcast limit.]

*10 iterations, 48 partitions, fit intercept

Page 23

Performance

[Chart: breakdown of iteration time — broadcast read, compute (gradient computation), aggregation & collect]

*k = 6, 10 iterations, 48 partitions, fit intercept

Page 24

Performance

[Charts: median time per iteration (s) — MLlib FM vs Glint FM, split into communication, compute and other; data transfer (MB / iteration) — MLlib FM vs Glint FM]

*k = 6, 10 iterations, 48 partitions, fit intercept

Page 25

CHALLENGES & FUTURE WORK

Page 26

Challenges
• Tuning configuration
  – Glint: models/server, message size, Akka frame size
  – Spark: data partitioning (can be seen as “mini-batch” size)
• Lack of server-side processing in Glint
  – Needed for L1 regularization, adaptive sparsity, Adagrad
  – These result in better performance, faster execution
• Backpressure / concurrency control

Page 27

Challenges
• Tuning models / server

[Charts: total runtime, and median/max iteration time, for 6, 12 and 24 models per parameter server]

*k = 16, 10 iterations, 48 partitions, fit intercept

Page 28

Challenges
• Index partitioning for “hot features”
  – CountVectorizer for features leads to hot spots & straggler tasks due to sorting by occurrence
  – OneHotEncoder OOMed... but can also face this problem
  – Spreading out features is critical (used feature hashing)

[Chart: total run time (s)* — CV features vs hashed features]

*k = 16, 10 iterations, 48 partitions, fit intercept

Page 29

Future Work
• Glint enhancements
  – Add features from DiFacto, i.e. L1 regularization, Adagrad & memory-adaptive k
  – Requires support for UDFs on the server
  – Built-in backpressure (Akka Artery / Streams?)
  – Key caching: 2x decrease in message size
• Mini-batch SGD within partitions
• Distributed solvers for ALS, MCMC, CD
• Relational data / block structure formulation
  – www.vldb.org/pvldb/vol6/p337-rendle.pdf

Page 30

References
• Factorization Machines
  – http://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
  – https://github.com/ibayer/fastFM
  – www.libfm.org
  – https://github.com/zhengruifeng/spark-libFM
  – https://github.com/scikit-learn-contrib/polylearn
• DiFacto
  – https://github.com/dmlc/difacto
  – www.cs.cmu.edu/~yuxiangw/docs/fm.pdf
• Glint / parameter servers
  – https://github.com/rjagerman/glint
  – https://github.com/dmlc/ps-lite

Page 31

SPARK SUMMIT EUROPE 2016

THANK YOU.
@MLnick
github.com/MLnick/glint-fm
spark.tc