Learn Faster: High-Performance Machine Learning on GPU Clusters
Peter Wittek
September 26, 2012
Machine Learning
What Machine Learning Is Not
- It is not statistics: statistics is data-driven, but makes strict assumptions on the underlying distributions
- It is not AI: AI is model-driven and addresses uncertainty
- It is not data mining, although there is considerable overlap
What Machine Learning Should Be About
- Data-driven
- Looking for patterns: classes, groups of similar objects
- Mainly quantitative, but can also be qualitative
- Robust, tolerates noise
- Generalizes well beyond the training data
Characteristics
- Loose collection of algorithms with no common ground
- Few assumptions
- Parameters can be a major obstacle
- Computationally intensive
- Not easy to parallelize: N:N access patterns are common, or N:K through a proxy
Nature-Inspired Methods
- Many nature-inspired methods (computational intelligence): neural networks, flocking algorithms, genetic algorithms, chemical reactions, etc.
- Also methods inspired by quantum mechanics
- Others: manifold learning, density-based clustering, support vector machines, etc.
Learning Approach
- Supervised
  - Biomedical: recognizing cancer cells
  - Recognizing handwriting
  - Spam detection
- Unsupervised
  - Recommendation engines
  - Finding groups of similar patents
  - Identifying trends in a dynamic environment
Ensembles
High-Performance Machine Learning
Why Do We Need It?
- Petabytes of data: sparse, noisy, possibly with missing elements
- There should be as few assumptions as possible
- Large scale may not entail a need for quick learning methods
A Case Study: Digital Preservation
- Adding advanced services to digital libraries
- The cloud paradigm is important
- Overview of the SHAMAN core infrastructure
Examples
- An ensemble of unsupervised methods:
  - Distributed indexing (not on GPUs)
  - Dimensionality reduction
  - Visualization of clusters
- A supervised classifier
Dimensionality Reduction: Random Projection
- Johnson-Lindenstrauss lemma (1984)
- Latent Semantic Analysis
- CPU: incremental approach
- GPU: 14.5x slow-down

$A_{w \times d} R_{d \times k} = A'_{w \times k}$
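As a rough sketch of the projection $A_{w \times d} R_{d \times k} = A'_{w \times k}$ on a sparse term-document-style matrix (the sizes and density below are illustrative, not the talk's data):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Sparse matrix A (w x d), about 1% nonzero, as in the talk.
w, d, k = 1000, 5000, 100
A = sparse.random(w, d, density=0.01, random_state=0, format="csr")

# Dense Gaussian random projection matrix R (d x k); sparser variants
# also satisfy the Johnson-Lindenstrauss lemma.
R = rng.standard_normal((d, k)) / np.sqrt(k)

# A' = A R reduces each row from d to k dimensions.
A_reduced = A @ R
print(A_reduced.shape)  # (1000, 100)
```

The lemma guarantees that pairwise distances between rows are approximately preserved with high probability, which is what makes the reduced matrix usable for downstream clustering and visualization.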
- Dead end: MapReduce
- MPI and cuSPARSE
- Very irregular, sparse matrix (~1% nonzero)

Speedup:
  GPUs                         2      4      8      16
  All terms  With I/O          -      -      -    5.10
             Projection only   -      -      -   19.37
  Subset     With I/O        2.34   3.46   4.53   5.38
             Projection only 2.02   4.05   8.10  16.45
Visualization: Self-Organizing Maps
$w_j(t+1) = w_j(t) + \alpha h_{bj}(t)\,[x(t) - w_j(t)]$

$h_{bj} = \exp\left(-\frac{\|r_b - r_j\|}{\delta(t)}\right)$

Batch formulation:

$w_j(t_f) = \frac{\sum_{t'=t_0}^{t_f} h_{bj}(t')\, x(t')}{\sum_{t'=t_0}^{t_f} h_{bj}(t')}$
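A minimal CPU sketch of one batch epoch of the self-organizing map, with a made-up 1-D map and toy data (NumPy standing in for the distributed GPU implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data and a small 1-D map of code-book vectors.
X = rng.standard_normal((200, 3))      # 200 samples, dimension 3
W = rng.standard_normal((10, 3))       # 10 map units
grid = np.arange(10, dtype=float)      # unit positions r_j on the map
delta = 2.0                            # neighbourhood radius

# One batch epoch: each weight becomes a neighbourhood-weighted
# mean of the data mapped near it.
d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)  # 200 x 10 distances
bmu = d2.argmin(axis=1)                              # best matching unit per sample
h = np.exp(-np.abs(grid[bmu][:, None] - grid[None, :]) / delta)  # h_bj
W_new = (h.T @ X) / h.sum(axis=0)[:, None]           # batch update
print(W_new.shape)  # (10, 3)
```

The batch form replaces the per-sample online update with one weighted average per epoch, which is what makes the method amenable to data-parallel execution.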
Critical operation: finding the best matching unit

$d(w_j(t_0), x(t)) = \sqrt{\sum_{i=1}^{N} (x_i(t) - w_{ji}(t_0))^2}$

Multi-step reduction to find the minimum:

1: $v_1 = (X \circ X)[1, 1 \ldots 1]'$
2: $v_2 = (W \circ W)[1, 1 \ldots 1]'$
3: $P_1 = [v_1 v_1 \ldots v_1]$
4: $P_2 = [v_2 v_2 \ldots v_2]'$
5: $P_3 = X W'$
6: $D = (P_1 + P_2 - 2 P_3)$
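The six steps above translate almost line by line into NumPy; this toy-sized check confirms that $D$ matches the direct squared-distance formula:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 4))   # data vectors as rows
W = rng.standard_normal((6, 4))   # code-book vectors as rows

# Steps 1-6: all pairwise squared Euclidean distances
# via a single dense matrix product.
v1 = (X * X) @ np.ones(X.shape[1])          # squared row norms of X
v2 = (W * W) @ np.ones(W.shape[1])          # squared row norms of W
P1 = np.tile(v1[:, None], (1, W.shape[0]))  # broadcast ||x_i||^2
P2 = np.tile(v2[None, :], (X.shape[0], 1))  # broadcast ||w_j||^2
P3 = X @ W.T
D = P1 + P2 - 2 * P3                        # D[i, j] = ||x_i - w_j||^2

# Sanity check against the direct formula.
ref = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
print(np.allclose(D, ref))  # True
bmu = D.argmin(axis=1)      # best matching unit for each data vector
```

Folding the distance computation into one matrix product is exactly what makes the step GPU-friendly: it replaces many irregular reductions with a single dense GEMM.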
Speedup:
  GPUs                         2      4      8      16
  All terms  With I/O          -      -      -    8.69
             One epoch         -      -      -    9.68
  Subset     With I/O        8.57   7.49   6.48   4.85
             One epoch       9.68   9.42   9.75   9.56
Classification: Support Vector Machines
$w'\phi(x_i) + b \geq 1 - \xi_i \quad \text{if } y_i = +1,$
$w'\phi(x_i) + b \leq -1 + \xi_i \quad \text{if } y_i = -1.$

Making a problem linearly separable after embedding into a feature space by a nonlinear map $\phi$.

Minimize $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$.

Solve the dual with the Gram matrix $K(x_i, x_j) = \phi(x_i)\phi(x_j)'$.
The training pipeline:
- Training data feeds the kernel matrix calculation (GPU)
- Cross validation: N-fold validation over different sets of parameters
- SVM parameter selection
- SVM model creation produces the final SVM model
- Around 10x speedup
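A CPU sketch of the same pipeline, with NumPy computing the Gram matrix (the step the talk offloads to the GPU) and scikit-learn's precomputed-kernel SVC doing the n-fold parameter selection; the data and the parameter grid below are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

# Toy two-class data.
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

def rbf_gram(X, gamma):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), computed in one shot.
    sq = (X ** 2).sum(1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

# Parameter selection by n-fold cross-validation over (C, gamma).
best = None
for C in (0.1, 1.0, 10.0):
    for gamma in (0.1, 1.0):
        K = rbf_gram(X, gamma)
        score = cross_val_score(SVC(C=C, kernel="precomputed"), K, y, cv=5).mean()
        if best is None or score > best[0]:
            best = (score, C, gamma)
print(best)
```

Because the Gram matrix is recomputed for every candidate gamma, it dominates the cost of parameter selection, which is why accelerating exactly this step pays off.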
Quantum-Inspired Methods
Why Is Quantum Mechanics Relevant?
- Contextual probability: $p(A \cap B) \neq p(B \cap A)$
  - If an event A happens, it implies a context
- Robust and naturally fuzzy
- Quantum probability and quantum logic: the same linear-algebra framework
- Bonus: HPC acceleration for free
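The order dependence can be checked with two non-commuting projectors on a qubit; the state and projectors below are arbitrary choices for illustration:

```python
import numpy as np

# Two non-commuting projectors: onto |0> and onto |+>.
P_A = np.array([[1.0, 0.0], [0.0, 0.0]])
plus = np.array([1.0, 1.0]) / np.sqrt(2)
P_B = np.outer(plus, plus)

# A normalized state biased towards |0>.
psi = np.array([np.cos(0.3), np.sin(0.3)])

def seq_prob(P1, P2, psi):
    # Probability of observing P1 first, then P2 (sequential measurement).
    return np.linalg.norm(P2 @ P1 @ psi) ** 2

p_AB = seq_prob(P_A, P_B, psi)   # "A then B"
p_BA = seq_prob(P_B, P_A, psi)   # "B then A"
print(p_AB, p_BA)                # the two probabilities differ
```

With commuting projectors the two orderings would agree, as in classical probability; non-commutativity is what encodes the context that observing A creates.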
Dynamic Quantum Clustering
- Semi-classical method
- Ehrenfest's theorem
- Evolves the Hamiltonian of a quantum system:

$H\psi(x, t) = (T + V(x))\,\psi(x, t)$
Speedups are impressive; the square root of a matrix is shown below.

[Figure: speedup versus matrix size (64 to 8192), without and with memory transfer]
There Is More to It
- Trotter-Suzuki algorithm
- Avoids eigendecomposition
- Linear scaling tested up to 64 GPUs
- Speedup over an SSE- and cache-optimized CPU variant: 4-8x
[Figure: execution time in seconds versus number of nodes (1 to 32) for the CPU, SSE, CUDA, and hybrid variants]
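A small single-node sketch of the idea: one second-order Trotter-Suzuki (split-operator) evolution of a 1-D wave packet in a harmonic potential, using FFTs instead of eigendecomposition. The grid size, potential, and time step are arbitrary choices for illustration:

```python
import numpy as np

# 1-D grid and the corresponding Fourier wavenumbers.
n = 256
x = np.linspace(-10, 10, n, endpoint=False)
dx = x[1] - x[0]
k = 2 * np.pi * np.fft.fftfreq(n, d=dx)

# Harmonic potential and a displaced, normalized Gaussian packet (hbar = m = 1).
V = 0.5 * x ** 2
psi = np.exp(-(x - 2.0) ** 2).astype(complex)
psi /= np.sqrt((np.abs(psi) ** 2).sum() * dx)

dt = 0.01
# Second-order splitting: exp(-iV dt/2) exp(-iT dt) exp(-iV dt/2).
half_V = np.exp(-0.5j * V * dt)
kin = np.exp(-0.5j * k ** 2 * dt)
for _ in range(100):
    psi = half_V * psi
    psi = np.fft.ifft(kin * np.fft.fft(psi))  # kinetic step in Fourier space
    psi = half_V * psi

norm = (np.abs(psi) ** 2).sum() * dx
print(round(norm, 6))  # unitary evolution keeps the norm at ~1.0
```

Each factor is diagonal in either position or momentum space, so the whole step costs two FFTs and three element-wise multiplications, which parallelizes well across GPUs.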
Going beyond current HPC: Machine learning based onactual quantum computers
Summary
- ML is about data and patterns
- A blend of algorithms; ensembles
- AI?
- Parallel and distributed computing, with challenges
- Large-scale versus HPC
- Towards a common ground: quantum-inspired methods
- Bonus: HPC with little effort