
Page 1:

Towards Scalable Machine Learning

Janis Keuper

itwm.fraunhofer.de/ml

Competence Center High Performance Computing, Fraunhofer ITWM, Kaiserslautern, Germany

Fraunhofer Center Machine Learning

Page 2:

Outline

I Introduction / Definitions

II Is Machine Learning a HPC Problem?

III Case Study: Scaling the Training of Deep Neural Networks

IV Towards Scalable ML Solutions [current Projects]

V A look at the (near) future of ML Problems and HPC

Page 3:

Machine Learning @CC-HPC

Scalable Distributed ML Algorithms
● Distributed Optimization Methods
● Communication Protocols

Distributed DL Frameworks

“Automatic” ML
● DL Meta-Parameter Learning
● DL Topology Learning

HPC Systems for Scalable ML
● Distributed I/O
● Novel ML Hardware
● Low-Cost ML Systems

DL Methods
● Semi- and Unsupervised DL
● Generative Models
● ND CNNs

ML HPC

Industry Applications
● DL Software Optimization for Hardware / Clusters
● DL for Seismic Analysis
● DL for Chemical Reaction Prediction
● DL for Autonomous Driving

I Introduction

Page 4:

I Setting the Stage | Definitions

Scalable ML vs Large Scale ML

Scalable ML:

● Large model size (implies large data as well)

● Extreme compute effort

● Goals:
  ● Larger models
  ● (Linear) strong and weak scaling through (distributed) parallelization

→ HPC

Large Scale ML:

● Very large data sets (online streams)

● “Normal” model size and compute effort (traditional ML methods)

● Goals:
  ● Make training feasible
  ● Often online training

→ Big Data

Page 5:

Scaling DNNs

[Diagram: DNN with Layer 1 … Layer n]

A simple strategy in DL: if it does not work, scale it!

Scaling in two dimensions:

1. Add more layers = more matrix multiplications, more convolutions

2. Add more units = larger matrix multiplications, more convolutions

Don't forget: in both cases

MORE DATA! → more iterations
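To make the two scaling dimensions concrete, here is a minimal sketch (plain Python; the fully connected network and the layer widths are illustrative assumptions, not taken from the slides) that counts the weights and multiply-accumulate operations of one forward pass. Adding layers and widening layers both grow the per-iteration cost, and more data multiplies that cost again by the number of iterations.

def mlp_cost(layer_widths, batch_size):
    """Rough cost model: weights and multiply-accumulates of one forward pass."""
    weights, macs = 0, 0
    for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]):
        weights += n_in * n_out              # dense weight matrix of this layer
        macs += batch_size * n_in * n_out    # GEMM cost of this layer for one batch
    return weights, macs

# Scaling dimension 1: more layers.  Scaling dimension 2: more units per layer.
print(mlp_cost([4096] * 9, 256))     # deeper network
print(mlp_cost([16384] * 5, 256))    # wider network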

Page 6:

Scaling DNNs

Network of Networks

137 billion free parameters !

Page 7:

II Is Scalable ML an HPC Problem?

● In terms of compute needed (YES)

● Typical HPC problem setting: communication bound = non-trivial parallelization (YES)

● I/O bound (New to HPC)

Page 8:

Success in Deep Learning is driven by compute power:

https://blog.openai.com/ai-and-compute/

→ # FLOPs needed to train the leading model is roughly doubling every 3.5 months!

→ Increase since 2012: a factor of ~300,000!

Page 9:

Impact on HPC (Systems)

● New HPC Systems

● e.g. ORNL “Summit”
● POWER9
● ~30k NVIDIA Volta GPUs
● New storage hierarchies

● New Users (=new demands)

● Still limited resources

https://www.nextplatform.com/2018/03/28/a-first-look-at-summit-supercomputer-application-performance/

Page 10:

III Case Study: Training DNNs

I Overview: distributed parallel training of DNNs

II Limits of Scalability

Limitation I: Communication Bounds

Limitation II: Skinny Matrix Multiplication

Limitation III: Data I/O

Based on our paper from SC 16

Page 11:

Deep Neural Networks in a Nutshell

[Diagram: DNN graph with Layers 1–6]

At an abstract level, DNNs are:

● Directed (acyclic) graphs
● Nodes are compute entities (= layers)
● Edges define the data flow through the graph

Inference / Training: forward feed of data through the network

Page 12:

Deep Neural Networks in a Nutshell

At an abstract level, DNNs are:

● Directed (acyclic) graphs
● Nodes are compute entities (= layers)
● Edges define the data flow through the graph

Inference / Training: forward feed of data through the network

[Diagram: DNN with Layer 1 … Layer n]

Common intuition

Page 13:

Minimize the loss function via gradient descent (high-dimensional and NON-CONVEX!)

Computed via the backpropagation algorithm:

1. Feed forward and compute the activations
2. Compute the error by layer
3. Compute the derivatives by layer

Training Deep Neural Networks: The Underlying Optimization Problem

Page 14:

Optimization Problem

[Diagram: DNN with Layer 1 … Layer n, forward and backward pass]

By Stochastic Gradient Descent (SGD)

1. Initialize weights W at random
2. Take a small random subset X (= batch) of the training data
3. Run X through the network (forward feed)
4. Compute the loss
5. Compute the gradient
6. Propagate backwards through the network
7. Update W

Repeat steps 2–7 until convergence
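The loop above fits in a few lines of code. The sketch below (NumPy; a logistic-regression model is a stand-in assumption for the full DNN, so "forward feed" and "backward" each collapse to a single line) follows steps 1–7 literally.

import numpy as np

def sgd_train(X_train, y_train, batch_size=64, lr=0.1, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=X_train.shape[1])    # 1. initialize W at random
    for _ in range(steps):
        idx = rng.choice(len(X_train), size=batch_size)  # 2. small random subset (batch)
        X, y = X_train[idx], y_train[idx]
        p = 1.0 / (1.0 + np.exp(-X @ W))                 # 3. forward feed
        loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # 4. loss
        grad = X.T @ (p - y) / batch_size                # 5./6. gradient via backward pass
        W -= lr * grad                                   # 7. update W
    return W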

Page 15:

Parallelization: Common Approaches to Parallelize SGD for DL

Parallelization of SGD is very hard: it is an inherently sequential algorithm

1. Start at some state t (a point in a billion-dimensional space)
2. Introduce t to data batch d1
3. Compute an update (based on the objective function)
4. Apply the update → t+1

How to gain speedup?

● Make faster updates
● Make larger updates

[Diagram: sequential SGD states t → t+1 → t+2, driven by batches d1, d2, d3]

Page 16:

Parallelization: Common Approaches to Parallelize SGD for DL

[Diagram: DNN with Layer 1 … Layer n, forward and backward pass]

Internal parallelization = parallel execution of the layer operation

● Mostly dense matrix multiplication:
  ● standard BLAS SGEMM
  ● MKL, OpenBLAS
  ● cuBLAS on GPUs

● Task parallelization for special layers
  ● Cuda-CNN for fast convolutions
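The forward pass of a fully connected layer is exactly one such dense matrix multiplication. In the sketch below (NumPy; `@` dispatches to whatever BLAS NumPy was built against, e.g. MKL or OpenBLAS, and the layer sizes are made-up values), the multi-threaded SGEMM is the internal parallelization referred to above.

import numpy as np

batch, n_in, n_out = 256, 4096, 4096                 # illustrative layer shape
X = np.random.rand(batch, n_in).astype(np.float32)   # activations from the previous layer
W = np.random.rand(n_in, n_out).astype(np.float32)   # weight matrix of this layer
b = np.zeros(n_out, dtype=np.float32)

# One fully connected layer = one SGEMM (threaded inside the BLAS) + bias + ReLU.
Y = np.maximum(X @ W + b, 0.0)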

Page 17:

Parallelization: Common Approaches to Parallelize SGD for DL

External: data parallelization over the data batch

[Diagram: a master node and worker replicas of the network (Layer 1 … Layer n), forward pass]

1. Split batch and send to workers

Page 18:

Parallelization: Common Approaches to Parallelize SGD for DL

External: data parallelization over the data batch

[Diagram: a master node and worker replicas of the network (Layer 1 … Layer n), forward and backward pass]

1. Split batch and send to workers

2. Workers compute forward+backward and send gradients to master

Page 19:

Parallelization: Common Approaches to Parallelize SGD for DL

External: data parallelization over the data batch

[Diagram: a master node and worker replicas of the network (Layer 1 … Layer n), forward and backward pass]

1. Split batch and send to workers

2. Workers compute forward+backward and send gradients to master

3. Master combines gradients and computes updates of W. Sends new W' to workers

Page 20:

Parallelization: Common Approaches to Parallelize SGD for DL

External: data parallelization over the data batch

1. Split batch and send to workers

2. Workers compute forward+backward and send gradients to master

3. Master combines gradients and computes updates of W. Sends new W' to workers
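A minimal sketch of this three-step protocol, using mpi4py and a toy logistic-regression gradient as stand-ins (the slides do not prescribe a communication library or a model, so both are assumptions): rank 0 plays the master, every rank computes a local gradient, and the master averages them and broadcasts the new weights.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n_features, local_batch, lr = 1024, 64, 0.1

# everyone starts from the same weights
W = comm.bcast(np.zeros(n_features) if rank == 0 else None, root=0)

for step in range(100):
    # 1. each worker holds its share of the batch (generated locally in this sketch)
    X = np.random.rand(local_batch, n_features)
    y = np.random.randint(0, 2, local_batch)
    # 2. forward + backward on the local shard
    p = 1.0 / (1.0 + np.exp(-X @ W))
    grad = X.T @ (p - y) / local_batch
    # 3. master combines the gradients, updates W, and sends the new W' back
    grad_sum = comm.reduce(grad, op=MPI.SUM, root=0)
    if rank == 0:
        W = W - lr * grad_sum / size
    W = comm.bcast(W, root=0)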

Speedup

Page 21:

Limitation I: Distributed SGD is heavily Communication Bound

Gradients have the same size as the model

● Model size can be hundreds of MB
● Iteration time (GPU): < 1 s
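A rough back-of-the-envelope estimate (assuming, for illustration, a 250 MB model and a 0.5 s iteration time, i.e. values in the range quoted above): every worker has to send ~250 MB of gradients and receive ~250 MB of updated weights per iteration, which already needs about 2 x 250 MB / 0.5 s = 1 GB/s of sustained bandwidth per worker, and the traffic at the master grows linearly with the number of workers.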

Page 22:

Communication Bound: Experimental Evaluation

Page 23:

Solving the Communication Bottleneck: Solutions in Hardware (example: NVLink)

NVIDIA DGX-1 / HGX-1

● 8x P100
● DGX-1: NVLink between all GPUs

NVLink spec: ~40 GB/s (single direction)
NVLink II (Volta): ~75 GB/s (single direction)

PCIe v3 x16: ~15 GB/s

Diagram DGX-1 by NVIDIA

Page 24:

Solving the Communication Bottleneck: Solutions in Hardware (NVLink Benchmark)

IBM Minsky

● 4x P100
● 2x 10-core POWER8 (160 hardware threads)
● NVLink between all components

NVLink spec: ~40 GB/s (single direction)

[Plot: Minsky Deep Learning benchmark; speedup vs. # GPUs used; AlexNet topology with NVIDIA-Caffe on ImageNet data (batch_size 1024); series: linear scaling, theoretical speedup limit, actual speedup, PCIe 4x K80]

Page 25:

Limitation I: Distributed SGD is heavily Communication Bound

How to solve this in distributed environments?

Page 26:

Limitation II: Mathematical Problems – aka the “Large Batch Size Problem”

Recall: speedup comes from pure data-parallelism

→ splitting the batch over the workers

Page 27:

Problem: the per-worker batch size decreases with distributed scaling

Hard theoretical limit: b > 0 (at least one sample per worker)

→ GoogLeNet: no scaling beyond 32 nodes
→ AlexNet: limit at 256 nodes

External parallelization hurts the internal (BLAS / cuBLAS) parallelization even earlier.

In a nutshell: for skinny matrices there is simply not enough work for efficient internal parallelization over many threads.
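The effect is easy to reproduce. The sketch below (NumPy on top of a threaded BLAS such as MKL or OpenBLAS; the layer size is an assumption) times the fully connected layer GEMM as the local batch shrinks: below some batch size the time per call stops falling, so the cost per sample rises.

import time
import numpy as np

n_in, n_out = 4096, 4096
W = np.random.rand(n_in, n_out).astype(np.float32)

for batch in (256, 64, 16, 4, 1):                 # local batch size after splitting
    X = np.random.rand(batch, n_in).astype(np.float32)
    t0 = time.perf_counter()
    for _ in range(50):
        X @ W                                     # skinny SGEMM once the batch is small
    dt = (time.perf_counter() - t0) / 50
    print(f"batch {batch:4d}: {dt * 1e3:.2f} ms per call")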

“Small Batch Size Problem”: Data Parallelization over the Batch Size

Page 28:

“Small Batch Size Problem”: Data Parallelization over the Batch Size

[Plot: SGEMM speedup vs. batch size (256 down to 1): MKL SGEMM vs. linear scaling]

Computing fully connected layers: a single dense matrix multiplication

Page 29:

Experimental Evaluation: Increasing the Batch Size

[Plot: accuracy vs. iteration for batch sizes 256, 512, 1024, 2048]

Solution proposed in the literature:

Increase the batch size

But:

Linear speedup against the original problem only if we can reduce the number of iterations accordingly,

and this leads to a loss of accuracy.

AlexNet on ImageNet

Page 30:

The “Large Batch” Problem: Central Questions

[Plot: accuracy vs. iteration for batch sizes 256, 512, 1024, 2048]

Why is this happening?

Is it dependent on the topology / other parameters?

How can it be solved?

→ large batch size SGD would solve most scalability problems!

Page 31:

The “Large Batch” Problem: What is causing this effect? [theoretical and not-so-theoretical explanations]

→ The “bad minimum”

→ gradient variance / coupling of learning rate and batch size

Page 32:

The “bad minimum” theory: a larger batch causes a decrease in gradient variance, causing convergence to local minima...

→ Empirical evaluation shows a high correlation between sharp minima and weak generalization
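One way to make the first point precise (a standard argument, stated here as background rather than taken from the cited work): for a mini-batch of b samples drawn independently from the training set, the mini-batch gradient is an unbiased estimate of the full gradient and its variance shrinks linearly with the batch size,

Var[ g_batch ] = Var[ g_single ] / b,

so very large batches produce an almost noise-free gradient, and it is exactly this noise that is believed to help SGD escape sharp minima.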

Page 33:

Limitation II: Mathematical Problems – aka the “Large Batch Size Problem”

Problem not fully understood

No general solutions (yet) !

Do we need novel optimization methods ?!

Page 34:

Limitation III: Data I/O

Huge amounts of training data need to be streamed to the GPUs

● Usually 50x–100x the training data set
● Random sampling (!)
● Latency + bandwidth competing with the optimization communication
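One standard mitigation on the software side is to overlap I/O with compute. The sketch below (plain Python threads and a bounded queue; `file_list` and `load_sample` are hypothetical placeholders) prefetches randomly sampled batches in the background so the GPUs do not stall on every read; it hides latency but does not add bandwidth.

import queue
import random
import threading

def prefetching_batches(file_list, load_sample, batch_size=64, depth=8):
    """Yield training batches while a background thread keeps the queue filled."""
    q = queue.Queue(maxsize=depth)

    def fill():
        while True:
            files = random.sample(file_list, batch_size)   # random sampling (!)
            q.put([load_sample(f) for f in files])          # blocks when the queue is full

    threading.Thread(target=fill, daemon=True).start()
    while True:
        yield q.get()                                       # the training loop consumes here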

Page 35:

Distributed I/O: Distributed File Systems are another Bottleneck!

● Network bandwidth is already exceeded by the SGD communication
● Worst possible file access pattern:
  ● Access many small files at random

This problem already has effects on local multi-GPU computations

E.g. on DGX-1 or Minsky, a single SSD (~0.5 GB/s) is too slow to feed >= 4 GPUs

→ Solution: RAID 0 with 4 SSDs

Page 36:

Distributed I/O

[Chart: compute time by layer vs. batch size, AlexNet (GPU + cuDNN); layers: Split, SoftmaxWithLoss, ReLU, Pooling, LRN, InnerProduct, Dropout, Data, Convolution, Concat. Results shown for SINGLE-node access to a Lustre working directory (HPC cluster, FDR InfiniBand).]

Distributed File Systems are another Bottleneck!

[Chart: compute time by layer vs. batch size, AlexNet (GPU + cuDNN); layers: Split, SoftmaxWithLoss, ReLU, Pooling, LRN, InnerProduct, Dropout, Convolution, Concat. Results shown for a SINGLE node with data on a local SSD.]

Page 37:

IV Towards Scalable Solutions

Distributed Parallel Deep Learning with HPC Tools + Mathematics

Page 38:

CaffeGPI: Distributed Synchronous SGD Parallelization with Asynchronous Communication Overlay

Better scalability using asynchronous PGAS programming of optimization algorithms with GPI-2.

Direct RDMA access to main and GPU memory instead of message passing.

Optimized data layers for distributed file systems.


Page 39:

CaffeGPI: Approaching DNN Communication

● Communication reduction tree (a sketch follows below)
● Communication quantization
● Communication overlay
● Direct RDMA GPU→GPU

● Based on GPI

● Optimized distributed data-layer
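To illustrate the reduction-tree bullet only (CaffeGPI itself communicates via GPI-2 with one-sided RDMA, which this sketch does not model; mpi4py is used here just to show the pattern): a binary reduction tree sums the gradients of P workers in log2(P) rounds instead of having every worker send directly to one master. The quantization and overlay techniques listed above are separate optimizations not shown here.

from mpi4py import MPI

def tree_reduce(grad, comm=MPI.COMM_WORLD):
    """Sum `grad` over all ranks along a binomial tree; the result is valid on rank 0."""
    rank, size = comm.Get_rank(), comm.Get_size()
    step = 1
    while step < size:
        if rank % (2 * step) == 0:                    # receiver in this round
            partner = rank + step
            if partner < size:
                grad = grad + comm.recv(source=partner, tag=step)
        else:                                         # sender: pass the partial sum up and stop
            comm.send(grad, dest=rank - step, tag=step)
            break
        step *= 2
    return grad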

CC-HPC Current Projects

Page 40:

CaffeGPI: Open Source

https://arxiv.org/abs/1706.00095

https://github.com/cc-hpc-itwm/CaffeGPI

Page 41:

Projects: Low Cost Deep Learning Setup, built on CaffeGPI

[Diagram: two workstations, each with two GTX 1080 GPUs, a local SSD, and two FDR InfiniBand ports, connected via CaffeGPI]

Price: <8000 EUR standard components

Specs:

● 32 GB GPU memory
● 64 GB PGAS memory
● 2 TB BeeGFS for training data
● GPU interconnect: PCIe

Page 42:

CaffeGPI: Benchmarks: Single Node

Specs:

4x K80 GPU (PCIe interconnect)
CUDA 8
cuDNN 5.1

Topology: AlexNet
Batch size (global): 1024

Page 43:

Low Cost Deep Learning Setup, built on CaffeGPI

Specs: Topology: GoogLeNet, Cuda 8, cuDNN 5.1, CaffeNV 16.4, Batch Size/Node: 64

[Chart: time till convergence (training time in h) vs. #GPUs (1, 2, 4, 8): DGX-1 CaffeNV vs. ITWM CaffeGPI]

[Chart: scaling, speedup vs. #GPUs (1, 2, 4, 8): DGX-1 CaffeNV vs. ITWM CaffeGPI; plotted speedups 1.00, 1.96, 3.80, 6.84 and 1.00, 1.73, 3.21]

Page 44:

CaffeGPI: Benchmarks

Specs:

1x K80 GPU (per node)
CUDA 8
cuDNN 5.1

Topology: GoogLeNet
Batch size: 64 per node

[Chart: distributed scaling of CaffeGPI, speedup vs. #Nodes: 1.00 (1 node), 1.97 (2), 3.97 (4), 7.97 (8), 15.93 (16)]

Page 45:

New Project: Low Cost Deep Learning Cluster

GPU Cluster for Fraunhofer Consortium @ITWM

● Low-cost hardware
  ● Consumer GPUs
  ● Novel AMD architecture
  ● Hosting cost per GPU ~1.25k EUR, compared to ~10k for a DGX-1

● Fast interconnect and data I/O
  ● Parallel FS with local NVMe

● Open source multi-user management
  ● Reservation system
  ● Scheduling
  ● Custom containers
  ● Web interface

[Diagram: cluster node with two GTX 1080 Ti GPUs, a local SSD, and two FDR InfiniBand ports]

Page 46:

Low Cost Deep Learning Setup: Currently building a prototype 16-node system


[Diagram: prototype cluster layout. 16 compute nodes (N0–N15) attached to a 32-port FDR InfiniBand switch; two BeeGFS storage servers and a BeeGFS parameter server; management node, test + backup node, BeeGFS management server, LDAP server, proxy node with access control, SLURM server, monitoring server; UPS and emergency power; user access from external IPs via the proxy.]

Page 47:

Project Carme: An open source software stack for multi-user GPU clusters

Carme (/ˈkɑːrmiː/ KAR-mee; Greek: Κάρμη) is a Jupiter moon, also giving the name for a Cluster of Jupiter moons (the carme group).

Or in our case:

an open source framework to manage resources for multiple users running Jupyter notebooks on a cluster of compute nodes.

Page 48:

Project Carme: An open source software stack for multi-user GPU clusters

Common problems in GPU cluster operation:

● Interactive, secure multi-user environment
  ● ML and data science users want interactive GUI access to compute resources

● Resource management
  ● How to assign (GPU) resources to competing users?
  ● User management
  ● Accounting
  ● Job scheduling
  ● Resource reservation

● Data I/O
  ● Get user data to the compute nodes (I/O bottleneck)

● Maintenance
  ● Meet the (fast-changing and diverse) software demands of users

Page 49:

Project Carme: An open source software stack for multi-user GPU clusters

Carme core idea:

● Combine established open source ML and DS tools with HPC back-ends
● Use containers
  ● (for now) Docker
● Use Jupyter notebooks as the main web-based GUI front-end
  ● All web front-end (OS independent, no installation needed on the user side)
● Use HPC job management and scheduling
  ● SLURM
● Use HPC data I/O technology
  ● ITWM's BeeGFS
● Use HPC maintenance and monitoring tools

Page 50:


Project Carme: An open source software stack for multi-user GPU clusters

[Diagram: Carme architecture. Users connect through a proxy / management server (user DB, SLURM, container storage, BeeGFS manager, monitoring) to the compute nodes N0 … Nn, backed by BeeGFS storage and parameter servers.]

Page 51:

HP-DLF

High Performance Deep Learning Framework

[Diagram: HP-DLF architecture. A DNN graph (imported from Caffe, TensorFlow, ...) is fed to a compiler together with a performance model, the hardware topology and the supported hardware; the compiler produces a Petri net and a data model for the runtime system (workflow engine, scheduler), which drives the data flow over a global address space.]

● Scalable
● Transparent
● Generic
● Auto-parallel
● Elastic
● Automatic data flow
● Automatic hardware selection
● Portable
● Monitoring
● Simulation
● New optimization methods

Page 52:

Our asynchronous optimization algorithm

● Sparse communication for multi-model optimization
● Lower demands on the communication bandwidth
● Superior convergence

Optimization for HP-DLF

Page 53:

V A Look at the (near) Future

For HPC:

● New Hardware Accelerators ● New Interconnects● New Architectures

From ML:

● Models are still growing!
● Learning to learn

Page 54:

Automatic Design of Deep Neural Networks

[Diagram: DNN graph with Layers 1–6]

Basically, a graph optimization problem:

● Select node types
● And their meta-parameters
● Connect edges to define the data flow through the graph

Optimization target: minimize test error

Problems:
● Huge and difficult search space
● Each iteration requires training a DNN
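To make the search-space point concrete, here is a deliberately naive random-search sketch (the layer types, the meta-parameter choices and the `train_and_evaluate` helper are all hypothetical); the expensive part is that every call to that helper is a complete DNN training run.

import random

LAYER_TYPES = ["conv3x3", "conv5x5", "maxpool", "fully_connected"]   # illustrative node types

def sample_topology(max_layers=10):
    """Sample a random chain topology: a node type plus one meta-parameter per layer."""
    n_layers = random.randint(2, max_layers)
    return [(random.choice(LAYER_TYPES), random.choice([32, 64, 128, 256]))
            for _ in range(n_layers)]

def random_topology_search(train_and_evaluate, budget=100):
    """Keep the topology with the lowest test error; every iteration trains a full DNN."""
    best_topology, best_error = None, float("inf")
    for _ in range(budget):
        topology = sample_topology()
        error = train_and_evaluate(topology)          # hours of GPU time per call
        if error < best_error:
            best_topology, best_error = topology, error
    return best_topology, best_error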

Page 55:

Automatic Design, Current Approaches: Reinforcement Learning

Evaluation on CIFAR-10: better than “state of the art” (hand-designed) performance
● After ~12,500 iterations
● Compute time: ~10,000 GPU days

Page 56:

DeToL

Deep Topology Learning
● Genetic algorithms
● Reinforcement learning
● Early stopping
● Graph embedding
● Meta-learning
● Pruning

On application-size problems

Data basis generation

Page 57:

Discussion