
Page 1

Przemek Tredak, Simon Layton

S8906: FAST DATA PIPELINES FOR DEEP LEARNING TRAINING

Page 2

THE PROBLEM

Page 3

CPU BOTTLENECK OF DL TRAINING

• Multi-GPU, dense systems are more common (DGX-1V, DGX-2)

• Using more cores / sockets is very expensive

• CPU to GPU ratio becomes lower:

• DGX-1V: 40 cores / 8 GPUs = 5 cores per GPU

• DGX-2: 48 cores / 16 GPUs = 3 cores per GPU

[Chart: CPU : GPU ratio]

Page 4

CPU BOTTLENECK OF DL TRAINING
Complexity of I/O pipeline

[Diagram: example I/O pipelines]

AlexNet: 256x256 image -> 224x224 crop and mirror -> Training

ResNet 50: 480p image -> Random resize -> Color augment -> 224x224 crop and mirror -> Training

Page 5

CPU BOTTLENECK OF DL TRAINING

Increased complexity of the CPU-based I/O pipeline + higher GPU to CPU ratio

[Chart: CPU vs. GPU throughput over time]

Page 6

LOTS OF FRAMEWORKS

Frameworks have their own I/O pipelines (often more than 1!)

Lots of duplicated effort to optimize them all

Training process is not portable even if the model is (e.g. via ONNX)

[Diagram: per-framework I/O stacks, each needing lots of optimization effort]

Caffe2: ImageInputOp, Python

MXNet: ImageRecordIter, ImageIO, Python

TensorFlow: Dataset, manual graph construction, Python

Page 7

LOTS OF FRAMEWORKS

Optimized I/O pipelines are not flexible and often unsuitable for research


train = mx.io.ImageRecordIter(
    path_imgrec = args.data_train,
    path_imgidx = args.data_train_idx,
    label_width = 1,
    mean_r = rgb_mean[0],
    mean_g = rgb_mean[1],
    mean_b = rgb_mean[2],
    data_name = 'data',
    label_name = 'softmax_label',
    data_shape = image_shape,
    batch_size = 128,
    rand_crop = True,
    max_random_scale = 1,
    pad = 0,
    fill_value = 127,
    min_random_scale = 0.533,
    max_aspect_ratio = args.max_random_aspect_ratio,
    random_h = args.max_random_h,
    random_s = args.max_random_s,
    random_l = args.max_random_l,
    max_rotate_angle = args.max_random_rotate_angle,
    max_shear_ratio = args.max_random_shear_ratio,
    rand_mirror = args.random_mirror,
    preprocess_threads = args.data_nthreads,
    shuffle = True,
    num_parts = 1,
    part_index = 0)

vs

image, _ = mx.image.random_size_crop(image, (data_shape, data_shape),
                                     0.08, (3/4., 4/3.))
image = mx.nd.image.random_flip_left_right(image)
image = mx.nd.image.to_tensor(image)
image = mx.nd.image.normalize(image, mean=(0.485, 0.456, 0.406),
                              std=(0.229, 0.224, 0.225))
return mx.nd.cast(image, dtype), label

Inflexible but fast vs. flexible but slow

Page 8

SOLUTION: ONE LIBRARY

• Centralize the effort

• Integrate into all frameworks

• Provide both flexibility and performance

[Diagram: DALI as a single I/O library underneath MXNet, Caffe2, PyTorch, TF, etc.]

Page 9

DALI: OVERVIEW

Page 10

DALI

• Flexible, high-performance image data pipeline

• Python / C++ frontends with C++ / CUDA backend

• Minimal (or no) changes to the frameworks required

• Full pipeline - from disk to GPU, ready to train

• OSS (soon)

[Diagram: Framework <-> Plugin <-> DALI]

Page 11

GRAPH WITHIN A GRAPH

Data pipeline is just a (simple) graph

I/O in Frameworks today

[Diagram: JPEG -> Loader -> Decode -> Resize -> Augment -> Training, producing Images and Labels; today the whole I/O graph runs on the CPU and only Training runs on the GPU]

Page 12

GPU OPTIMIZED PRIMITIVES

High performance, GPU optimized implementations

[Diagram: the same pipeline with DALI; Resize and Augment now run as GPU operators]

Page 13

GPU ACCELERATED JPEG DECODE

Hybrid approach to JPEG decoding: can move fully to GPU in the future

[Diagram: DALI with nvJPEG; JPEG decode is split, with Huffman decoding on the CPU and the rest of the decode, plus Resize and Augment, on the GPU]

Page 14

SET YOUR DATA FREE

Use any file format in any framework

[Diagram: DALI reads all of the following formats]

• LMDB (Caffe, Caffe2)

• RecordIO (MXNet)

• TFRecord (TensorFlow)

• List of JPEGs (PyTorch, others)
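To make this concrete, a minimal sketch in the style of the pipeline API shown later in this deck: only the reader depends on the storage format, while the rest of the graph is untouched. Caffe2Reader matches the ResNet-50 example below; TFRecordReader and the paths are assumed names for illustration.

import dali
import dali.ops as ops

class DecodePipe(dali.Pipeline):
    # The reader is a constructor argument, so one pipeline class
    # works for LMDB, RecordIO, TFRecord, or raw JPEG lists.
    def __init__(self, reader, batch_size, num_threads, device_id):
        super(DecodePipe, self).__init__(batch_size, num_threads, device_id)
        self.loader = reader
        self.decode = ops.HybridDecode(output_type=dali.types.RGB)

    def define_graph(self):
        jpeg, labels = self.loader(name="Reader")
        return [self.decode(jpeg), labels]

# Same graph, different on-disk formats (paths are placeholders):
caffe_pipe = DecodePipe(ops.Caffe2Reader(path="/data/train_lmdb"), 128, 2, 0)
# tf_pipe = DecodePipe(ops.TFRecordReader(path="/data/train.tfrecord"), 128, 2, 0)  # assumed reader name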

Page 15

BEHIND THE SCENES: PIPELINE

Page 16

PIPELINE
Overview

One pipeline per GPU. The same logic applies to multithreaded and multiprocess frameworks.

[Diagram: one pipeline per GPU, all feeding the framework]

Page 17

PIPELINE
Overview

Single direction, 3 stages: CPU -> Mixed -> GPU

[Diagram: CPU -> Mixed -> GPU stages feeding the framework]
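The single-direction rule is easy to state precisely. A conceptual sketch in plain Python (not DALI code): every edge in the graph must go from a stage to the same or a later stage, never backwards.

# Stages in execution order; an edge (src, dst) is legal iff src <= dst.
STAGE_ORDER = {"cpu": 0, "mixed": 1, "gpu": 2}

def is_single_direction(edges):
    return all(STAGE_ORDER[src] <= STAGE_ORDER[dst] for src, dst in edges)

# CPU -> Mixed -> GPU (plus GPU-internal edges) is fine:
print(is_single_direction([("cpu", "mixed"), ("mixed", "gpu"), ("gpu", "gpu")]))  # True
# A GPU -> CPU edge goes backwards, so it violates the rule:
print(is_single_direction([("gpu", "cpu")]))  # False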

Page 18

PIPELINE
Overview

Simple scheduling of operations

[Diagram: operators 1-9 placed across the CPU, Mixed, and GPU stages, feeding the framework]
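A conceptual sketch of such simple scheduling in plain Python (not DALI internals): with a single-direction graph, operators run in topological order, grouped stage by stage. The numbers mirror the 1-9 labels on the slide; their exact stage assignment here is a guess based on the following slides.

ops_by_stage = {
    "cpu":   [1, 2, 3, 4, 5],  # per-sample CPU work, e.g. loading and parsing
    "mixed": [6, 7],           # CPU -> GPU bridge operators
    "gpu":   [8, 9],           # batched GPU kernels
}

def schedule(ops_by_stage):
    # Stages run in a fixed order; within a stage, any topological order works.
    for stage in ("cpu", "mixed", "gpu"):
        for op in ops_by_stage[stage]:
            yield stage, op

for stage, op in schedule(ops_by_stage):
    print(stage, op)   # cpu 1 ... mixed 6 ... gpu 9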

Page 19

PIPELINE
CPU

Operations processed per-sample in a thread pool

[Diagram: CPU operators 1-5 applied sample by sample on several worker threads]
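A conceptual illustration in plain Python, with a thread pool standing in for DALI's worker threads and trivial arithmetic standing in for the per-sample operators:

from concurrent.futures import ThreadPoolExecutor

def cpu_stage(sample):
    # Each worker thread carries one sample through the whole CPU stage.
    for op in (lambda s: s + 1, lambda s: s * 2):  # stand-in operators
        sample = op(sample)
    return sample

batch = [0, 1, 2, 3]  # one entry per sample
with ThreadPoolExecutor(max_workers=4) as pool:
    print(list(pool.map(cpu_stage, batch)))  # [2, 4, 6, 8]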

Page 20

PIPELINE
GPU

Batched processing of data

[Diagram: GPU operators 8 and 9 each process the whole batch at once]
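By contrast, a GPU-stage operator sees the whole batch at once, so a single launch covers all samples. A conceptual sketch with NumPy standing in for the batched CUDA kernels:

import numpy as np

# One NCHW batch: 128 samples handled by one vectorized call,
# instead of 128 per-sample calls.
batch = np.random.rand(128, 3, 224, 224).astype(np.float32)
mean = np.array([128., 128., 128.], dtype=np.float32).reshape(1, 3, 1, 1)
normalized = batch - mean   # broadcasts over the whole batch
print(normalized.shape)     # (128, 3, 224, 224)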

Page 21

PIPELINE
Mixed

A bridge between CPU and GPU: per-sample input, batched output.

Also used for batching CPU data (for CPU outputs of the pipeline).

[Diagram: a Mixed operator gathering per-sample inputs into one batch]
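A conceptual sketch of the batching side of a Mixed operator, with NumPy on the host standing in; in DALI the batched output would typically land in GPU memory:

import numpy as np

# Per-sample buffers arrive one at a time from the CPU stage...
samples = [np.random.rand(3, 224, 224).astype(np.float32) for _ in range(128)]
# ...and leave as a single contiguous batch.
batch = np.stack(samples)
print(batch.shape, batch.flags["C_CONTIGUOUS"])  # (128, 3, 224, 224) True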

Page 22

EXECUTOR
Pipelining the pipeline

CPU, Mixed and GPU stages need to be executed serially

But each batch of data is independent…

[Timeline: without pipelining, batches run back to back: CPU 1 -> Mixed 1 -> GPU 1 -> CPU 2 -> Mixed 2 -> GPU 2 -> CPU 3 -> …]

Page 23

EXECUTOR
Pipelining the pipeline

Each stage is asynchronous

Stages of a given batch are synchronized via events

[Timeline: stages overlap across batches; while GPU 1 runs, Mixed 2 and CPU 3 are already in flight]
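A conceptual sketch of this overlap in plain Python: threads and queues stand in for the asynchronous stages, and a queue entry plays the role of the event that tells the next stage a batch is ready.

import queue
import threading

def stage(name, inbox, outbox):
    while True:
        batch = inbox.get()
        if batch is None:                  # shutdown marker
            if outbox is not None:
                outbox.put(None)
            return
        print("%s finished batch %d" % (name, batch))  # stand-in for real work
        if outbox is not None:
            outbox.put(batch)              # signal: batch ready downstream

cpu_q, mixed_q, gpu_q = queue.Queue(), queue.Queue(), queue.Queue()
workers = [threading.Thread(target=stage, args=("CPU", cpu_q, mixed_q)),
           threading.Thread(target=stage, args=("Mixed", mixed_q, gpu_q)),
           threading.Thread(target=stage, args=("GPU", gpu_q, None))]
for w in workers:
    w.start()
for b in (1, 2, 3):                        # batches flow through overlapping stages
    cpu_q.put(b)
cpu_q.put(None)
for w in workers:
    w.join()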

Page 24

OPERATORS
Gallery

[Gallery of example operator outputs]

Page 25

USING DALI

Page 26

EXAMPLE: RESNET-50 PIPELINE
Pipeline class

import dali
import dali.ops as ops

class HybridRN50Pipe(dali.Pipeline):
    def __init__(self, batch_size, num_threads, device_id, num_devices):
        super(HybridRN50Pipe, self).__init__(batch_size, num_threads,
                                             device_id)
        # define used operators
        pass

    def define_graph(self):
        # define graph of operations
        pass

Page 27

EXAMPLE: RESNET-50 PIPELINE
Defining operators

def __init__(self, batch_size, num_threads, device_id, num_devices):
    super(HybridRN50Pipe, self).__init__(batch_size, num_threads,
                                         device_id)
    self.loader = ops.Caffe2Reader(path=lmdb_path, shard_id=device_id,
                                   num_shards=num_devices)
    self.decode = ops.HybridDecode(output_type=dali.types.RGB)
    self.resize = ops.Resize(device="gpu", resize_a=256,
                             resize_b=480, random_resize=True,
                             image_type=dali.types.RGB)
    self.crop = ops.CropMirrorNormalize(device="gpu",
                                        random_crop=True, crop=(224, 224),
                                        mirror_prob=0.5, mean=[128., 128., 128.],
                                        std=[1., 1., 1.],
                                        output_layout=dali.types.NCHW)

Page 28

EXAMPLE: RESNET-50 PIPELINE
Defining graph

def define_graph(self):
    jpeg, labels = self.loader(name="Reader")
    images = self.decode(jpeg)
    resized_images = self.resize(images)
    cropped_images = self.crop(resized_images)
    return [cropped_images, labels]

[Graph: Loader -> jpeg -> Decode -> Resize -> Crop -> MakeContiguous -> Data; Loader -> labels -> Label]

Page 29

EXAMPLE: RESNET-50 PIPELINE
Usage: MXNet

import mxnet as mx
from dali.plugin.mxnet import DALIIterator

pipe = HybridRN50Pipe(128, 2, 0, 1)
pipe.build()
train = DALIIterator(pipe, pipe.epoch_size("Reader"))

model.fit(train,
          # other parameters
          )

Page 30

EXAMPLE: RESNET-50 PIPELINE
Usage: TensorFlow

import tensorflow as tf
from dali.plugin.tf import DALIIterator

pipe = HybridRN50Pipe(128, 2, 0, 1)
serialized_pipe = pipe.serialize()
train = DALIIterator()

with tf.Session() as sess:
    images, labels = train(serialized_pipe)
    # rest of the model using images and labels
    sess.run(...)

Page 31

EXAMPLE: RESNET-50 PIPELINE
Usage: Caffe2

from caffe2.python import brew

pipe = HybridRN50Pipe(128, 2, 0, 1)
serialized_pipe = pipe.serialize()
data, label = brew.dali_input(model, ["data", "label"],
                              serialized_pipe=serialized_pipe)

# Add the rest of your network as normal
conv1 = brew.conv(model, data, "conv1", …)

Page 32

PERFORMANCE

Page 33

PERFORMANCE
I/O Pipeline

[Chart: Throughput (images/second) of the RN50 pipeline on DGX-2, batch 128, NCHW; the compared configurations reach 5150, 5450, 8000, 14350, and 23000 images/s]

Page 34

PERFORMANCE
End-to-end training

[Chart: End-to-end RN50 training throughput (images/second) on DGX-2 with MXNet, batch 192 per GPU; Native: 8000, DALI: 15500, Synthetic: 17000]

Page 35

NEXT STEPS

Page 36

NEXT: MORE WORKLOADS
Segmentation

def define_graph(self):
    images, masks = self.loader(name="Reader")
    images = self.decode(images)
    masks = self.decode(masks)
    # Apply identical transformations
    resized_images, resized_masks = self.resize([images, masks], …)
    cropped_images, cropped_masks = self.crop([resized_images,
                                               resized_masks], …)
    return [cropped_images, cropped_masks]

Page 37

NEXT: MORE FORMATS

What would be useful to you?

• PNG

• Video frames

Page 38

NEXT++: MORE OFFLOADING

Fully GPU-based decode

HW-based decode via NVDEC

Transcode to video

Page 39

SOON: EARLY ACCESS

Looking for:

General feedback

New workloads

New transformations

Contact: Milind Kukanur

[email protected]

Page 40

ACKNOWLEDGEMENTS

Trevor Gale

Andrei Ivanov

Serge Panev

Cliff Woolley

DL Frameworks team @ NVIDIA
