Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016


Applying Deep Learning at Facebook Scale

Director of Engineering, Applied Machine Learning

Hussein Mehanna

Applications of deep learning
• Event prediction
• Machine translation
• Large scale computer vision
• Natural language processing

Applications of deep learning: event prediction

Why should I like this story?

1B+ new stories every day

+ billions of stories from this day in years past

A billion people, thousands of stories each

Ranked in milliseconds

Deep learning for ranking

[Diagram: the story title feeds a deep learning text model; sparse user features such as "I like soccer", "I am from Australia", "I am 26", "I traveled to Argentina" feed the ranking model]

Massive sparse logistic regression + deep neural networks

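To make the "massive sparse logistic regression + deep neural networks" combination concrete, here is a minimal NumPy sketch of that hybrid scoring idea. Everything here (shapes, feature ids, the predict helper) is an illustrative assumption, not Facebook's actual ranking stack.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature space: 1M sparse binary features ("likes soccer",
# "from Australia", ...), only a handful active per user/story pair.
NUM_SPARSE, NUM_DENSE, HIDDEN = 1_000_000, 16, 32

w_sparse = rng.normal(0, 0.01, NUM_SPARSE)       # sparse logistic regression weights
W1 = rng.normal(0, 0.1, (NUM_DENSE, HIDDEN))     # small deep net on dense features
w2 = rng.normal(0, 0.1, HIDDEN)

def predict(active_ids, dense_x):
    """Score = sigmoid(sparse-LR logit + deep-net logit)."""
    lr_logit = w_sparse[active_ids].sum()        # dot product with one-hot sparse features
    h = np.maximum(dense_x @ W1, 0.0)            # one ReLU hidden layer
    dnn_logit = h @ w2
    return 1.0 / (1.0 + np.exp(-(lr_logit + dnn_logit)))

# One user/story pair: a few active sparse features plus dense features.
p = predict(np.array([42, 1337, 999_999]), rng.normal(size=NUM_DENSE))
print(f"P(like) = {p:.3f}")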

Applications of deep learning: machine translation

Machine translation with neural networks

Recurrent neural networks with an attention decoder

[Diagram: encoder input → encoded states → attention model → decoder. Example: "Gonna have some fun today" → "Vamos a divertirnos hoy" (Spanish: "Let's have fun today")]
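The attention model above can be sketched in a few lines: at each decoder step, score every encoded source state against the current decoder state, softmax the scores, and take the weighted average as context. This toy NumPy version uses random vectors and plain dot-product scoring; the actual models learn these states with RNNs.

import numpy as np

def attention(decoder_state, encoder_states):
    """Dot-product attention: weight each encoded source word by its
    relevance to the current decoder state, then average."""
    scores = encoder_states @ decoder_state        # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over source positions
    context = weights @ encoder_states             # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
src_len, d = 5, 8                                  # e.g. "Gonna have some fun today"
encoder_states = rng.normal(size=(src_len, d))     # one encoded state per source word
decoder_state = rng.normal(size=d)                 # state before emitting the next word

context, weights = attention(decoder_state, encoder_states)
print("attention over source words:", np.round(weights, 3))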

Applications of deep learning: large scale computer vision

[Video demos]

Hundreds of convolutional neural networks run on photos uploaded to Facebook

Classification · Detection · Segmentation (e.g. person, plate, drink)

Improving Inference for deep learning
• Compute faster
• Memory usage in deep networks
• Compress models

Compute faster: faster convolution algorithms for deep learning

Convolutions account for 90%+ of runtime for modern vision models

Convolution implementation strategies:
• 2013: im2col + sgemm
• 2014: FFT
• 2015: tiled FFT, Winograd

NNPACK: "CuDNN for CPUs"
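As a quick illustration of the 2013-era strategy in the list above, a convolution can be lowered to a single large matrix multiply via im2col. A minimal NumPy sketch, assuming unit stride, no padding, and made-up shapes:

import numpy as np

def conv2d_im2col(x, w):
    """Convolution as im2col + one GEMM. x: (C, H, W), w: (K, C, R, S)."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    out_h, out_w = H - R + 1, W - S + 1
    cols = np.empty((C * R * S, out_h * out_w))
    for i in range(out_h):                         # unroll each receptive field
        for j in range(out_w):                     # into one column
            cols[:, i * out_w + j] = x[:, i:i + R, j:j + S].ravel()
    return (w.reshape(K, -1) @ cols).reshape(K, out_h, out_w)  # the sgemm

x = np.random.rand(3, 8, 8)                        # 3 channels, 8x8 image
w = np.random.rand(4, 3, 3, 3)                     # 4 filters, 3x3 kernels
print(conv2d_im2col(x, w).shape)                   # (4, 6, 6)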

NNPACK (github.com/Maratyszcza/NNPACK)

Easy integration: CuDNN-style C interface, easy to integrate

Supports the computationally-intensive layers:
• Convolutions (tiled FFT, Winograd)
• Pooling
• Fully connected layers (GEMM/GEMV)

Implementation: via an x86-64 meta-assembler (PeachPy)

Excellent performance: 2x-6x vs. baseline CPU

Open source, integrated into several deep learning frameworks:
• Caffe/Caffe2: github.com/ajtulloch/caffe/tree/nnpack-pr
• Torch: github.com/szagoruyko/nnpack.torch
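The Winograd algorithms behind NNPACK's fast convolutions can be seen in miniature in F(2,3), which computes two outputs of a 3-tap filter with 4 multiplications instead of 6. A 1-D sketch for illustration only; production kernels use tiled 2-D variants:

import numpy as np

def winograd_f23(d, g):
    """Winograd F(2,3): two 3-tap filter outputs from four inputs,
    using 4 multiplies (m1..m4) instead of the direct method's 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])                 # input tile
g = np.array([0.5, -1.0, 2.0])                     # 3-tap filter
direct = np.array([d[0:3] @ g, d[1:4] @ g])        # plain sliding-window result
print(np.allclose(winograd_f23(d, g), direct))     # True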

Improving Inference for deep learning: memory usage in deep networks

The Memory Andy-Bill Theorem

Trend:
• ResNets in vision
• deep LSTMs in language modeling

Scale constraints:
• GPU memory relatively stable (12GB on Titan X/M40, 16GB on P100)
• CPU memory has many constraints, especially in applied settings

Spend is in activations: the bulk of memory is in the activations – must reuse

Memory savings for modern ConvNets: 50%-90%

Ideas from compilers: view activations as virtual registers and run a register allocator (graph coloring on the interference graph)

Run inference in an O(N)-ResNet in O(1) memory!

[Chart: memory savings for AlexNet and the Inception Network]
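One way to picture the register-allocation trick: treat each activation's lifetime as an interval and reuse any buffer whose current occupant is already dead. This toy sketch assumes a linear chain in which each activation dies as soon as the next layer consumes it; real allocators color a general interference graph:

def assign_buffers(live):
    """live: (first_use, last_use) per activation, sorted by first_use.
    Greedily maps each activation to a reusable buffer id."""
    buffers = []                          # last_use of each buffer's occupant
    assignment = []
    for start, end in live:
        for b, busy_until in enumerate(buffers):
            if busy_until < start:        # occupant dead: reuse this buffer
                buffers[b] = end
                assignment.append(b)
                break
        else:
            buffers.append(end)           # all buffers busy: allocate a new one
            assignment.append(len(buffers) - 1)
    return assignment

# 8 chained activations, each dead after the next layer consumes it:
print(assign_buffers([(i, i + 1) for i in range(8)]))
# [0, 1, 0, 1, 0, 1, 0, 1]: two physical buffers, i.e. O(1) memory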

Some implementations:
• MXNet: github.com/dmlc/mxnet-memonger
• Caffe/Caffe2: github.com/facebook/fb-caffe-exts
• Torch: github.com/fmassa/optimize-net

Can go further and explicitly trade off compute for memory: ResNet-1000 from 48GB to 7GB for 30% slower timings
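The explicit trade-off can be sketched as checkpointing: keep only every k-th activation and recompute the others from the nearest checkpoint when needed, so resident memory shrinks at the cost of extra forward compute. A toy version with scalar stand-in "layers"; the function names are hypothetical, not any framework's API:

def forward_with_checkpoints(x, layers, k=4):
    """Run the chain, keeping only every k-th activation resident."""
    checkpoints = {0: x}
    h = x
    for i, layer in enumerate(layers, start=1):
        h = layer(h)
        if i % k == 0:
            checkpoints[i] = h            # 1-in-k activations stay in memory
    return h, checkpoints

def activation_at(i, layers, checkpoints, k=4):
    """Recompute activation i from the nearest earlier checkpoint."""
    base = (i // k) * k
    h = checkpoints[base]
    for layer in layers[base:i]:
        h = layer(h)                      # extra compute instead of extra memory
    return h

layers = [lambda v, j=j: v * 2 + j for j in range(16)]   # stand-in layers
out, cps = forward_with_checkpoints(3.0, layers)
print(activation_at(7, layers, cps))      # matches a full forward pass to layer 7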

Improving Inference for deep learning: compress models

Deep compression pipeline (Han et al.)

Original network: original size, original accuracy

Pruning (fewer weights):
• Train connectivity
• Prune connections
• Train weights
→ 10x reduction, same accuracy

Quantization (fewer bits per weight):
• Cluster the weights
• Generate code book
• Quantize the weights with the code book
• Retrain code book
→ 27x-31x reduction, same accuracy

Huffman encoding:
• Encode weights
• Encode index
→ 35x-50x reduction, same accuracy

All together: Pruning + Quantization + Huffman coding (VGG-16)
• Model size: 552MB → 11.3MB (49x reduction)
• Top-1 error: 31.50% → 31.17%
• Top-5 error: 11.32% → 10.91%
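A toy rendition of the pipeline's first two stages, with a crude 1-D k-means standing in for the paper's codebook learning (the retraining and Huffman steps are omitted, and all shapes and thresholds here are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 1, (64, 64))                     # a toy weight matrix

# Pruning: drop small-magnitude weights (fewer weights).
mask = np.abs(W) > np.quantile(np.abs(W), 0.9)     # keep only the top 10%
W_pruned = W * mask

def kmeans_1d(values, k=16, iters=20):
    """Cluster surviving weights into a 16-entry code book (4-bit codes)."""
    centroids = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = values[idx == c].mean()
    return centroids, idx

codebook, codes = kmeans_1d(W_pruned[mask])        # store 4-bit codes + 16 floats
W_quantized = np.zeros_like(W)
W_quantized[mask] = codebook[codes]                # decode: look up each weight
print(f"kept {mask.mean():.0%} of weights, "
      f"max quantization error {np.abs(W_quantized - W_pruned).max():.3f}")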

Applications of deep learning: event prediction, machine translation, large scale computer vision, natural language processing

Improving Inference for deep learning: compute faster, memory usage in deep networks, compress models
