A Powerful, Flexible, and Intuitive Deep Learning Framework


@ NVIDIA GTC, April 6th, 2016

Shohei Hido

Chief Research Officer

Preferred Networks, Inc.

Overview

- Chainer is a Python-based deep learning framework
- Chainer v1.0 was released as open source in June 2015
- It does NOT rely on Theano, unlike other Python frameworks
- Chainer uses a unique scheme named Define-by-Run

http://chainer.org/

- Why do users still need another framework?
- How different and effective is Chainer?


Preferred Networks (PFN): a startup that applies deep learning to industrial IoT

- Founded: March 2014
- Headquarters: Tokyo, Japan
- U.S. subsidiary: San Mateo, California
- Company size: 35 engineers & researchers
- Investors: Toyota, FANUC, NTT

[Figure: deep learning applied to industrial IoT in manufacturing, automotive, and healthcare]

Partnering with world-leading companies using Chainer

- R&D collaborations on industrial problems with real-world data
  - Specific requirements, modified algorithms, many trials and errors, etc.
  - Different from building a general-purpose recognition system

[Partner logos: Toyota, FANUC, Panasonic, NTT, Cisco, NVIDIA]

Two types of background behind DL frameworks

1. Scalability-oriented
- Use cases in mind
  - Image/speech recognition systems
  - Fast DL as a service in the cloud
- Problem type
  - A few general applications
  - 10+ million training samples
  - 10+ node clusters with a fast network
- Possible bottlenecks
  - Tuning of well-known algorithms
  - Distributed computation for model/data-parallel training

2. Flexibility-oriented
- Use cases in mind
  - Algorithm research
  - R&D projects for new products
- Problem type
  - Various specific applications
  - 10+ k training samples
  - 1 node with multiple GPUs
- Possible bottlenecks
  - Trial and error in prototyping
  - Debugging, profiling & refactoring
  - (wait time during compilation)

Designed for efficient research & development

- Flexible: new kinds of complex models for various applications
- Intuitive: rapid prototyping and efficient trial and error
- Powerful: comparable performance with 1 node & multiple GPUs


[Figure: scalability-oriented vs. flexibility-oriented frameworks]

Agenda

- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications


Neural network and computation

[Figure: a feed-forward network with inputs x1…xN, hidden units h1…hH and k1…kM, and outputs y1…yM. Forward computation maps inputs (text, images, sensor data) to outputs such as "Object: tulip", "Anomaly score: 0.35", "Category: sports"; backward computation (backpropagation) flows in the reverse direction.]

Chainer focuses on network representation/training

- Design choices for deep learning frameworks
  - How to build neural networks?
  - How to train neural networks?
  - Which text format / language for modeling?
  - Which language for computing?
  - Run with a GPU?
  - Run on multiple GPUs?
  - Run on multiple compute nodes?


Building and training neural networks: Computational graph construction is the key

1. Construct a computational graph
   - Based on the network definition given by users
   - Chains of functions and operations on input variables
2. Compute the loss and gradients
   - Forward computation calculates the loss for a minibatch
   - Backpropagation gives gradients for all of the parameters
3. Optimize the model
   - Update each parameter with its gradient
   - Repeat until convergence

Step 1 is the most important, and there are many approaches to it (a minimal sketch of the whole loop follows below).
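As a rough illustration, the three steps map onto a Chainer-style training loop like the following minimal sketch (`model`, `optimizer`, and `minibatches` are placeholders here, not from the slides; the full, runnable version appears in the code samples later in this deck):

    for x, t in minibatches:       # step 3: repeat until convergence
        model.zerograds()          # clear gradients from the previous iteration
        loss = model(x, t)         # steps 1 & 2: the forward pass builds the graph and computes the loss
        loss.backward()            # step 2: backpropagation assigns a gradient to every parameter
        optimizer.update()         # step 3: update each parameter with its gradient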


Building blocks

- These functionalities are very similar across frameworks
- But the structure, abstraction level, and interfaces are different
- It comes down to the design of a domain-specific language for neural networks

Array data structure (vector/matrix/tensor)

Operations & functions

Network (computational graph)

Optimizer (SGD/AdaGrad/Adam)


Types of domain-specific language for neural networks

- Text DSL
  - Ex. Caffe (prototxt)
  - Ex. CNTK (NDL)
- Symbolic program: operations on symbols
  - Ex. Theano
  - Ex. TensorFlow
- Imperative program: direct computations on raw data arrays
  - Ex. Torch.nn
  - Ex. Chainer
- Ex. MXNet (spans both the symbolic and imperative styles)

    # Symbolic definition
    A = Variable('A')
    B = Variable('B')
    C = B * A
    D = C + Constant(1)
    # Compile
    f = compile(D)
    d = f(A=np.ones(10), B=np.ones(10)*2)

    # Imperative declaration
    a = np.ones(10)
    b = np.ones(10) * 2
    c = b * a
    d = c + 1

    %% Definition in text (f.txt)
    f: {"A": "Variable",
        "B": "Variable",
        "C": ["B", "*", "A"],
        "ret": ["C", "+", 1]}
    # Compile
    f = compile("f.txt")
    d = f(A=np.ones(10), B=np.ones(10)*2)

Comparison of DSL types

- Text DSL
  - Pros: human-readable definitions; non-programmers can easily edit the network
  - Cons: users must study the format; the format might have to be extended for new algorithms
- Internal DSL, symbolic
  - Pros: static analysis at compile time; optimization before training; easy to parallelize
  - Cons: users must study a special syntax; may need more effort to implement new algorithms
- Internal DSL, imperative
  - Pros: less effort to learn the syntax; easy debugging and profiling; suitable for new algorithms with complex logic
  - Cons: hard to optimize in advance; less efficient in memory allocation and parallelization

Chainer is at the extreme end of the imperative style, for high flexibility.


Agenda

- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications


Chainer as an open-source project

- https://github.com/pfnet/chainer
- 50 contributors
- 1,277 stars & 255 forks
- 3,708 commits
- Active development & releases over the last 10 months: v1.0.0 (June 2015) to v1.7.2 (March 2016)


Original developer: Seiya Tokui

Chainer software stack

[Figure: software stack. Chainer sits on top of NumPy and CuPy; NumPy runs on BLAS on the CPU, while CuPy runs on CUDA and cuDNN on NVIDIA GPUs.]

- Chainer is built on top of NumPy and CUDA
- CuPy is also introduced as an equivalent of NumPy on the GPU

Graph build scheme (1/2) - Define-and-Run: most frameworks use this scheme (Chainer does not)

- Define: build a computational graph based on the network definition
- Run: update the model (parameters) using the training dataset

[Figure: in the Define phase, the network definition is turned by auto-differentiation into a computational graph with gradient functions and parameters; in the Run phase, training data flow through that fixed graph to produce the loss and gradients, which update the parameters.]

Graph build scheme (2/2) - Define-by-Run: computational graph construction on the fly

- No graph is constructed before training
- Instead, the graph is built at each forward computation
- The computational graph can be modified dynamically for each iteration/sample, or depending on some conditions

[Figure: the model definition, training data, and runtime conditions drive each forward pass; the computational graph, gradient functions, and parameters are built on the fly, change dynamically with the conditions, and the parameters are updated after each pass.]
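As a minimal sketch of what "dynamic change" means in practice (the layer names and the random condition below are illustrative, not from the slides), the topology of the graph can be decided by ordinary Python control flow at every forward pass:

    import numpy
    import chainer.functions as F
    import chainer.links as L
    from chainer import Chain

    class DynamicMLP(Chain):
        def __init__(self, n_units):
            super(DynamicMLP, self).__init__(
                l1=L.Linear(784, n_units),
                l2=L.Linear(n_units, n_units),
                l3=L.Linear(n_units, 10),
            )

        def __call__(self, x):
            h = F.relu(self.l1(x))
            if numpy.random.rand() < 0.5:   # a runtime condition decides the topology
                h = F.relu(self.l2(h))      # this layer is part of the graph only on some iterations
            return self.l3(h)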


Define-by-Run example: MLP for MNIST

- Only the transformations between units are set up before training
- The connections are given as the forward computation

    l1 = Linear(784, n_units)
    l2 = Linear(n_units, 10)

    def forward(x):
        h1 = ReLU(l1(x))
        return l2(h1)

[Figure: x → Linear l1 (W, bias) → ReLU → h1 → Linear l2 (W, bias) → y, classifying a 784-pixel MNIST digit image]


Define-by-Run: An interpreted language for neural networks

- Idea
  - Forward computation actually goes through the computational graph
  - By remembering this history, the actual graph can be obtained
- Advantages
  - Flexibility for new algorithms with complex components
    - Ex. recurrent, recursive, attention, memory, adversarial networks, etc.
  - Intuitive coding with a highly imperative network definition
    - Ex. stochastic networks whose graph changes at each iteration
- Current drawbacks
  - The graph is regenerated every time, even for fixed networks
  - No optimization, even for the static parts of graphs
    - JIT-like analysis and subgraph caching might be useful

Basic components (1/2): Variable and Function

- Variable
  - A Variable wraps arrays (.data)
  - It remembers its parent function (.creator)
  - It will be assigned a gradient (.grad)
  - It keeps track of not only data but also computations
- Function
  - A transformation between Variables
  - Stateless
  - e.g. sigmoid, tanh, ReLU, max pooling, dropout

[Figure: Variables x and y connected through a Function node; in the MLP example, x, h1, and y are all Variables]
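A minimal sketch of these attributes in action (the concrete array values are arbitrary):

    import numpy as np
    from chainer import Variable
    import chainer.functions as F

    x = Variable(np.array([[0.5, -1.2]], dtype=np.float32))
    y = F.sigmoid(x)               # applying a Function records the history
    print(y.data)                  # the wrapped array
    print(y.creator)               # the Sigmoid function that produced y
    y.grad = np.ones_like(y.data)  # seed the output gradient
    y.backward()                   # backprop through the recorded graph
    print(x.grad)                  # gradient assigned to the input Variable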


Basic components (2/2): Link and Chain

- Link = a function with state
  - Its parameters are also Variables, and gradients will be assigned to them
  - e.g. Linear (fully connected), LSTM, Convolution2D, word embedding
- Chain = a network
  - A Chain has a set of child Links
  - Forward computation is defined in .__call__()
  - e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq, ...

[Figure: a Linear Link computes y = f(W*x + b) with parameters W and b; the Chain (MLP2) composes Linear l1, ReLU, and Linear l2 into x → h1 → y]
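A minimal sketch of a Link holding its parameters as Variables (the layer sizes are arbitrary):

    import numpy as np
    import chainer.links as L
    from chainer import Variable

    l = L.Linear(3, 2)       # a Link with state: parameters W and b
    print(l.W.data.shape)    # (2, 3) -- the weight is itself a Variable
    print(l.b.data.shape)    # (2,)   -- so is the bias
    y = l(Variable(np.zeros((1, 3), dtype=np.float32)))  # calling the Link applies its function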


Backpropagation through the computational graph

- Consider an objective (Link.Linear): L = f(x * W + b)
- This computes the value of L in the forward computation, and simultaneously builds the following computational graph
- The gradient of L can be computed with respect to any variable by backpropagation
- Then the optimizer updates the values of the parameters

[Figure: x and W feed a "*" node, its output and b feed a "+" node, and that output feeds f, producing L; the legend distinguishes Variable nodes from Function nodes]
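A minimal sketch of this flow, using a Linear link and a mean-squared-error objective as the concrete f (these particular functions and sizes are illustrative):

    import numpy as np
    import chainer.functions as F
    import chainer.links as L
    from chainer import Variable

    f = L.Linear(3, 1)                                    # holds the parameters W and b
    x = Variable(np.random.randn(2, 3).astype(np.float32))
    t = Variable(np.zeros((2, 1), dtype=np.float32))
    loss = F.mean_squared_error(f(x), t)                  # forward pass also builds the graph
    f.zerograds()
    loss.backward()                                       # backpropagation
    print(f.W.grad, f.b.grad)                             # gradients of the loss w.r.t. the parameters
                                                          # (an optimizer would now update W and b)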


Code sample (1/4): Multi-layer perceptron

    class MLP2(Chain):
        def __init__(self):
            super(MLP2, self).__init__(
                l1=L.Linear(784, 100),
                l2=L.Linear(100, 10),
            )

        def __call__(self, x):
            h1 = F.relu(self.l1(x))
            y = self.l2(h1)
            return y

    class Classifier(Chain):
        def __init__(self, predictor):
            super(Classifier, self).__init__(predictor=predictor)

        def __call__(self, x, t):
            y = self.predictor(x)
            self.accuracy = F.accuracy(y, t)
            self.loss = F.softmax_cross_entropy(y, t)
            return self.loss, self.accuracy

    # Model and optimizer setup
    model = Classifier(MLP2())
    optimizer = optimizers.SGD()
    optimizer.setup(model)

    # Training loop with minibatches
    for i in range(0, datasize, batchsize):
        x = Variable(x_tr[i:i+batchsize])
        t = Variable(y_tr[i:i+batchsize])
        model.zerograds()
        loss, acc = model(x, t)
        loss.backward()
        optimizer.update()

[Figure: the MLP2 chain, as in the previous slide]

Code sample (2/4): Convolutional neural network

    class AlexNet(Chain):
        def __init__(self):
            super(AlexNet, self).__init__(
                conv1=L.Convolution2D(3, 96, 11, stride=4),
                conv2=L.Convolution2D(96, 256, 5, pad=2),
                conv3=L.Convolution2D(256, 384, 3, pad=1),
                conv4=L.Convolution2D(384, 384, 3, pad=1),
                conv5=L.Convolution2D(384, 256, 3, pad=1),
                fc6=L.Linear(9216, 4096),
                fc7=L.Linear(4096, 4096),
                fc8=L.Linear(4096, 1000),
            )

        def __call__(self, x, t):
            h = F.max_pooling_2d(F.relu(
                F.local_response_normalization(self.conv1(x))), 3, stride=2)
            h = F.max_pooling_2d(F.relu(
                F.local_response_normalization(self.conv2(h))), 3, stride=2)
            h = F.relu(self.conv3(h))
            h = F.relu(self.conv4(h))
            h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
            h = F.dropout(F.relu(self.fc6(h)), train=self.train)
            h = F.dropout(F.relu(self.fc7(h)), train=self.train)
            y = self.fc8(h)
            return y

* ImageNet Classification with Deep Convolutional Neural Networks http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

[Figure: AlexNet as a stack of five conv2d layers followed by three linear layers]

Code sample (3/4): Recurrent neural network

    class SimpleRNN(Chain):
        def __init__(self, n_vocab, n_units):
            super(SimpleRNN, self).__init__(
                embed=L.EmbedID(n_vocab, n_units),
                x2h=L.Linear(n_units, n_units),
                h2h=L.Linear(n_units, n_units),
                h2y=L.Linear(n_units, n_vocab),
            )
            self.h = None

        def __call__(self, x):
            y, h_new = self.fwd_one_step(x, self.h)
            self.h = h_new
            return y

        def fwd_one_step(self, x, h):
            x = F.tanh(self.embed(x))
            if h is None:
                h = F.tanh(self.x2h(x))
            else:
                h = F.tanh(self.x2h(x) + self.h2h(h))
            y = F.softmax(self.h2y(h))
            return y, h

[Figure: the network unrolled over time: each input word x_t updates the recurrent state h and produces an output y_t; BPTT is truncated at length 3]

    # Truncated BPTT (length=3)
    for i in range(0, datasize, batchsize):
        ...
        accum_loss += model(x, t)
        if i % bptt_length == 0:
            model.zerograds()
            accum_loss.backward()
            accum_loss.unchain_backward()
            optimizer.update()


Code sample (4/4): Deep Networks with Stochastic Depth (a paper published on arXiv, March 30, 2016)

- A variant of Residual Net that skips connections stochastically
  - Outperformed the original Residual Net (ImageNet 2015 winner, MSR)
  - Stochastic skip: H_l = ReLU(b_l * f_l(H_{l-1}) + id(H_{l-1})), where b_l is a Bernoulli variable with survival probability p_l

Taken from http://arxiv.org/abs/1603.09382v2 (G. Huang et al.)

    # Mock code in Chainer
    class StochasticResNet(Chain):

        def __init__(self, prob, size, ...):
            super(StochasticResNet, self).__init__(
                ## Define f[i] in the same way as for Residual Net
            )
            self.p = prob    # Survival probabilities
            self.size = size

        def __call__(self, h):
            for i in range(self.size):
                b = numpy.random.binomial(1, self.p[i])
                c = self.f[i](h) + h if b == 1 else h
                h = F.relu(c)
            return h


Miscellaneous

- Other features
  - Install with pip in one line:

        $ pip install chainer

  - Multi-GPU support by explicitly selecting the device ID to use (see the sketch below)
  - Pre-trained Caffe model import from the Model Zoo
  - Model serialization & save/load: HDF5 or NumPy npz
- Future directions (not only for Chainer)
  - JIT-like optimization during Define-by-Run
  - Reducing memory consumption (GPU memory is still small)
  - Handling variable-length inputs without minibatches
  - Maximizing performance in multi-node & multi-GPU environments
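A minimal sketch of the GPU-selection and serialization features listed above (`model` is assumed to be an existing Chain; the file path and device ID are illustrative):

    from chainer import cuda, serializers

    cuda.get_device(0).use()                  # explicitly select the GPU by ID
    model.to_gpu()                            # move the model's parameters onto that GPU

    serializers.save_npz('model.npz', model)  # save trained parameters (NumPy npz)
    serializers.load_npz('model.npz', model)  # ...and load them back
                                              # (HDF5 works the same way via save_hdf5 / load_hdf5)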


Agenda

- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications


CuPy: (partially-)NumPy-compatible GPU library

- Motivation: NumPy + CUDA = CuPy
  - NumPy is the standard library for numerical computation in Python
  - CUDA is the standard API for using GPUs for high performance
  - Unfortunately, NumPy does NOT work with CUDA
- CuPy supports:
  - Fast computation using NVIDIA's cuBLAS and cuDNN
  - Array indexing, slicing, transpose, and reshape
  - Most of the operations/functions in NumPy
    - Chainer v1.7.2 already supports more than 170 functions
  - User-defined functions and kernels
  - All dtypes, broadcasting, memory pools, etc.

How to use CuPy

- Usage of CuPy: just replace NumPy with CuPy

    import numpy, cupy
    enable_cupy = True
    xp = cupy if enable_cupy else numpy

- Conversion between numpy.ndarray and cupy.ndarray

    w_c = cupy.asarray(numpy.ones(10))  # cupy.ndarray
    w_n = cupy.asnumpy(cupy.ones(10))   # numpy.ndarray

- Ex. a CPU/GPU-agnostic logsumexp function

    def logsumexp(x, axis=None):
        xp = cuda.get_array_module(x)  # Get CuPy or NumPy
        x_max = x.max(axis)
        exp_sum = xp.exp(x - x_max).sum(axis)
        return x_max + xp.log(exp_sum)
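A usage sketch of the CPU/GPU-agnostic function above (assuming logsumexp is defined as shown and a GPU is available):

    import numpy, cupy

    x_cpu = numpy.random.randn(1000).astype(numpy.float32)
    x_gpu = cupy.asarray(x_cpu)

    print(logsumexp(x_cpu))   # dispatches to NumPy on the CPU
    print(logsumexp(x_gpu))   # the same code dispatches to CuPy on the GPU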


CuPy implementation: Optimized for performance & NumPy-compatibility

- Cython is used for cupy.core and cupy.cuda
- Dynamic code generation & compilation
  - CUDA code is generated for the specific tensor dimensions & data types
  - On-the-fly compilation by nvcc, with a binary cache (faster after the first use)

[Figure: CuPy layers. The cupy package provides tensor operations & functions; cupy.core provides ndarray, ufunc, elementwise, and reduction kernels; cupy.cuda is the CUDA Python wrapper over the CUDA libraries (cuBLAS, cuRAND, cuDNN).]


CuPy performance on linear algebra: 5 to 25 times faster than NumPy

    def test(xp):
        a = xp.arange(1000000).reshape(1000, -1)
        return a.T * 2

    test(numpy)                    # warm-up
    t1 = datetime.datetime.now()
    for i in range(1000):
        test(numpy)
    t2 = datetime.datetime.now()
    print(t2 - t1)

    test(cupy)                     # warm-up
    t1 = datetime.datetime.now()
    for i in range(1000):
        test(cupy)
    t2 = datetime.datetime.now()
    print(t2 - t1)

    Library              msec    speedup
    NumPy                2,929    1.0x
    CuPy                   585    5.0x
    CuPy + memory pool     123   23.8x

(Intel Core i7-4790 @ 3.60 GHz, 32 GB RAM, GeForce GTX 970)


Use CuPy for GPU-based computation

- Three patterns are supported as wrappers
  - ElementwiseKernel: for element-wise computation
  - ReductionKernel: for reduce operations along an axis
  - ufunc: universal functions as in NumPy
- Ex. definition of an element-wise function:

    squared_diff = cupy.ElementwiseKernel(
        'float32 x, float32 y',   # Input
        'float32 z',              # Output
        'z = (x - y) * (x - y)',  # Operation
        'squared_diff')           # Name

- Usage (automatic broadcasting and type checking are supported):

    squared_diff(cupy.arange(10, dtype='float32'), 10)
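The slide shows only an ElementwiseKernel; a ReductionKernel follows the same pattern. The sketch below (an L2 norm along an axis, modeled on the CuPy documentation) is illustrative and not from the slides:

    import cupy

    l2norm = cupy.ReductionKernel(
        'T x',          # input params
        'T y',          # output params
        'x * x',        # map each element
        'a + b',        # reduce two mapped values
        'y = sqrt(a)',  # post-reduction transform
        '0',            # identity of the reduction
        'l2norm')       # kernel name

    x = cupy.arange(10, dtype=cupy.float32).reshape(2, 5)
    print(l2norm(x, axis=1))   # per-row L2 norms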


Agenda

- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications


Public benchmark results (CNN): Chainer shows comparable performance

- Forward computation time is almost the same as TensorFlow
- Training with backward computation is slower, but this can be offset by having no compilation time while debugging/tuning

[Figure: forward and backward computation time in msec for AlexNet, GoogLeNet, VGG-A, and OverFeat, comparing Torch, TensorFlow, Chainer, and Caffe (native)]

Taken from https://github.com/soumith/convnet-benchmarks, using cuDNN except for Caffe (native)

Chainer can benefit from the latest CUDA libraries: Ex. the Winograd algorithm in cuDNN v5

- 3x3 convolutions are common in CNNs and are now computed with the Winograd algorithm
- State-of-the-art CNN models (e.g., GoogLeNet, VGG-A) can be accelerated by up to 2.0x at test time (forward only)

[Figure: forward and backward computation time in msec for AlexNet, GoogLeNet, VGG-A, and OverFeat with cuDNN v4 vs. cuDNN v5]

Independently measured with a modified version of soumith/convnet-benchmarks; cuDNN v5 can be used from Chainer v1.8.0

Algorithm implementation in Chainer: A Neural Algorithm of Artistic Style (Gatys et al., 2015)

- https://github.com/mattya/chainer-gogh

[Figure: content image (a cat) + style image = new artistic image]

Main code: 45 lines

Chainer in industry: used in demonstrations & being commercialized

- Many collaborations are ongoing with Chainer-based computer vision, deep reinforcement learning, etc.
- Ex. 1: Chainer-controlled toy cars in the Toyota booth at CES 2016 (http://tinyurl.com/pfn-ces16)
- Ex. 2: FANUC's highly accurate bin-picking robot at IREX 2015 (http://tinyurl.com/pfn-irex15)
  - 8 hours of training to reach expert level; commercialization by the end of 2016


Summary

- Chainer is a Python-based deep learning framework with a dynamic network construction scheme and CuPy
- It is designed for efficient research and prototyping, while keeping comparable performance thanks to NVIDIA GPUs
- Official web: http://chainer.org/
- GitHub: https://github.com/pfnet/chainer

Your contributions will be appreciated & we are hiring!

