
Page 1: Effectively Scaling out Deep Learning Frameworks with GPUs

Hars Vardhan, on behalf of Dr. Steven Eliuk

Computer Science Innovation Center, Samsung Research America

February 5th, 2017

Effectively Scaling out Deep Learning Frameworks with GPUs

Page 2: Effectively Scaling out Deep Learning Frameworks with GPUs

Outline

i. Why Scale?
   i. Motivation
ii. Current Open-Source Offerings
   i. Our sentiments
iii. Bottlenecks
   i. Matrix Multiplication
   ii. Data Loading & Augmentation
iv. Tips and Tricks
v. Real-World Results
   i. Strong & weak scaling, training time
vi. HPC Environment
   i. What makes for an effective environment?
vii. Q&A

Page 3: Effectively Scaling out Deep Learning Frameworks with GPUs

Deep Learning @ Samsung:

Deep Learning: utilized in an ever-growing number of projects throughout Samsung, from s/w to h/w.

[Diagram: the SRA DL Platform (large scale, distributed & multi-GPU, model parallelism) connects training data to projects such as speech translation, AIR, ASR, handwriting recognition, autonomous vehicles, object & speech recognition, machine translation & search, ad targeting, and scene understanding.]

Page 4: Effectively Scaling out Deep Learning Frameworks with GPUs

Long Training Times Impact Time to Market

Effect of Experiment Time

Minutes, Hours
• Interactive investigation! Instant gratification!
• Parameter exploration

1-4 Days
• Tolerable
• Interactivity replaced by parallelization of experiments

1-4 Weeks
• High-value experiments only
• Progress stalls

> 1 Month
• Don't even try

Keynote at 2015 NVIDIA GTC Conference, Jeff Dean

Page 5: Effectively Scaling out Deep Learning Frameworks with GPUs

Current Open-Source Offerings:

[Chart: framework support for single-GPU, multi-GPU, and distributed training in the past]

Page 6: Effectively Scaling out Deep Learning Frameworks with GPUs

Current Open-Source Offerings:

[Chart: framework support for single-GPU, multi-GPU, and distributed training now]

Page 7: Effectively Scaling out Deep Learning Frameworks with GPUs

Deep Learning Platform: Framework

Current Approaches: Applications (Caffe, Theano, CNTK, TF, MxNet) built directly on CUDA, cuBLAS, cuDNN, LAPACK, FFT, and general-purpose math, with each application bound to the GPUs of a single node (GPU #1 … GPU #16 on Node 1 … Node N). Does not "scale out" effectively; accuracy can degrade; small models; tractability issues.

Samsung – SRA Approach: Frameworks (DNN – Kaldi ASR, Caffe AIR, Theano, Torch, etc.) built on dMath, which layers over CUDA, cuBLAS, cuDNN, LAPACK, FFT, and general-purpose math and spans all GPUs across nodes (GPU #1 … GPU #16 on Node 1 … Node N). "Scales out" effectively; comprehensive library developed; larger models – frontier research; extended feature set.

Page 8: Effectively Scaling out Deep Learning Frameworks with GPUs

dMath: A distributed math library

• Abstracts away complicated distributed & multi-GPU programming from the high-level machine learning user
• Uses an MPI-based client-server computing paradigm (a minimal sketch of the dispatch pattern follows this list)
• Has in-device (GPU) data storage
• Customized routines to exploit PCI switching and the underlying h/w
• Full support for the DNN pipeline
• Integration w/ existing open-source frameworks
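
As an illustration only, here is a minimal sketch of the MPI client-server dispatch pattern described above. The job codes and loop structure below are hypothetical and are not dMath's actual protocol; the point is that only small job descriptions cross the network, while the matrix blocks stay resident in GPU memory between jobs.

  // Minimal sketch of an MPI client-server dispatch loop (illustrative, not dMath's code).
  // Rank 0 acts as the client/master; every other rank is a GPU worker waiting for jobs.
  #include <mpi.h>

  enum JobCode { JOB_GEMM = 1, JOB_SHUTDOWN = 2 };   // hypothetical job identifiers

  int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          // Master: broadcast a job code (a real system would also broadcast block metadata).
          int job = JOB_GEMM;
          MPI_Bcast(&job, 1, MPI_INT, 0, MPI_COMM_WORLD);
          job = JOB_SHUTDOWN;
          MPI_Bcast(&job, 1, MPI_INT, 0, MPI_COMM_WORLD);
      } else {
          // Worker: loop on incoming jobs; data stays in device memory between jobs.
          for (;;) {
              int job;
              MPI_Bcast(&job, 1, MPI_INT, 0, MPI_COMM_WORLD);
              if (job == JOB_SHUTDOWN) break;
              // ... launch the local GPU work for this job ...
          }
      }
      MPI_Finalize();
      return 0;
  }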

Page 9: Effectively Scaling out Deep Learning Frameworks with GPUs

Data Layout

• Matrix data is divided into multiple blocks and distributed across the workers
• [Figure, panels (a)-(d): four examples of distributing the same matrix across four workers; the numbers represent the worker that stores each block]
• Block size and location are important for efficiency (one common layout is sketched below)
• Most algorithms require consistent block sizes; some require a consistent layout on the workers
• Arbitrary layouts are supported in dMath
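
To make the block size and location point concrete, below is a minimal sketch of one common layout, a 2D block-cyclic distribution; the names are hypothetical, and dMath itself supports arbitrary layouts beyond this one.

  // Sketch: which worker owns the block containing element (i, j) under a 2D block-cyclic layout.
  struct Layout {
      int block_rows, block_cols;   // block dimensions in elements
      int grid_rows, grid_cols;     // process grid; grid_rows * grid_cols == number of workers
  };

  int owner_of(const Layout& L, long i, long j) {
      long bi = i / L.block_rows;                      // block row index
      long bj = j / L.block_cols;                      // block column index
      int pr = static_cast<int>(bi % L.grid_rows);     // process row (cyclic)
      int pc = static_cast<int>(bj % L.grid_cols);     // process column (cyclic)
      return pr * L.grid_cols + pc;                    // row-major rank in the process grid
  }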

Page 10: Effectively Scaling out Deep Learning Frameworks with GPUs

Data Layout and Algorithm Efficiency

[Figure: the same matrix additions and assignments under different block layouts, annotated as follows]
• Inefficient: off-diagonal blocks require communication between GPUs
• Efficient: no communication between GPUs
• Efficient: no communication between GPUs
• Less efficient, due to memory access patterns
• Very inefficient: requires temporary buffers and communication between GPUs

Page 11: Effectively Scaling out Deep Learning Frameworks with GPUs

Matrix Multiplication

• One of the most important algorithms in DL
• Probably the most difficult to optimize
• Most implementations are built on cuBLAS routines for the individual blocks
• For the sizes typical in deep learning applications, communication between GPUs and servers is the bottleneck
• Multiple implementations are needed for different circumstances
• We have over 15 different distributed GEMM implementations that are used for various types of matrix dimensions and h/w configurations (one simple variant is sketched below)
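
For illustration, the sketch below shows one simple distributed GEMM variant, with the inner (k) dimension split across workers: each GPU runs a local cuBLAS GEMM on its blocks and the partial products are summed with a single allreduce. It assumes CUDA-aware MPI for the device pointers and is not one of dMath's implementations.

  // One simple distributed GEMM variant (illustrative): A partitioned by columns, B by rows,
  // so C = sum over workers of A_local * B_local. Assumes CUDA-aware MPI.
  #include <mpi.h>
  #include <cuda_runtime.h>
  #include <cublas_v2.h>

  void distributed_gemm(cublasHandle_t handle,
                        const float* dA_local,  // m x k_local block, device memory
                        const float* dB_local,  // k_local x n block, device memory
                        float* dC_partial,      // m x n scratch, device memory
                        float* dC,              // m x n result, device memory
                        int m, int n, int k_local) {
      const float one = 1.0f, zero = 0.0f;
      // Local partial product on this worker's GPU (column-major).
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k_local,
                  &one, dA_local, m, dB_local, k_local, &zero, dC_partial, m);
      cudaDeviceSynchronize();   // make sure the local GEMM finished before the reduction
      // Single reduction sums the partial results across all workers.
      MPI_Allreduce(dC_partial, dC, m * n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
  }

Which variant wins depends on the matrix shapes and the interconnect, which is exactly why many implementations are needed.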

Page 12: Effectively Scaling out Deep Learning Frameworks with GPUs

Matrix Multiplication Scaling

• Smaller blocks don't scale, due to communication overhead and cuBLAS GEMM efficiency
• As matrices get larger, communication becomes less of an issue compared to computation, and scaling is less difficult
• Shown below is the relative performance for the same multiplication, distributed over a different number of GPUs within our framework

[Chart: "Badness" (relative performance, scale 0-15) of the same multiplication on 1, 2, 4, and 8 GPUs]

Page 13: Effectively Scaling out Deep Learning Frameworks with GPUs

Matrix Multiplication Performance

• Shown below are runtimes for square matrix multiplications of various sizes
• dMath performance on a variety of GPUs is compared with a cuBLAS 7.5 baseline and cuBLAS-XT

[Table not reproduced; * Tesla K80 result]

Page 14: Effectively Scaling out Deep Learning Frameworks with GPUs

Replication Motivation

• Often desirable for workers to have identical copies of a matrix (abundant memory and rarely changing data)
• Useful for parameters in a network
• Equivalent to the distribution of parameters in a parameter server
• Caching (using replication) of a distributed weight matrix: the forward pass caches it, so the backward pass requires no communication

Page 15: Effectively Scaling out Deep Learning Frameworks with GPUs

Replication Motivation (AlexNet Example)

[Figure: AlexNet, with its three fully connected layers highlighted]

• The matrix multiplications in the three fully connected layers shown above require the weight matrix to be shared between the workers in the forward pass
• We cache this matrix, allowing the backward data gradient to be computed without any communication between GPUs (a sketch of the idea follows)
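
Below is a minimal sketch of the caching idea for one fully connected layer; the function names are hypothetical and CUDA-aware MPI is assumed, but it shows why the backward data gradient needs no communication once the weight matrix W has been replicated.

  // Illustrative weight caching for a fully connected layer (not dMath's API).
  #include <mpi.h>
  #include <cublas_v2.h>

  // Forward: refresh the local replica of W only when it changed (the "cache fill"),
  // then compute Y = X * W locally (column-major).
  void fc_forward_cached(cublasHandle_t h, const float* dX, float* dW_cache, bool w_changed,
                         float* dY, int batch, int in_dim, int out_dim, int root) {
      if (w_changed)
          MPI_Bcast(dW_cache, in_dim * out_dim, MPI_FLOAT, root, MPI_COMM_WORLD);
      const float one = 1.0f, zero = 0.0f;
      cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, batch, out_dim, in_dim,
                  &one, dX, batch, dW_cache, in_dim, &zero, dY, batch);
  }

  // Backward data gradient: dX = dY * W^T, computed entirely from the cached replica,
  // so no inter-GPU communication is needed here.
  void fc_backward_data(cublasHandle_t h, const float* dY_grad, const float* dW_cache,
                        float* dX_grad, int batch, int in_dim, int out_dim) {
      const float one = 1.0f, zero = 0.0f;
      cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_T, batch, in_dim, out_dim,
                  &one, dY_grad, batch, dW_cache, in_dim, &zero, dX_grad, batch);
  }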

Page 16: Effectively Scaling out Deep Learning Frameworks with GPUs

Replication Implementation

• Each worker stores an entire copy of the matrix, in addition to the blocks it is responsible for
• When a matrix is changed, the workers automatically redistribute the matrix
• Replication can be temporarily suspended while multiple updates are made to a matrix, then resumed (see the bookkeeping sketch below)
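
The suspend/resume behaviour can be pictured with the small bookkeeping sketch below. The class and method names are hypothetical, not dMath's API, and the replica is simply broadcast in full for brevity: while suspended, updates only mark the replica dirty, and a single redistribution happens on resume.

  // Hypothetical bookkeeping for suspendable replication (illustrative only).
  #include <mpi.h>
  #include <vector>

  class ReplicatedMatrix {
  public:
      explicit ReplicatedMatrix(size_t n) : data_(n) {}

      void suspend_replication() { suspended_ = true; }

      void resume_replication(int root) {
          suspended_ = false;
          if (dirty_) replicate(root);        // one redistribution for all batched updates
      }

      void update(const std::vector<float>& delta, int root) {
          for (size_t i = 0; i < data_.size(); ++i) data_[i] += delta[i];
          dirty_ = true;
          if (!suspended_) replicate(root);   // immediate redistribution when not suspended
      }

  private:
      void replicate(int root) {
          MPI_Bcast(data_.data(), static_cast<int>(data_.size()), MPI_FLOAT, root, MPI_COMM_WORLD);
          dirty_ = false;
      }
      std::vector<float> data_;
      bool suspended_ = false, dirty_ = false;
  };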

Page 17: Effectively Scaling out Deep Learning Frameworks with GPUs

Asynchronous Replication

• In SGD training, parameters are updated all at once and then replicated
• The updated parameters are often not needed until much deeper into the network
• Replication is frequently followed by forward convolution layers (high computation, zero communication)
• Replace synchronous replication with asynchronous replication, overlapping parameter redistribution with computation (sketched below)
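
A minimal sketch of this overlap with a non-blocking MPI broadcast is given below; the function and parameter names are hypothetical, but the structure (start the redistribution, run the communication-free convolution layers, and only then wait for the weights) is the idea described above.

  // Overlap parameter redistribution with forward convolution work (illustrative).
  #include <mpi.h>

  void forward_with_async_replication(float* fc_weights, int weight_count, int root) {
      MPI_Request req;
      // Start redistributing the updated FC parameters without blocking.
      MPI_Ibcast(fc_weights, weight_count, MPI_FLOAT, root, MPI_COMM_WORLD, &req);

      // ... run the forward convolution layers here: high computation, zero communication ...

      // The fully connected layers are the first to need the weights; finish the broadcast now.
      MPI_Wait(&req, MPI_STATUS_IGNORE);

      // ... run the fully connected layers using fc_weights ...
  }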

Page 18: Effectively Scaling out Deep Learning Frameworks with GPUs

Tips and Tricks

• Any serial portion of the program will hurt scaling (Amdahl's law)
• Dispatching jobs to the workers has a small overhead (dozens of microseconds)
• Additional time is spent broadcasting the metadata and setting up the data structures required to perform a distributed job
• For large jobs this overhead is not a problem
• Scaling to more GPUs without increasing the batch size (strong scaling) presents significant issues (e.g. the small convolutions in GoogLeNet)
• Can be alleviated by avoiding lazy solutions, for example fusing the momentum update into a single job:

  Before:
    history_data->scale(momentum);
    history_data->addMatrix(local_rate, *param_diff);

  After:
    history_data->addMatrix(momentum, *history_data, local_rate, *param_diff);

Page 19: Effectively Scaling out Deep Learning Frameworks with GPUs

Tips and Tricks

• Combine common operations together into a single job
• Backward convolutions involve computing the data, filter, and bias gradients; these can all be done at once
• Allows for multiple CUDA streams to be used
• Overlapping the filter and bias gradient reductions with the data gradient computation helps optimize performance
• Avoid multiple MPI calls when possible; instead copy the data into a single buffer and do a single call, e.g. the filter and bias gradient reductions (sketched after this list)
• Implement batched versions of jobs (e.g. the matrix addition used in the parameter update)
• For the NVIDIA Tesla K80, we have found much better performance by locking the clocks at 758 MHz and disabling auto-boost, e.g. ~5-10% for multi-GPU jobs
• Registering memory with the IB driver is costly; try to reuse your buffers to prevent this costly registration, i.e. use a memory manager of some sort
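
The single-buffer reduction tip can be sketched as follows; this is an illustrative simplification (gradients are shown on the host, and the packed buffer is allocated per call, whereas in practice it would be a reused, pre-registered buffer for exactly the IB-registration reason noted above).

  // Pack the filter and bias gradients so their reduction is one MPI call instead of two.
  #include <mpi.h>
  #include <cstring>
  #include <vector>

  void reduce_filter_and_bias(float* filter_grad, int filter_count,
                              float* bias_grad, int bias_count) {
      std::vector<float> packed(filter_count + bias_count);
      std::memcpy(packed.data(), filter_grad, filter_count * sizeof(float));
      std::memcpy(packed.data() + filter_count, bias_grad, bias_count * sizeof(float));

      // One allreduce covers both gradients.
      MPI_Allreduce(MPI_IN_PLACE, packed.data(), filter_count + bias_count,
                    MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

      std::memcpy(filter_grad, packed.data(), filter_count * sizeof(float));
      std::memcpy(bias_grad, packed.data() + filter_count, bias_count * sizeof(float));
  }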

Page 20: Effectively Scaling out Deep Learning Frameworks with GPUs

Real World: soumith/convnet-benchmarks -- 128 Batch, i.e. Strong Scaling

• Expresso: 121.8 ms average on one Titan
• Expresso: 64.5 ms average on two Titans
• Expresso: 48.6 ms average on four Titans
• Expresso: 39.1 ms average on eight Titans

Page 21: Effectively Scaling out Deep Learning Frameworks with GPUs

Real World: Training AlexNet -- 256 Batch, i.e. Strong Scaling

[Chart: throughput in FPS (0-2500) for Expresso, CNTK, Caffe, and NVIDIA-Caffe (0.14) at 1, 4, 8, 16, 32, and 64 GPUs]

Page 22: Effectively Scaling out Deep Learning Frameworks with GPUs

Real World: Inception v3, 32 Batch per GPU

[Chart: throughput in FPS (0-1400) for Expresso and TensorFlow on Inception v3 at 2, 4, 8, 16, 32, and 64 GPUs]

Page 23: Effectively Scaling out Deep Learning Frameworks with GPUs

Real World: Training Time

AlexNet, 1024 batch:
• Caffe -- 8 K80s -- 17h 20min
• NV-Caffe 0.14 -- 8 K80s -- 12h 15min
• Expresso 0.8 -- 8 K80s -- 05h 15min

GoogLeNet v1, 1024 batch:
• Caffe -- 8 K80s -- 03d 20hrs
• Expresso -- 8 K80s -- 02d 12hrs
• Expresso -- 32 K80s -- 01d 06hrs

Scaling the batch from a single GPU to 64 GPUs results in the same accuracy, e.g. AlexNet 256: 58.59 ± 0.5% top-1.

Expresso results are for our version of hybrid parallelism [Krizhevsky14].

Page 24: Effectively Scaling out Deep Learning Frameworks with GPUs

Real World: Training Time – continued…

GoogLeNet v3 (aka Inception v3), Tesla K40 GPUs:
• TensorFlow -- 16, 32, 64 GPUs -- 507h 58min
• TensorFlow Internal – 100 GPUs -- 65h 00min
• Expresso v0.8 -- 64 GPUs -- 48h 22min

Tesla M40 GPUs:
• Expresso v0.8 -- 96 GPUs -- 13h 45min

ImageNet 2016 Scene Classification competition:
- Third place in the single-model category
- 7th place in the ensemble category

Page 25: Effectively Scaling out Deep Learning Frameworks with GPUs

Data Loading & Augmentation

• Data loading quickly becomes a bottleneck when scaling to multiple GPUs
• Not practical to load from a single process; queues, etc. don't work well
• An automatically tuned CPU/GPU multi-threaded data loader runs on each worker
• Periodically (roughly 0.1% of iterations) a check is done to determine whether the current policy is adequate… remembering multi-tenancy (a minimal sketch of such a check follows this list)
• No policy is good for all models, as the proportion of work differs between models
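
A periodic policy check of this kind might look roughly like the sketch below; the thresholds and the heuristic itself are hypothetical and only stand in for whatever the auto-tuner actually measures.

  // Hypothetical periodic check: grow or shrink the loader thread pool depending on how long
  // the training step had to wait for data, while remembering that cores are shared (multi-tenancy).
  struct LoaderPolicy {
      int num_threads;
  };

  void maybe_retune(LoaderPolicy& policy, long iteration,
                    double data_wait_ms, double compute_ms, int max_threads) {
      if (iteration % 1000 != 0) return;   // check only on ~0.1% of iterations
      if (data_wait_ms > 0.05 * compute_ms && policy.num_threads < max_threads)
          ++policy.num_threads;            // loading/augmentation is the bottleneck: add a thread
      else if (data_wait_ms < 0.01 * compute_ms && policy.num_threads > 1)
          --policy.num_threads;            // loading keeps up comfortably: give a core back
  }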

Page 26: Effectively Scaling out Deep Learning Frameworks with GPUs

Data Loading & Augmentation

• At runtime, the master process samples performance for different combinations of thread counts and transfer indices
• This allows the system to accelerate data augmentation when it is the bottleneck

Memory Management:
• Page-locked host allocations and device allocations cause implicit synchronization and are expensive
• Host & device buffers are allocated by each thread on start-up and only resized when a sample exceeds the current buffer size, so the memory footprint quickly stabilizes (see the sketch below)
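
A grow-only page-locked staging buffer of that kind can be sketched as below (illustrative, error handling omitted): because the buffer never shrinks, the expensive cudaHostAlloc calls, and the implicit synchronization they cause, stop happening once the largest sample has been seen.

  // One grow-only pinned host buffer per loader thread (illustrative).
  #include <cuda_runtime.h>

  struct PinnedBuffer {
      void*  ptr = nullptr;
      size_t capacity = 0;

      void* reserve(size_t bytes) {
          if (bytes > capacity) {                          // reallocate only when a sample is
              if (ptr) cudaFreeHost(ptr);                  // larger than anything seen so far
              cudaHostAlloc(&ptr, bytes, cudaHostAllocDefault);
              capacity = bytes;
          }
          return ptr;
      }
      ~PinnedBuffer() { if (ptr) cudaFreeHost(ptr); }
  };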

Page 27: Effectively Scaling out Deep Learning Frameworks with GPUs

Data Loading & Augmentation

[Chart: AlexNet performance on 8 M40s with single-threaded, multi-threaded, and auto-tuned data loading]

Page 28: Effectively Scaling out Deep Learning Frameworks with GPUs

Low-Precision Training & Inference

Motivations
• Lower data storage and communication
• Better performance with newer GPUs (Pascal)
• Proven to work well for inference – fast, accurate

Challenges
• Instability in certain layers during training
• Only partial support in the software stack (yet)

Addressing Accuracy Loss
• Performing certain operations in higher precision
  • Softmax, inner product, weight updates (sketched below)
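
As one concrete illustration of keeping an operation in higher precision, the sketch below performs the weight update against a float master copy while the working parameters stay in half precision; this is a generic mixed-precision pattern, not Expresso's actual kernel.

  // Weight update in fp32 against a master copy; storage and communication stay in fp16.
  #include <cuda_fp16.h>

  __global__ void update_weights_mixed(float* master_fp32,       // high-precision master copy
                                       __half* weights_fp16,     // low-precision working copy
                                       const __half* grad_fp16,
                                       float lr, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) {
          master_fp32[i] -= lr * __half2float(grad_fp16[i]);  // accumulate in fp32
          weights_fp16[i] = __float2half(master_fp32[i]);     // refresh the fp16 copy
      }
  }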

Page 29: Effectively Scaling out Deep Learning Frameworks with GPUs

H/W: Samsung Advanced Learning

Page 30: Effectively Scaling out Deep Learning Frameworks with GPUs

Conclusion

- dMath provides effective scaling for DL frameworks: Caffe, Kaldi
- DNN pipeline with data loader and an array of features
- Can also be used in other applications such as medical imaging, finite element analysis, etc.

Future Work

- Half-precision training
- Layer implementations for RNNs, etc.

Page 31: Effectively Scaling out Deep Learning Frameworks with GPUs

Acknowledgments

….

Page 32: Effectively Scaling out Deep Learning Frameworks with GPUs

Contact

[email protected]

Page 33: Effectively Scaling out Deep Learning Frameworks with GPUs

Q & A

?