Hars Vardhan, on behalf of Dr. Steven Eliuk
Computer Science Innovation Center, Samsung Research America
Feb 5th, 2017

Effectively Scaling out Deep Learning Frameworks with GPUs

Samsung Research America – Silicon Valley
Outline
i. Why Scale?
   - Motivation
ii. Current Open-Source Offerings
   - Our sentiments
iii. Bottlenecks
   - Matrix Multiplication
   - Data Loading & Augmentation
iv. Tips and Tricks
v. Real-World Results
   - Strong & weak scaling, training time
vi. HPC Environment
   - What makes for an effective environment?
vii. Q&A
Deep Learning @ Samsung:

Deep Learning: utilized in an ever-growing number of projects throughout Samsung, from s/w to h/w:
Speech Translation, AIR, ASR, Handwriting Recognition, Autonomous Vehicles, Object & Speech Recognition, Machine Translation & Search, Ad Targeting, Training Data, Scene Understanding.

SRA DL Platform: Model Parallelism, Large Scale, Distributed & multi-GPU.
Long Training Times Impact Time to Market

Effect of Experiment Time:

Minutes, Hours:
• Interactive investigation! Instant gratification!
• Parameter exploration
1-4 Days:
• Tolerable
• Interactivity replaced by parallelization of experiments
1-4 Weeks:
• High-value experiments only
• Progress stalls
>1 Month:
• Don't even try

(Keynote at the 2015 NVIDIA GTC Conference, Jeff Dean)
Current Open-Source Offerings:

[Figures, "Past" and "Now": open-source frameworks compared on Single-GPU, Multi-GPU, and Distributed support]
Deep Learning Platform: Framework

Current Approaches:
• Stack: Application (Caffe, Theano, CNTK, TF, MxNet) on CUDA/cuBLAS/cuDNN plus LAPACK, FFT, and general-purpose math, per node (Node 1 … Node N, each with GPU #1 … GPU #16)
• Does not "scale out" effectively
• Accuracy can degrade
• Small models
• Tractability issues

Samsung SRA Approach:
• Stack: Frameworks (DNN – Kaldi ASR, Caffe AIR, Theano, Torch, etc.) on dMath, which sits above CUDA/cuBLAS/cuDNN plus LAPACK, FFT, and general-purpose math, spanning all nodes (Node 1 … Node N, each with GPU #1 … GPU #16)
• "Scales out" effectively
• Comprehensive library developed
• Larger models – frontier research
• Extended feature set
dMath: A Distributed Math Library
• Abstracts away complicated distributed & multi-GPU programming from the high-level machine learning user
• Uses an MPI-based client-server computing paradigm
• Has in-device (GPU) data storage
• Customized routines to exploit PCIe switching and the underlying h/w
• Full support for the DNN pipeline
• Integration w/ existing open-source frameworks
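To make the client-server paradigm concrete, here is a minimal sketch; the Job struct, opcode names, and worker-side executor are invented for illustration and are not dMath's actual interface:

    #include <mpi.h>

    // Invented job descriptor: only an opcode plus a little metadata crosses the
    // wire at dispatch time; matrix data already lives in GPU memory on the workers.
    struct Job {
        int opcode;      // e.g. OP_GEMM, OP_AXPY (hypothetical)
        int args[8];     // matrix handles, dimensions, ...
    };

    void execute_on_local_gpu(const Job& j);   // hypothetical worker-side executor

    void dispatch(const Job& job, MPI_Comm comm) {
        // The client (rank 0) broadcasts the descriptor; this small metadata
        // broadcast is the dispatch overhead discussed in Tips and Tricks below.
        Job j = job;
        MPI_Bcast(&j, sizeof(Job), MPI_BYTE, 0, comm);
        int rank;
        MPI_Comm_rank(comm, &rank);
        if (rank != 0)
            execute_on_local_gpu(j);   // data is already resident on the device
    }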
Data Layout

[Figure: panels (a)-(d) show four examples of distributing the same matrix across four workers; the numbers indicate which worker stores each block]
• Matrix data is divided into multiple blocks and distributed across the workers
• Block size and location are important for efficiency
• Most algorithms require consistent block sizes; some require a consistent layout on the workers
• Arbitrary layouts are supported in dMath
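As an illustration of mapping blocks to workers, here is a minimal sketch of one possible (hypothetical) 2D block-cyclic layout; dMath itself supports arbitrary layouts, so this is only one point in the design space:

    #include <cstdint>

    // Hypothetical 2D block-cyclic layout over a grid of workers, for illustration only.
    struct BlockLayout {
        int block_rows, block_cols;   // block size in elements
        int grid_rows, grid_cols;     // worker grid, e.g. 2x2 for four workers

        // Worker that owns the block containing element (i, j).
        int owner(int64_t i, int64_t j) const {
            int br = static_cast<int>((i / block_rows) % grid_rows);
            int bc = static_cast<int>((j / block_cols) % grid_cols);
            return br * grid_cols + bc;   // row-major worker numbering
        }
    };
    // e.g. BlockLayout{1024, 1024, 2, 2}.owner(3000, 1500) == 1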
Data Layout and Algorithm Efficiency

[Figure: five examples of the same operations under different block layouts (two additions, three further operations); only the efficiency annotations survive]
• Inefficient: off-diagonal blocks require communication between GPUs
• Efficient: no communication between GPUs
• Efficient: no communication between GPUs
• Less efficient, due to memory access patterns
• Very inefficient: requires temporary buffers and communication between GPUs
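To make the "no communication" cases concrete, a minimal sketch, assuming the operands already share the same layout and C has been locally initialized from B; each worker updates only the blocks it owns with local cuBLAS calls:

    #include <cublas_v2.h>

    // With identical layouts, C = A + B is purely per-worker: no inter-GPU traffic.
    // my_a[i], my_c[i]: device pointers to the i-th block this worker owns; len[i] its length.
    void local_add(cublasHandle_t h, int nblocks,
                   float* const* my_a, float* const* my_c, const int* len) {
        const float one = 1.0f;
        for (int i = 0; i < nblocks; ++i)
            cublasSaxpy(h, len[i], &one, my_a[i], 1, my_c[i], 1);  // C += A, block by block
    }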
Matrix Multiplication
• One of the most important algorithms in DL
• Probably the most difficult to optimize
• Most implementations are built on cuBLAS routines for the individual blocks
• At the sizes typical of deep learning applications, communication between GPUs and servers is the bottleneck
• Multiple implementations are needed for different circumstances
• We have over 15 different distributed GEMM implementations that are used for various types of matrix dimensions and h/w configurations
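A hedged sketch of what such a dispatcher could look like; the variant names, thresholds, and selection criteria are invented for illustration (one gestures at the well-known SUMMA family) and are not dMath's actual rules:

    // Hypothetical selection among distributed-GEMM variants.
    enum class GemmImpl { LocalReplicated, GatherToOne, PcieRing, BlockedSumma };

    GemmImpl pick_gemm(long m, long n, long k,
                       bool b_is_replicated, bool has_pcie_switch) {
        if (b_is_replicated)        return GemmImpl::LocalReplicated; // no communication needed
        if (m * n * k < (1L << 27)) return GemmImpl::GatherToOne;     // comm dominates: multiply on one GPU
        if (has_pcie_switch)        return GemmImpl::PcieRing;        // exploit P2P through the PCIe switch
        return GemmImpl::BlockedSumma;                                // general blocked algorithm
    }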
Matrix Multiplication Scaling
• Smaller blocks don't scale, due to communication overhead and cuBLAS GEMM efficiency
• As matrices get larger, communication becomes less of an issue compared to computation, and scaling is less difficult
• Shown below is the relative performance for the same multiplication, distributed over a different number of GPUs within our framework

[Chart: relative "badness" (y-axis, 0-15) for the same multiplication on 1, 2, 4, and 8 GPUs]
Matrix Multiplication Performance
• Shown below are runtimes for square matrix multiplications of various sizes
• dMath performance on a variety of GPUs is compared with a cuBLAS 7.5 baseline and cuBLAS-XT

[Table of runtimes omitted; * marks a Tesla K80 result]
Replication Motivation
• Often desirable for workers to have identical copies of a matrix (abundant memory and rarely changing data)
• Useful for parameters in a network
• Equivalent to the distribution of parameters in a parameter server
• Caching (using replication) of a distributed weight matrix: the forward pass caches it, so the backward pass requires no communication
Replication Motivation (AlexNet Example)

[Figure: AlexNet topology, highlighting its three fully connected layers]
• The matrix multiplication in the three fully connected layers requires the weight matrix to be shared between the workers in the forward pass
• We cache this matrix, allowing the backward data gradient to be computed without any communication between GPUs (see the sketch below)
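A minimal sketch of that backward step, assuming the replica of W has already been cached on every GPU during the forward pass; names and shapes are illustrative, using cuBLAS's column-major convention:

    #include <cublas_v2.h>

    // Hypothetical FC backward-data step: each worker computes its slice of dX
    // from the cached replica of W with one local GEMM and zero communication.
    // Column-major shapes: dY_local (batch_local x out_dim),
    // W_cached (in_dim x out_dim), dX_local (batch_local x in_dim); dX = dY * W^T.
    void fc_backward_data(cublasHandle_t h, int batch_local, int in_dim, int out_dim,
                          const float* dY_local, const float* W_cached, float* dX_local) {
        const float one = 1.0f, zero = 0.0f;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_T,
                    batch_local, in_dim, out_dim,
                    &one, dY_local, batch_local,
                    W_cached, in_dim,
                    &zero, dX_local, batch_local);
    }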
Replication Implementation
• Each worker stores an entire copy of the matrix, in addition to the blocks it is responsible for
• When a matrix is changed, the workers automatically redistribute it
• Replication can be temporarily suspended while multiple updates are made to a matrix, then resumed (see the sketch below)
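A minimal usage sketch of the suspend/resume idea; the type and method names are invented for illustration and are not dMath's actual API:

    // Hypothetical handle exposing the replication controls described above.
    struct ReplicatedMatrix {
        void suspend_replication();   // defer the automatic redistribution
        void resume_replication();    // redistribute once, covering all pending changes
        void scale(float s);
        void add_matrix(float alpha, const ReplicatedMatrix& other);
    };

    void sgd_update(ReplicatedMatrix& history, const ReplicatedMatrix& grad,
                    float momentum, float local_rate) {
        history.suspend_replication();        // avoid redistributing after each step
        history.scale(momentum);              // ...multiple local updates...
        history.add_matrix(local_rate, grad);
        history.resume_replication();         // single redistribution at the end
    }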
Asynchronous Replication
• In SGD training, parameters are updated all at once and then replicated
• Updated parameters are often not needed until much deeper into the network
• Replication is frequently followed by forward convolution layers (high computation, zero communication)
• So replace synchronous replication with asynchronous replication, overlapping parameter redistribution with computation (sketched below)
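A minimal sketch of the overlap, assuming a CUDA-aware MPI so device buffers can be broadcast directly; the convolution entry point is a stand-in:

    #include <mpi.h>

    void run_forward_convolutions();   // hypothetical: high computation, zero communication

    // Start redistributing the updated weights, compute underneath it, and only
    // wait once the replicated weights are actually needed again.
    void replicate_async_then_compute(float* d_weights, int n, int root, MPI_Comm comm) {
        MPI_Request req;
        MPI_Ibcast(d_weights, n, MPI_FLOAT, root, comm, &req); // non-blocking replication
        run_forward_convolutions();                            // overlaps the broadcast
        MPI_Wait(&req, MPI_STATUS_IGNORE);                     // complete before next use
    }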
Tips and Tricks
• Any serial portion of the program will hurt scaling (Amdahl's law)
• Dispatching jobs to the workers has a small overhead (dozens of microseconds)
• Additional time is spent broadcasting the metadata and setting up the data structures required to perform a distributed job
• For large jobs this overhead is not a problem
• Scaling to more GPUs without increasing the batch size (strong scaling) presents significant issues (small convolutions in GoogLeNet)
• Can be alleviated by avoiding lazy solutions, e.g. collapsing two dispatched jobs into one:

    // Before: two dispatched jobs
    history_data->scale(momentum);
    history_data->addMatrix(local_rate, *param_diff);

    // After: one fused job
    history_data->addMatrix(momentum, *history_data, local_rate, *param_diff);
Tips and Tricks
• Combine common operations together into a single job
• Backward convolutions involve computing data, filter, and bias gradients; these can be done all at once
• Allows for multiple CUDA streams to be used
• Overlapping the filter and bias gradient reductions with the data gradient computation helps optimize performance
• Avoid multiple MPI calls when possible; instead copy the data into a single buffer and do a single call (e.g. filter and bias gradient reductions; see the sketch after this list)
• Implement batched versions of jobs (e.g. the matrix addition used in the parameter update)
• For the NVIDIA Tesla K80, we have found much better performance by locking the clocks at 758 MHz and disabling auto-boost, e.g. ~5-10% for multi-GPU jobs
• Registering memory with the IB driver is costly; try to reuse your buffers to prevent this costly registration, i.e. use a memory manager of some sort
Real World: soumith/convnet-benchmarks -- 128 Batch, i.e. Strong Scaling
• Expresso: 121.8 ms average on one Titan
• Expresso: 64.5 ms average on two Titans
• Expresso: 48.6 ms average on four Titans
• Expresso: 39.1 ms average on eight Titans
Real World: Training AlexNet -- 256 Batch, i.e. Strong Scaling

[Chart: throughput in FPS (0-2500) for Expresso, CNTK, Caffe, and NVIDIA-Caffe (0.14) at 1, 4, 8, 16, 32, and 64 GPUs]
Real World: Inception v3, 32 Batch per GPU

[Chart: Inception v3 throughput in FPS (0-1400) for Expresso and TensorFlow at 2, 4, 8, 16, 32, and 64 GPUs]
Real World: Training Time

AlexNet, 1024 batch:
• Caffe -- 8 K80s -- 17h 20min
• NV-Caffe 0.14 -- 8 K80s -- 12h 15min
• Expresso 0.8 -- 8 K80s -- 05h 15min

GoogLeNet v1, 1024 batch:
• Caffe -- 8 K80s -- 03d 20hrs
• Expresso -- 8 K80s -- 02d 12hrs
• Expresso -- 32 K80s -- 01d 06hrs

Scaling the batch from a single GPU to 64 GPUs results in the same accuracy, e.g. AlexNet 256: 58.59 ± 0.5% top-1.

Expresso results use our version of hybrid parallelism [Krizhevsky14].
Real World: Training Time – continued…

GoogLeNet v3 (aka Inception v3), Tesla K40 GPUs:
• TensorFlow -- 16, 32, 64 GPUs -- 507h 58min
• TensorFlow Internal -- 100 GPUs -- 65h 00min
• Expresso v0.8 -- 64 GPUs -- 48h 22min

Tesla M40 GPUs:
• Expresso v0.8 -- 96 GPUs -- 13h 45min

ImageNet 2016 Scene Classification competition:
- Third place in the single-model category
- 7th place in the ensemble category
Data Loading & Augmentation
• Data loading quickly becomes a bottleneck when scaling to multiple GPUs
• Not practical to load from a single process; queues, etc. don't work well
• An automatically tuned, multi-threaded CPU/GPU data loader runs on each worker
• Periodically (0.1% of the time) a check is done to determine whether the current policy is adequate… remembering multi-tenancy
• No policy is good for all models, as the proportion of work differs from model to model
Data Loading & Augmentation
• At runtime, the master process samples performance for different combinations of thread counts and transfer indices
• This allows the system to accelerate data augmentation when it is the bottleneck

Memory Management:
• Page-locked host allocations and device allocations cause implicit synchronization and are expensive
• Host & device buffers are allocated by each thread on start-up and only resized when a sample exceeds the current buffer size; the memory footprint quickly stabilizes (see the sketch below)
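A minimal sketch of that grow-only buffer policy; the type and names are invented for illustration:

    #include <cuda_runtime.h>
    #include <cstddef>

    // Grow-only page-locked buffer: the expensive, implicitly synchronizing
    // allocation happens only when a sample exceeds the current capacity, so
    // the footprint stabilizes after the first few batches.
    struct PinnedBuffer {
        void*  ptr = nullptr;
        size_t cap = 0;
        void* ensure(size_t bytes) {
            if (bytes > cap) {
                if (ptr) cudaFreeHost(ptr);
                cudaMallocHost(&ptr, bytes);   // page-locked host allocation
                cap = bytes;
            }
            return ptr;                        // reused for every smaller sample
        }
        ~PinnedBuffer() { if (ptr) cudaFreeHost(ptr); }
    };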
Data Loading & Augmentation

[Chart: AlexNet performance on 8 M40s with single-threaded, multi-threaded, and auto-tuned data loading]
Low Precision Training & Inference

Motivations:
• Lower data storage and communication
• Better performance with newer GPUs (Pascal)
• Proven to work great for inference – fast, accurate

Challenges:
• Instability in certain layers during training
• Partial support by the software stack (yet)

Addressing Accuracy Loss:
• Perform certain operations in higher precision (see the sketch below)
  • Softmax, inner product, weight updates
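A minimal sketch of a higher-precision weight update, assuming an fp32 master copy is kept alongside the fp16 storage; shown as host code for clarity, though in practice this would be a CUDA kernel:

    #include <cuda_fp16.h>
    #include <cstddef>

    // Gradients and stored weights are fp16, but the update itself accumulates
    // into an fp32 master copy to avoid the instability noted above.
    void update_weights(float* master, __half* weights, const __half* grad,
                        size_t n, float lr) {
        for (size_t i = 0; i < n; ++i) {
            master[i]  -= lr * __half2float(grad[i]);  // update in fp32
            weights[i]  = __float2half(master[i]);     // publish back in fp16
        }
    }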
H/W: Samsung Advanced Learning
Conclusion
- dMath provides effective scaling to DL frameworks: Caffe, Kaldi
- DNN pipeline with a data loader and an array of features
- Can also be used in other applications like medical imaging, finite element analysis, etc.

Future Work
- Half-precision training
- Layer implementations for RNNs, etc.
Acknowledgments
….
Q & A
?