Thorsten Kurth
Optimizing Code for Intel Xeon Phi 7250 (Knights Landing)
NUG Training, 06/09/2017
Multicore vs. manycore
• multicore (Edison)
‣ 5000 nodes
‣ 12 physical cores/CPU
‣ 24 HW threads/CPU
‣ 2.4-3.2 GHz
‣ 4 DP ops/cycle
‣ 30 MB L3 cache
‣ 64 GB/node
‣ 100 GB/s memory bandwidth
• manycore (Cori-KNL)
‣ 9600 nodes
‣ 68 physical cores/CPU
‣ 272 HW threads/CPU
‣ 1.2-1.6 GHz
‣ 2x8 DP ops/cycle
‣ no L3 cache
‣ 16 GB/node (fast), 96 GB/node (slow)
‣ 450 GB/s memory bandwidth (fast)
Recompile and go?
• x86-64 compatible: can use codes for older architectures or recompile
• self-hosted: no need for offloading
• median speedup vs. Edison: 1.15x
• median speedup vs. Haswell: 0.70x
Why should I optimize my code?
• pros
‣ get more for your buck: making efficient use of existing manycore HPC systems
‣ fast success possible: much low-hanging fruit in unoptimized codes
‣ investing in the future: heterogeneous architectures are energy efficient and thus will stay around for a while
‣ benefits on multicore: optimizations targeting manycore architectures mostly improve performance on multicore systems as well
• cons
‣ effort: many of the most beneficial optimizations require significant code changes
‣ investing in the future: what if I bet on the wrong horse?
Optimization targets
• single node performance
‣ start here: for a representative local problem size, single node performance is an upper bound of what you get in multi-node
‣ many optimization opportunities, fast turnaround times
‣ many profiling tools available
• multi-node performance
‣ fewer optimization opportunities, profiling/debugging tedious
• IO performance
‣ not many opportunities for improvement
Where do I start?
• get to know your application: don’t assume you already do!
• determine hotspots
‣ manual timers: be careful with thread safety/sync barriers
‣ profiling tools: NERSC offers a lot
  ‣ CrayPat (very lightweight)
  ‣ Advisor (find time-consuming loops)
  ‣ VTune (can do a lot of things but also very slow)
  ‣ MAP (comparably lightweight)
• found hotspots, now what?
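A minimal sketch of a manual timer, assuming C++11 `std::chrono`. The kernel `sum_of_squares` and the helper `time_kernel` are hypothetical names chosen for illustration. The timer is started and stopped outside the parallel loop, so all threads have passed the implicit barrier at the end of the loop before the clock is read.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical hotspot candidate: sum of squares over a vector.
// The pragma is ignored if the code is built without OpenMP support.
double sum_of_squares(const std::vector<double>& x) {
    const long n = static_cast<long>(x.size());
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s)
    for (long i = 0; i < n; ++i) s += x[i] * x[i];
    return s;
}

// Calls the kernel `repeats` times and returns the average wall-clock
// seconds per call. The clock is read only by the calling thread, outside
// the parallel region, so no extra synchronization is needed.
double time_kernel(const std::vector<double>& x, int repeats, double& checksum) {
    assert(repeats > 0);
    typedef std::chrono::steady_clock clock;
    checksum = 0.0;
    const clock::time_point start = clock::now();
    for (int r = 0; r < repeats; ++r) checksum += sum_of_squares(x);
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / repeats;
}
```

Returning a checksum keeps the compiler from discarding the timed work. For very short kernels the timer overhead itself can dominate, so a larger repeat count is advisable.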
What architectural feature shall I target?
• KNL has many new features to explore
‣ many threads
‣ bigger vector units
‣ complex intrinsics (ISA)
‣ multiple memory tiers
• understand your hotspots
‣ compute bound: more threads, vectorization, ISA
‣ memory BW bound: memory tiers, more threads
‣ memory latency bound: more threads, vectorization
Prerequisites - compile and run
• recompile your code for KNL: code for older CPUs is supported but does not make full use of the new architecture
‣ Cray (wrappers): module swap craype-haswell craype-mic-knl
‣ Intel: -xMIC-AVX512
‣ GNU: -march=knl
• use proper OpenMP settings:
export OMP_NUM_THREADS=64
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
• use the job script generator on my.nersc.gov or the NERSC website
• node configuration: use -C knl,quad,cache as a start
Prerequisites - #FLOPS
• #FLOPS: number of floating point operations
• manual calculation:
‣ float addition and multiplication: +1 each
‣ complex multiplication: +6 (4 multiplications + 2 additions)
‣ etc.
• measure with SDE:
‣ using SDE is more precise, because it accounts for masking
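The manual counting rules can be applied to a small kernel as in the sketch below. The kernel `caxpy` and the helper `caxpy_flops` are hypothetical. The weight of 2 for a complex addition is derived here from the "+1 per float addition" rule (one addition each for the real and imaginary part) and is not stated on the slide.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> cplx;

const long kFlopsComplexMul = 6;  // 4 multiplications + 2 additions
const long kFlopsComplexAdd = 2;  // assumption: 1 addition each for real and imaginary part

// Hypothetical kernel: y[i] += a * x[i] on complex numbers.
void caxpy(cplx a, const std::vector<cplx>& x, std::vector<cplx>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += a * x[i];
}

// Manual FLOP count for one call of caxpy on n elements:
// one complex multiplication and one complex addition per element.
long caxpy_flops(std::size_t n) {
    return static_cast<long>(n) * (kFlopsComplexMul + kFlopsComplexAdd);
}
```

A count like this ignores masked vector lanes and any operations the compiler adds or removes, which is why the measured value can differ.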
Prerequisites - #BYTES
• #BYTES: number of bytes transferred from main memory
• manual calculation (not recommended, but good check):
‣ count the bytes of data to be read and written in the kernel
‣ does not account for data reuse through caching
• measure with VTune:
‣ precisely obtain uncore counter events
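As a rough illustration of the manual byte count, the sketch below uses a hypothetical STREAM-triad-like kernel. The helper names are made up for this example. The count assumes every array element travels to or from main memory exactly once, which is precisely the simplification the bullets above warn about.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: a[i] = b[i] + s * c[i].
void triad(std::vector<double>& a, const std::vector<double>& b,
           const std::vector<double>& c, double s) {
    assert(a.size() == b.size() && b.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = b[i] + s * c[i];
}

// Manual byte count per call: read b and c, write a.
// Ignores cache reuse and any extra traffic caused by the write to a.
long triad_bytes(std::size_t n) {
    return static_cast<long>(n * 3 * sizeof(double));
}

// Manual FLOP count per call: one multiplication and one addition per element.
long triad_flops(std::size_t n) {
    return static_cast<long>(n) * 2;
}
```

For double precision this gives 2 FLOPs per 24 bytes, a useful cross-check against the measured counters.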
What is limiting my performance?
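• Roofline performance model
‣ arithmetic intensity: $\mathrm{AI} = \#\mathrm{FLOPS} / \#\mathrm{BYTES}$
‣ performance: $P = \#\mathrm{FLOPS} / \mathrm{time\,[s]}$
‣ plot $P$ vs. $\mathrm{AI}$ with architectural roofline $R$: $R(\mathrm{AI}) = \min(\text{memory bw} \cdot \mathrm{AI},\ \text{peak flops})$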
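The roofline formula translates directly into code. The following is a small sketch with hypothetical function names. Bandwidth is taken in GB/s and peak in GFLOP/s, so the result is in GFLOP/s.

```cpp
#include <algorithm>
#include <cassert>

// AI = #FLOPS / #BYTES, in FLOP per byte.
double arithmetic_intensity(double flops, double bytes) {
    assert(bytes > 0.0);
    return flops / bytes;
}

// R(AI) = min(memory bw * AI, peak flops).
double roofline(double ai, double mem_bw_gbs, double peak_gflops) {
    return std::min(mem_bw_gbs * ai, peak_gflops);
}

// True if the bandwidth roof, not the compute roof, limits performance at this AI.
bool is_bandwidth_bound(double ai, double mem_bw_gbs, double peak_gflops) {
    return mem_bw_gbs * ai < peak_gflops;
}
```

For example, `roofline(0.25, 450.0, 2000.0)` evaluates to 112.5. The 450 GB/s is the fast-memory bandwidth quoted earlier in this deck, while the peak of 2000 GFLOP/s is only a placeholder and not a quoted or measured KNL value.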
Example roofline
[Figure: example roofline plot, annotated step by step. Labels in order of appearance: memory bandwidth bound (use MCDRAM); compute bound (utilize vectorization); (possibly) memory latency bound (threading, vectorization); improve AI.]
How to improve AI?
• definition of arithmetic intensity: $\mathrm{AI} = \#\mathrm{FLOPS} / \#\mathrm{BYTES}$
• two possibilities
‣ number of flops ⬆, number of bytes ➡ (not possible/easy, choice of algorithm determines flops)
‣ number of flops ➡, number of bytes ⬇
• reality: tradeoff between both
Create more work/thread
• loop/kernel fusion: improves cache re-use and reduces overhead
• collapse nested loops
• rearrange data structures: move OpenMP out (coarse grain)
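The first two items can be sketched as follows. All function names are hypothetical. The fused version avoids writing an intermediate array to memory and reading it back, and the `collapse(2)` clause merges both loop levels into one iteration space so that each thread gets more work. The pragma is ignored when compiling without OpenMP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two loops; tmp is written to memory and read back.
void scale_add_unfused(double s, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    std::vector<double> tmp(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) tmp[i] = s * x[i];
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += tmp[i];
}

// Fused: one loop, no intermediate array.
void scale_add_fused(double s, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += s * x[i];
}

// Collapsed nested loops over a row-major rows x cols grid.
void scale_grid(std::vector<double>& grid, long rows, long cols, double s) {
    assert(rows >= 0 && cols >= 0);
    assert(grid.size() == static_cast<std::size_t>(rows * cols));
    #pragma omp parallel for collapse(2)
    for (long i = 0; i < rows; ++i)
        for (long j = 0; j < cols; ++j)
            grid[i * cols + j] *= s;
}
```

Collapsing helps most when the outer loop alone has fewer iterations than there are threads.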
Loop transformations I
• loop tiling: improves cache re-use and can significantly improve performance
‣ especially relevant on KNL because of missing L3
‣ blocking to shared L2 (512 KiB) usually good
• was my transformation successful? check L1, L2 miss rates, e.g. in VTune
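One common instance of loop tiling is a blocked matrix transpose, sketched below under the assumption of a square, row-major matrix. The function name and the tile parameter are hypothetical. A tile edge of 128 doubles keeps one input and one output tile at 2 x 128 x 128 x 8 bytes = 256 KiB, which fits into the 512 KiB mentioned above. Treat that as a starting guess to be tuned, not a recommendation from the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tiled transpose of a row-major n x n matrix: out[j*n + i] = in[i*n + j].
// The two outer loops walk over tiles, the two inner loops stay inside one tile,
// so the strided accesses to `out` touch only a cache-sized block at a time.
void transpose_tiled(const std::vector<double>& in, std::vector<double>& out,
                     std::size_t n, std::size_t tile) {
    assert(tile > 0);
    assert(in.size() == n * n && out.size() == n * n);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t jj = 0; jj < n; jj += tile) {
            const std::size_t i_end = std::min(ii + tile, n);
            const std::size_t j_end = std::min(jj + tile, n);
            for (std::size_t i = ii; i < i_end; ++i)
                for (std::size_t j = jj; j < j_end; ++j)
                    out[j * n + i] = in[i * n + j];
        }
    }
}
```

The `std::min` bounds handle matrix sizes that are not a multiple of the tile size.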
Loop transformations II
• short loop unrolling: helps the compiler vectorize the right loops
‣ unrolling pragmas are helpful too
• check compiler optimization reports
• use Intel Advisor
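A sketch of short loop unrolling, using a hypothetical kernel that computes the squared norm of three-component vectors stored as separate arrays. In the rolled version the innermost loop has only three iterations. Writing it out by hand leaves the long loop over `i` as the only candidate for vectorization. Whether a given compiler actually vectorizes differently here is an assumption to verify with the optimization report.

```cpp
#include <cassert>

// Rolled: the short inner loop over the 3 components is the innermost loop.
void norm2_rolled(const double* x, const double* y, const double* z, double* out, long n) {
    assert(n >= 0);
    const double* comp[3] = {x, y, z};
    for (long i = 0; i < n; ++i) {
        double s = 0.0;
        for (int c = 0; c < 3; ++c) s += comp[c][i] * comp[c][i];
        out[i] = s;
    }
}

// Unrolled by hand: only the long loop over i remains.
void norm2_unrolled(const double* x, const double* y, const double* z, double* out, long n) {
    assert(n >= 0);
    for (long i = 0; i < n; ++i)
        out[i] = x[i] * x[i] + y[i] * y[i] + z[i] * z[i];
}
```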
Data alignment
• align (and pad) data to 64-byte boundaries to improve prefetching
• can be done easily in major programming languages
‣ FORTRAN: -align array64byte (ifort; gfortran does it automagically)
‣ C/C++: aligned_alloc(64, <size>), __attribute__ ((aligned(64))), __declspec(align(64))
‣ C++ trick: overload new operator
• advanced: manually pad data if array extents are powers of 2 to minimize cache associativity conflicts
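A minimal sketch of 64-byte alignment in C++, assuming a platform whose `<stdlib.h>` provides C11 `aligned_alloc`. The helper names are hypothetical. `aligned_alloc` expects the size to be a multiple of the alignment, so the byte count is padded first. `alignas(64)` is used here as the standard C++11 spelling of the compiler-specific attributes listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>

const std::size_t kAlignment = 64;  // bytes

// Rounds a byte count up to the next multiple of 64.
std::size_t padded_bytes(std::size_t bytes) {
    return (bytes + kAlignment - 1) / kAlignment * kAlignment;
}

// Allocates n doubles on a 64-byte boundary; release with free().
double* alloc_aligned_doubles(std::size_t n) {
    assert(n > 0);
    return static_cast<double*>(aligned_alloc(kAlignment, padded_bytes(n * sizeof(double))));
}

// Checks whether a pointer sits on a 64-byte boundary.
bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlignment == 0;
}

// Static or stack data: request the alignment on the type.
struct alignas(64) AlignedBlock {
    double v[8];
};
```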
Make use of ISA
• help the compiler to generate efficient intrinsics
• runtime example for app with kernel:
‣ original kernel, if condition inside loop: 1.2 sec
‣ modified kernel: 0.8 sec (1.5x speedup)
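The code listing from this slide is not available here, so the following is only a guess at the kind of change involved, not the author's kernel. It shows one common way to get rid of an if condition inside a loop: when the condition does not depend on the loop index, it can be hoisted out, leaving two branch-free loops. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before: the branch is evaluated in every iteration although it never changes.
void scale_branch_inside(std::vector<double>& y, const std::vector<double>& x,
                         double a, double offset, bool use_offset) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (use_offset) y[i] = a * x[i] + offset;
        else            y[i] = a * x[i];
    }
}

// After: the branch is taken once, each loop body is straight-line code.
void scale_branch_hoisted(std::vector<double>& y, const std::vector<double>& x,
                          double a, double offset, bool use_offset) {
    assert(x.size() == y.size());
    if (use_offset) {
        for (std::size_t i = 0; i < x.size(); ++i) y[i] = a * x[i] + offset;
    } else {
        for (std::size_t i = 0; i < x.size(); ++i) y[i] = a * x[i];
    }
}
```

Compilers often perform this transformation on their own, so the optimization report is the place to check whether the manual version changes anything.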
Reduced precision math
• transcendental functions, square roots, etc. are expensive
• use -fp-model fast=2 -no-prec-div during compilation
• replace divisions by constants with multiplications by the inverse
• do not expect too much: benefits usually only visible in heavily compute-bound code sections
• reduced precision might not always be acceptable
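Replacing a division by a constant can be sketched as below, with hypothetical function names. The inverse is computed once outside the loop. Note that `x * (1.0 / c)` and `x / c` can differ in the last bit, which is the precision tradeoff mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before: one division per element.
void normalize_div(std::vector<double>& x, double norm) {
    assert(norm != 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) x[i] /= norm;
}

// After: one division in total, one multiplication per element.
void normalize_mul(std::vector<double>& x, double norm) {
    assert(norm != 0.0);
    const double inv = 1.0 / norm;
    for (std::size_t i = 0; i < x.size(); ++i) x[i] *= inv;
}
```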
Benefits of AVX-512
• median speedup: 1.2x
• benefits can be larger than 2x (probably more efficient prefetching)
• automatically enabled when compiling for KNL architecture
Use MCDRAM
• always use the 16 GiB on-package memory (MCDRAM)
‣ cache works well: request with -C knl,cache
‣ code fits into 16 GiB: request -C knl,flat and prepend executable with numactl -m 1
A note on heap allocation
• KNL memory allocation is comparatively slow
• avoid allocating and de-allocating memory frequently
‣ remove allocations/deallocations in loop bodies or functions which are called many times
‣ too involved? pool allocator libraries (e.g. Intel TBB scalable memory pools)
• pros:
‣ overloads new/malloc, no/minimal source code changes necessary
‣ can give significant performance boost for certain codes
‣ takes care of thread safety
• cons:
‣ memory footprint needs to be known/computed in advance
‣ code might become less portable
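Removing an allocation from a frequently called function can look like the sketch below. This is the manual approach, not the pool allocator library route. The kernel and driver names are hypothetical. The caller owns the scratch buffer, so the kernel stops allocating once the buffer has the required capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel that needs a scratch array as long as x.
// resize() does not reallocate when the capacity is already sufficient.
double pair_average_sum(const std::vector<double>& x, std::vector<double>& scratch) {
    const std::size_t n = x.size();
    scratch.resize(n);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        scratch[i] = 0.5 * (x[i] + x[(i + 1) % n]);
        sum += scratch[i];
    }
    return sum;
}

// Driver: a single allocation outside the loop, then many kernel calls.
double run_many(const std::vector<double>& x, int calls) {
    assert(calls > 0);
    std::vector<double> scratch;
    scratch.reserve(x.size());
    double total = 0.0;
    for (int c = 0; c < calls; ++c) total += pair_average_sum(x, scratch);
    return total;
}
```

In threaded code each thread needs its own scratch buffer, for example one declared inside the parallel region but outside the loop.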
Multi-node optimizations
• single KNL thread cannot saturate Aries injection rate
‣ use thread-level communication or multiple MPI ranks per node
‣ recommended: >4 ranks per node
• dedicate cores to OS: -S <ncores> in sbatch (ncores=2 good choice)
Hugepages, DMAPP and hardware AMO
• hugepages can reduce Aries TLB misses
‣ load corresponding module at compile and runtime
‣ can use different modules at compile and runtime
• MPI-collective-heavy codes: enable DMAPP (add -ldmapp)
export MPICH_RMA_OVER_DMAPP=1
export MPICH_USE_DMAPP_COLL=1
export MPICH_NETWORK_BUFFER_COLL_OPT=1
• enable hardware AMO for MPI-3 RMA atomics
export MPICH_RMA_USE_NETWORK_AMO=1
Some notes on IO
[Figure: Single Core I/O Performance on Cori. Write bandwidth (MB/s) for HSW and KNL, and relative performance (KNL/HSW), for Buffered I/O, Sync I/O and Direct I/O.]
Use multiple processes
• use more processes (e.g. with MPI IO)
• unfortunately, no good threaded IO solutions available yet
• always: pool (write big chunks), reduce file operations (open, close)
• large files: burst buffer
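The "pool, write big chunks" advice can be sketched for a single process as below. The class `PooledWriter` and the helper `count_writes` are hypothetical, and an in-memory stream stands in for a file that is opened once. Small records are collected in memory and handed to the stream only when a chunk is full.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Collects small records and writes them to the stream in big chunks.
class PooledWriter {
public:
    PooledWriter(std::ostream& out, std::size_t chunk_bytes)
        : out_(out), chunk_bytes_(chunk_bytes), writes_(0) {
        assert(chunk_bytes > 0);
        buffer_.reserve(chunk_bytes);
    }
    ~PooledWriter() { flush(); }

    void append(const std::string& record) {
        buffer_ += record;
        if (buffer_.size() >= chunk_bytes_) flush();
    }

    // Writes whatever is buffered as a single operation.
    void flush() {
        if (buffer_.empty()) return;
        out_.write(buffer_.data(), static_cast<std::streamsize>(buffer_.size()));
        buffer_.clear();
        ++writes_;
    }

    std::size_t writes() const { return writes_; }

private:
    std::ostream& out_;
    std::size_t chunk_bytes_;
    std::size_t writes_;
    std::string buffer_;
};

// Sends n_records records of 11 bytes each through the pool and returns
// how many write operations reached the sink.
std::size_t count_writes(std::size_t n_records, std::size_t chunk_bytes, std::string& written) {
    std::ostringstream sink;  // stands in for a file opened once
    PooledWriter writer(sink, chunk_bytes);
    for (std::size_t i = 0; i < n_records; ++i) writer.append("0123456789\n");
    writer.flush();
    written = sink.str();
    return writer.writes();
}
```

With 100 records and a 256-byte chunk, the 1100 bytes reach the sink in 5 writes instead of 100. Real chunk sizes would be far larger and depend on the file system.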
Does it help?
• recompile only:
‣ median speedup vs. Edison: 1.15x
‣ median speedup vs. Haswell: 0.70x
• after optimization:
‣ median speedup vs. Edison: 1.8x
‣ median speedup vs. Haswell: 1.0x
Summary
• single node performance (go for that one first)
‣ loop fusion and tiling
‣ ensure good vectorization
‣ use MCDRAM
• multi-node performance
‣ hugepages
‣ DMAPP
• IO performance
‣ use multiple nodes, pool IO, reduce file operations to minimum
NERSC training material
• running jobs
• process/thread binding
• code profiling and tools
• measuring arithmetic intensity (AI)
• improving OpenMP scaling
• vectorization help
• how to use MCDRAM
• NESAP case studies
Thank you